# Primer on stochastic convergence

## Types of convergence

### Convergence in distribution

Let $(\distrib{F}_n)_{n \geq 1}$ be a sequence of distribution functions and let $\distrib{G}$ be a distribution function with the same domain. Let $\mathcal{C}(\distrib{G})$ be the set of continuity points of $\distrib{G}$. We say that $(\distrib{F}_n)_n$ converges in distribution to $\distrib{G}$ when, for every continuity point $\vobs{y} \in \mathcal{C}(\distrib{G})$:

$$\lim_{n \to \infty} \distrib{F}_n(\vobs{y}) = \distrib{G}(\vobs{y})$$

Relation to functional analysis: convergence in distribution is pointwise convergence of the distribution functions on the set of continuity points of the limit.

By abuse of notation, we extend this definition to sequences of random variables/vectors $(\rve{Y}_n)_{n \geq 1}$: we write $\rve{Y}_n \dconv \rve{Y}$ when $\distrib{F}_{\rve{Y}_n}(\vobs{y}) \to \distrib{F}_{\rve{Y}}(\vobs{y})$ at every continuity point $\vobs{y}$ of $\distrib{F}_{\rve{Y}}$.
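As a quick numerical illustration (a minimal sketch using NumPy; the example distribution is an assumption chosen for illustration, not taken from the text above), let $\rva{Y}_n$ be uniform on $\{1/n, 2/n, \dotsc, 1\}$. Its CDF is $\distrib{F}_n(y) = \lfloor ny \rfloor / n$, which converges pointwise to the Uniform$(0,1)$ CDF $\distrib{G}(y) = y$:

```python
import numpy as np

# Example (assumed, for illustration): Y_n uniform on {1/n, ..., n/n} has
# CDF F_n(y) = floor(n*y)/n and converges in distribution to Uniform(0, 1),
# whose CDF is G(y) = y.  Check pointwise convergence at a few points.
def F_n(y, n):
    return np.floor(n * y) / n

ys = np.array([0.1, 0.25, 0.5, 0.9])
errors = {n: np.max(np.abs(F_n(ys, n) - ys)) for n in (10, 100, 10_000)}
for n, err in errors.items():
    print(n, err)  # the error shrinks like 1/n
```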

### Convergence in probability

Let $(\rve{Y}_n)_{n \geq 1}$ be a sequence of random vectors. We say that it converges in probability to the random vector $\rve{Y}$ when, for all $\epsilon > 0$:

$$\lim_{n \to \infty} \mathbb{P}\paren{\lVert \rve{Y}_n - \rve{Y} \rVert > \epsilon} = 0$$

Since a random variable $\rva{Y}$ is a random vector of dimension $1$, for random variables the condition is written:

$$\lim_{n \to \infty} \mathbb{P}\paren{\lvert \rva{Y}_n - \rva{Y} \rvert > \epsilon} = 0$$
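A small simulation sketch (assuming NumPy; the construction $\rva{Y}_n = \rva{Y} + U_n / n$ with bounded noise $U_n$ is an illustrative assumption): the probability of a deviation larger than any fixed $\epsilon$ drops to zero as $n$ grows.

```python
import numpy as np

# Illustrative construction (assumed): Y_n = Y + U_n / n with U_n uniform
# on [-1, 1], so |Y_n - Y| <= 1/n and P(|Y_n - Y| > eps) -> 0 for any eps.
rng = np.random.default_rng(2)
Y = rng.standard_normal(50_000)
eps = 0.05
probs = {}
for n in (10, 100, 1000):
    Yn = Y + rng.uniform(-1, 1, size=Y.shape) / n
    probs[n] = np.mean(np.abs(Yn - Y) > eps)
    print(n, probs[n])  # decreasing to 0
```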

### Difference between p and d convergence

• $d$-convergence relates distribution functions. It says that the probabilistic behavior of the sequence $(\rva{Y}_n)_n$ becomes more and more similar to that of the limit $\rva{Y}$.
• $p$-convergence relates random variables. It says that the actual realisations of $\rva{Y}_n$ approximate those of $\rva{Y}$ with higher and higher probability.

• $p$-convergence implies $d$-convergence.
• $d$-convergence does not imply $p$-convergence.

Example: let $\rva{Z} \distributed \gaussian(0, 1)$. By symmetry of the standard normal distribution, we have:

$$\paren{\frac{1}{n} - \rva{Z}} \dconv \rva{Z} \quad\text{but}\quad \paren{\frac{1}{n} - \rva{Z}} \pconv -\rva{Z} \neq \rva{Z}$$
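A simulation sketch of this example (assuming NumPy): at a large fixed $n$, the law of $\frac{1}{n} - \rva{Z}$ matches that of $\rva{Z}$, yet the realisations stay far from those of $\rva{Z}$.

```python
import numpy as np

# Simulate the example: Yn = 1/n - Z has (almost exactly) the same N(0, 1)
# law as Z, but its realizations are about 2|Z| away from those of Z.
rng = np.random.default_rng(0)
Z = rng.standard_normal(100_000)
n = 1_000
Yn = 1 / n - Z

same_law = abs(Yn.std() - Z.std())   # ~0: identical spread
gap = np.mean(np.abs(Yn - Z) > 0.5)  # stays large: no p-convergence to Z
print(same_law, gap)
```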

There is a partial converse when the limit is a constant $c \in \realset$:

$$\rva{Y}_n \dconv c \implies \rva{Y}_n \pconv c$$

### Bonus: Cramér-Wold Device

As a side note, there is a link between univariate and multivariate $d$-convergence:

Let $(\rve{Y}_n)_{n \geq 1}$ be a sequence of random vectors of $\realvset{d}$ and $\rve{Y}$ a random vector. For any constant vector $\vec{u} \in \realvset{d}$, the random variable $(\vec{u} \cdot \rve{Y}_n)$ is univariate. We have:

$$\rve{Y}_n \dconv \rve{Y} \iff \vec{u} \cdot \rve{Y}_n \dconv \vec{u} \cdot \rve{Y} \text{ for all } \vec{u} \in \realvset{d}$$

## Fundamental convergence theorems

### Law of large numbers

Let $(\rve{Y}_k)_{k\geq 1}$ be a sequence of independent random vectors with $\expectation\brak{\rve{Y}_k} = \vec{\mu}$ and uniformly bounded second moments, $\expectation\brak{\lVert \rve{Y}_k \rVert^2} \leq C < \infty$ for all $k \geq 1$. Then:

$$\frac{1}{n} \sum_{k=1}^{n} \rve{Y}_k \pconv \vec{\mu}$$

Interpretation: since this is $p$-convergence, as the sample size $n$ increases, there is a higher and higher probability that the sample average $\avg\{\rve{Y}_k \mid k \leq n\}$ is a good approximation to the mean $\vec{\mu}$.
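A Monte Carlo sketch of this interpretation (assuming NumPy; the Exponential distribution and the value $\mu = 2.5$ are illustrative assumptions):

```python
import numpy as np

# LLN sketch: the sample mean of i.i.d. Exponential draws with mean mu = 2.5
# concentrates around mu as the sample size n grows.
rng = np.random.default_rng(42)
mu = 2.5
errors = {}
for n in (100, 10_000, 1_000_000):
    sample = rng.exponential(scale=mu, size=n)
    errors[n] = abs(sample.mean() - mu)
    print(n, errors[n])  # deviation from mu shrinks
```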

But what is the uncertainty associated with this approximation? Under slightly stronger assumptions on the sequence, the following theorem provides the answer.

### Central limit theorem

Let $(\rve{Y}_k)_{k \geq 1}$ be an i.i.d. sequence of random vectors with mean $\vec{\mu}$ and covariance matrix $\Omega$. Then:

$$\sqrt{n} \paren{\frac{1}{n} \sum_{k=1}^{n} \rve{Y}_k - \vec{\mu}} \dconv \gaussian(\vec{0}, \Omega)$$

When the dimension is $1$, the covariance matrix reduces to the variance $\sigma^2$ and the theorem reads:

Let $(\rva{Y}_k)_{k \geq 1}$ be an i.i.d. sequence with mean $\mu$ and variance $\sigma^2 < \infty$. Then:

$$\sqrt{n} \paren{\frac{1}{n} \sum_{k=1}^{n} \rva{Y}_k - \mu} \dconv \gaussian(0, \sigma^2)$$

Interpretation: as the sample size $n$ increases, the distribution of the sample average is approximately a normal distribution with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$:

$$\frac{1}{n} \sum_{k=1}^{n} \rva{Y}_k \approx \gaussian\paren{\mu, \frac{\sigma^2}{n}}$$

Notice that the standard deviation shrinks at the speed of $\frac{1}{\sqrt{n}}$.
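A simulation sketch of this statement (assuming NumPy; Exponential$(1)$ draws, so $\mu = \sigma = 1$, are an illustrative assumption): the standardized sample mean $\sqrt{n}\,(\bar{Y}_n - \mu)$ should have a spread close to $\sigma$ and roughly normal tail probabilities.

```python
import numpy as np

# CLT sketch: standardize the sample mean of Exponential(1) draws
# (mu = sigma = 1) and compare it to a standard normal.
rng = np.random.default_rng(1)
n, reps = 1_000, 5_000
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
standardized = np.sqrt(n) * (means - 1.0)

spread = standardized.std()                      # close to sigma = 1
coverage = np.mean(np.abs(standardized) < 1.96)  # close to 0.95
print(spread, coverage)
```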

### Weighted sum central limit theorem

A more general version of the CLT is often useful when combined with the tools presented in the next section.

Let $(\rva{Y}_k)_{k \geq 1}$ be an i.i.d. sequence of real random variables with common mean $\expectation\brak{\rva{Y}_k} = 0$ and variance $\sigma^2 = 1$. Let $(a_i)_{i \geq 1}$ be a sequence of real constants.

Use the following notations: $\vec{a}_{n} = (a_1, \dotsc, a_n)$, and $\rve{Y}_{n} = (\rva{Y}_1, \dotsc, \rva{Y}_n)$.

If, in the limit, any single component contributes a negligible proportion of the total variance, i.e.:

$$\lim_{n \to \infty} \max_{1 \leq i \leq n} \frac{a_i^2}{\lVert \vec{a}_n \rVert^2} = 0$$

Then:

$$\frac{\vec{a}_n \cdot \rve{Y}_n}{\lVert \vec{a}_n \rVert} \dconv \gaussian(0, 1)$$

Setting $a_i = 1$ for all $i$ gives $\lVert \vec{a}_n \rVert = \sqrt{n}$ and yields the previous univariate central limit theorem (with $\mu = 0$ and $\sigma^2 = 1$).
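A simulation sketch (assuming NumPy; the weights $a_i = i$ and Rademacher draws are illustrative assumptions): with $a_i = i$, $\max_i a_i^2 / \lVert \vec{a}_n \rVert^2 \approx 3/n \to 0$, so the negligibility condition holds and the normalized weighted sum should be approximately $\gaussian(0, 1)$.

```python
import numpy as np

# Weighted CLT sketch with weights a_i = i (negligibility: max a_i^2 / ||a||^2
# ~ 3/n -> 0).  Y_i are Rademacher (+/-1 with prob 1/2): mean 0, variance 1.
rng = np.random.default_rng(7)
n, reps = 1_000, 5_000
a = np.arange(1, n + 1, dtype=float)
Y = rng.choice([-1.0, 1.0], size=(reps, n))
S = (Y * a).sum(axis=1) / np.linalg.norm(a)

spread = S.std()                      # close to 1
coverage = np.mean(np.abs(S) < 1.96)  # close to 0.95
print(spread, coverage)
```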

## New approximations from old ones

These theorems are used to approximate complicated distributions by simpler ones. Here are some transformation results that let us obtain new approximations from the old ones.

### Continuous mapping theorem

Let $\fun{g}: \realset \to \realset$ be a continuous function. Then:

$$\rva{Y}_n \dconv \rva{Y} \implies \fun{g}(\rva{Y}_n) \dconv \fun{g}(\rva{Y}) \qquad\text{and}\qquad \rva{Y}_n \pconv \rva{Y} \implies \fun{g}(\rva{Y}_n) \pconv \fun{g}(\rva{Y})$$
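A quick sketch (assuming NumPy; $g(x) = x^2$ and Uniform$(0,1)$ draws are illustrative assumptions): the sample mean $p$-converges to $1/2$, so its square $p$-converges to $1/4$.

```python
import numpy as np

# Continuous mapping sketch: Ybar_n -> 1/2 in probability, and with
# g(x) = x**2 continuous, g(Ybar_n) -> g(1/2) = 1/4 in probability.
rng = np.random.default_rng(3)
n = 1_000_000
ybar = rng.uniform(size=n).mean()
err = abs(ybar**2 - 0.25)
print(err)  # small
```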

### Slutsky’s theorem

Let $\fun{g}: \realset \times \realset \to \realset$ be a continuous function, $(\rva{Y}_n)_n$ and $(\rva{X}_n)_n$ two sequences of random variables, and $c \in \realset$ a constant. Then:

$$\rva{X}_n \pconv c \text{ and } \rva{Y}_n \dconv \rva{Y} \implies \fun{g}(\rva{X}_n, \rva{Y}_n) \dconv \fun{g}(c, \rva{Y})$$

The continuous mapping theorem would be applicable if the joint distribution of $\rve{Z}_n = (\rva{X}_n, \rva{Y}_n)$ $d$-converged to that of $\rve{Z} = (c, \rva{Y})$. But Slutsky's theorem is a stronger result because we only assume marginal convergence.
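A classic use, sketched with NumPy (the Exponential$(1)$ population is an illustrative assumption): studentizing the sample mean with the estimated standard deviation. Here $\hat{\sigma}_n \pconv \sigma$ plays the role of $\rva{X}_n \pconv c$.

```python
import numpy as np

# Slutsky sketch: sqrt(n) * (Ybar - mu) / sigma_hat is still approximately
# N(0, 1) because sigma_hat p-converges to the true sigma (here mu = sigma = 1).
rng = np.random.default_rng(5)
n, reps = 1_000, 5_000
X = rng.exponential(scale=1.0, size=(reps, n))
t = np.sqrt(n) * (X.mean(axis=1) - 1.0) / X.std(axis=1, ddof=1)

spread = t.std()                      # close to 1
coverage = np.mean(np.abs(t) < 1.96)  # close to 0.95
print(spread, coverage)
```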

### The delta method

Let $\rve{Z}_n = a_n\,(\rve{X}_n - \vec{u}) \dconv \rve{Z}$ where $\vec{u} \in \realvset{d}$, $a_n \in \realset$ and $a_n \to \infty$. Let $\fun{g}: \realvset{d} \to \realvset{p}$ be continuously differentiable at the point $\vec{u}$. Then:

$$a_n \paren{\fun{g}(\rve{X}_n) - \fun{g}(\vec{u})} \dconv \frac{\partial}{\partial \vec{u}} \fun{g}(\vec{u}) \, \rve{Z}$$

where $\frac{\partial}{\partial \vec{u}} \fun{g}(\vec{u})$ is the derivative of $\fun{g}$ at the point $\vec{u}$. When the dimension is $1$, this is the usual derivative $\fun{g}'(u)$; otherwise it is the Jacobian matrix.
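A simulation sketch (assuming NumPy; $g(x) = x^2$ applied to Uniform$(0,1)$ sample means is an illustrative assumption, with $u = 1/2$, $\sigma^2 = 1/12$ and $g'(u) = 1$): the limiting standard deviation should be $\sqrt{1/12} \approx 0.2887$.

```python
import numpy as np

# Delta method sketch: for Ybar the mean of n Uniform(0, 1) draws and
# g(x) = x**2, sqrt(n) * (g(Ybar) - g(1/2)) is approximately
# N(0, g'(1/2)**2 * 1/12) = N(0, 1/12), std ~ 0.2887.
rng = np.random.default_rng(9)
n, reps = 2_000, 5_000
ybar = rng.uniform(size=(reps, n)).mean(axis=1)
D = np.sqrt(n) * (ybar**2 - 0.25)

spread = D.std()
print(spread)  # close to sqrt(1/12) ~ 0.2887
```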