MLE: an information theory viewpoint

$\def\sa{a} \def\sb{b} \def\sc{c} \def\sd{d} \def\se{e} \def\sf{f} \def\sg{g} \def\sh{h} \def\si{i} \def\sj{j} \def\sk{k} \def\sl{l} \def\sm{m} \def\sn{n} \def\so{o} \def\sp{p} \def\sq{q} \def\sr{r} \def\ss{s} \def\st{t} \def\su{u} \def\sv{v} \def\sw{w} \def\sx{x} \def\sy{y} \def\sz{z} \def\va{\vec{a}} \def\vb{\vec{b}} \def\vc{\vec{c}} \def\vd{\vec{d}} \def\ve{\vec{e}} \def\vf{\vec{f}} \def\vg{\vec{g}} \def\vh{\vec{h}} \def\vi{\vec{i}} \def\vj{\vec{j}} \def\vk{\vec{k}} \def\vl{\vec{l}} \def\vm{\vec{m}} \def\vn{\vec{n}} \def\vo{\vec{o}} \def\vp{\vec{p}} \def\vq{\vec{q}} \def\vr{\vec{r}} \def\vs{\vec{s}} \def\vt{\vec{t}} \def\vu{\vec{u}} \def\vv{\vec{v}} \def\vw{\vec{w}} \def\vx{\vec{x}} \def\vy{\vec{y}} \def\vz{\vec{z}} \def\ga{\mathfrak{A}} \def\gb{\mathfrak{B}} \def\gc{\mathfrak{C}} \def\gd{\mathfrak{D}} \def\ge{\mathfrak{E}} \def\gf{\mathfrak{F}} \def\gg{\mathfrak{G}} \def\gh{\mathfrak{H}} \def\gi{\mathfrak{I}} \def\gj{\mathfrak{J}} \def\gk{\mathfrak{K}} \def\gl{\mathfrak{L}} \def\gm{\mathfrak{M}} \def\gn{\mathfrak{N}} \def\go{\mathfrak{O}} \def\gp{\mathfrak{P}} \def\gq{\mathfrak{Q}} \def\gr{\mathfrak{R}} \def\gs{\mathfrak{S}} \def\gt{\mathfrak{T}} \def\gu{\mathfrak{U}} \def\gv{\mathfrak{V}} \def\gw{\mathfrak{W}} \def\gx{\mathfrak{X}} \def\gy{\mathfrak{Y}} \def\gz{\mathfrak{Z}} \def\ra{A} \def\rb{B} \def\rc{C} \def\rd{D} \def\re{E} \def\rf{F} \def\rg{G} \def\rh{H} \def\ri{I} \def\rj{J} \def\rk{K} \def\rl{L} \def\rm{M} \def\rn{N} \def\ro{O} \def\rp{P} \def\rq{Q} \def\rr{R} \def\rs{S} \def\rt{T} \def\ru{U} \def\rv{V} \def\rw{W} \def\rx{X} \def\ry{Y} \def\rz{Z} \def\rva{\vec{A}} \def\rvb{\vec{B}} \def\rvc{\vec{C}} \def\rvd{\vec{D}} \def\rve{\vec{E}} \def\rvf{\vec{F}} \def\rvg{\vec{G}} \def\rvh{\vec{H}} \def\rvi{\vec{I}} \def\rvj{\vec{J}} \def\rvk{\vec{K}} \def\rvl{\vec{L}} \def\rvm{\vec{M}} \def\rvn{\vec{N}} \def\rvo{\vec{O}} \def\rvp{\vec{P}} \def\rvq{\vec{Q}} \def\rvr{\vec{R}} \def\rvs{\vec{S}} \def\rvt{\vec{T}} \def\rvu{\vec{U}} \def\rvv{\vec{V}} 
\def\rvw{\vec{W}} \def\rvx{\vec{X}} \def\rvy{\vec{Y}} \def\rvz{\vec{Z}} \def\seta{A} \def\setb{B} \def\setc{C} \def\setd{D} \def\sete{E} \def\setf{F} \def\setg{G} \def\seth{H} \def\seti{I} \def\setj{J} \def\setk{K} \def\setl{L} \def\setm{M} \def\setn{N} \def\seto{O} \def\setp{P} \def\setq{Q} \def\setr{R} \def\sets{S} \def\sett{T} \def\setu{U} \def\setv{V} \def\setw{W} \def\setx{X} \def\sety{Y} \def\setz{Z} \def\fa{a} \def\fb{b} \def\fc{c} \def\fd{d} \def\fe{e} \def\ff{f} \def\fg{g} \def\fh{h} \def\fi{i} \def\fj{j} \def\fk{k} \def\fl{l} \def\fm{m} \def\fn{n} \def\fo{o} \def\fp{p} \def\fq{q} \def\fr{r} \def\fs{s} \def\ft{t} \def\fu{u} \def\fv{v} \def\fw{w} \def\fx{x} \def\fy{y} \def\fz{z} \def\fA{A} \def\fB{B} \def\fC{C} \def\fD{D} \def\fE{E} \def\fF{F} \def\fG{G} \def\fH{H} \def\fI{I} \def\fJ{J} \def\fK{K} \def\fL{L} \def\fM{M} \def\fN{N} \def\fO{O} \def\fP{P} \def\fQ{Q} \def\fR{R} \def\fS{S} \def\fT{T} \def\fU{U} \def\fV{V} \def\fW{W} \def\fX{X} \def\fY{Y} \def\fZ{Z} \def\ma{A} \def\mb{B} \def\mc{C} \def\md{D} \def\me{E} \def\mf{F} \def\mg{G} \def\mh{H} \def\mi{I} \def\mj{J} \def\mk{K} \def\ml{L} \def\mm{M} \def\mn{N} \def\mo{O} \def\mp{P} \def\mq{Q} \def\mr{R} \def\ms{S} \def\mt{T} \def\matu{U} \def\mv{V} \def\mw{W} \def\mx{X} \def\my{Y} \def\mz{Z} \def\loss{\mathcal{L}} \newcommand{\dkl}[2]{D_{\text{KL}}\mathopen{}\paren{#1\,||\,#2}} \newcommand{\dataset}{S} \newcommand{\ndataset}{N} \newcommand{\idataset}{n} \newcommand{\inputRV}{\mathcal{X}} \newcommand{\inputvec}{\vec{x}} \newcommand{\ninputvec}[1]{\vec{x}_{#1}} \newcommand{\iinputvec}[1]{x_{#1}} \newcommand{\niinputvec}[2]{x_{#1, #2}} \newcommand{\icpnt}{i} \newcommand{\inputmatrix}{X} \newcommand{\inputdim}{D} \newcommand{\outputval}{y} \newcommand{\ioutputval}[1]{y_{#1}} \newcommand{\outputvec}{\vec{y}} \newcommand{\trainset}{S_{\text{train}}} \newcommand{\testset}{S_{\text{test}}} \newcommand{\truemodel}{f_{\text{true}}} \newcommand{\trainedmodel}{f_{\trainset}} \newcommand{\linmodel}[1]{f_{#1}} 
\newcommand{\bestmodel}{f^{*}} \newcommand{\model}{f} \newcommand{\hyperparam}{\lambda} \newcommand{\linparamv}{\vec{w}} \newcommand{\ilinparam}[1]{w_{#1}} \newcommand{\indivloss}{l} \newcommand{\modelclass}{\mathcal{F}} \newcommand{\linclass}{\modelclass_{\text{lin}}} \newcommand{\g}{\mathcal{G}} \newcommand{\gmse}{\g_{\text{MSE}}} \newcommand{\glasso}{\g_{\text{lasso}}} \newcommand{\gridge}{\g_{\text{ridge}}} \newcommand{\glogit}{\g_{\logit}} \newcommand{\l}{\mathcal{L}} \newcommand{\lmse}{\l_{\text{MSE}}} \newcommand{\lmae}{\l_{\text{MAE}}} \newcommand{\llasso}{\l_{\text{lasso}}} \newcommand{\lridge}{\l_{\text{ridge}}} \newcommand{\llogit}{\l_{\logit}} \newcommand{\logit}{\sigma} \newcommand{\reg}{\mathcal{R}} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\mean}{mean} \DeclareMathOperator*{\avg}{avg} \DeclareMathOperator*{\span}{span} \DeclareMathOperator*{\var}{var} \DeclareMathOperator*{\bias}{bias} \newcommand{\expectation}{\mathbb{E}} \newcommand{\brak}[1]{\left[#1\right]} \newcommand{\paren}[1]{\left(#1\right)} \newcommand{\realset}{\mathbb{R}} \newcommand{\realvset}[1]{\realset^{#1}} \newcommand{\prob}{\mathbb{P}} \newcommand{\gaussian}{\mathcal{N}} \newcommand{\iid}{\stackrel{\text{i.i.d.}}{\sim}} \newcommand{\abs}[1]{\left\lvert#1\right\rvert} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\normtwo}[1]{\norm{#1}_{2}} \newcommand{\normone}[1]{\norm{#1}_{1}} \newcommand{\card}[1]{\left\lvert#1\right\rvert} \newcommand{\grad}{\nabla} \newcommand{\dconv}{\stackrel{d}{\to}} \newcommand{\pconv}{\stackrel{p}{\to}} \newcommand{\rva}[1]{#1} \newcommand{\rve}[1]{\vec{#1}} \newcommand{\obs}[1]{#1} \newcommand{\vobs}[1]{\vec{#1}} \newcommand{\distrib}[1]{#1} \newcommand{\distribof}[2]{#1_{#2}} \newcommand{\density}[1]{#1} \newcommand{\densityof}[2]{#1_{#2}} \newcommand{\distributed}{\sim} \newcommand{\const}[1]{#1} \newcommand{\fun}[1]{#1}$

We show that the MLE is obtained by minimizing the KL divergence from the empirical distribution, and we interpret what this means.

The main tool for our information theory viewpoint is the Kullback-Leibler divergence.

Let

$\rvx = (\rx_1, \dotsc, \rx_{\sn})$

be $\sn$ i.i.d. random variables with distribution

$\fp_{\rx\mid\theta}(\sx \mid \theta)$

where the parameter $\theta$ is unknown and assumed to lie in the set $\Theta$. Denote by

$\vx = (\sx_1, \dotsc, \sx_{\sn})$

an observation from these random variables. The joint distribution is:

$\fp_{\rvx\mid\theta}(\vx \mid \theta) = \prod_{\si = 1}^{\sn}\fp_{\rx\mid\theta}(\sx_{\si} \mid \theta)$
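The factorization above can be sketched numerically. The following is a minimal illustration, assuming a Bernoulli model for $\fp_{\rx\mid\theta}$; the function names and the sample are hypothetical and chosen only for this example. It shows that the joint probability of an i.i.d. sample is the product of the individual probabilities, and that its logarithm is the sum of the individual log-probabilities.

```python
import math

def bernoulli_pmf(x, t):
    """Assumed model p_{X|theta}(x | t): Bernoulli with parameter t, x in {0, 1}."""
    return t if x == 1 else 1.0 - t

def joint_likelihood(xs, t):
    """Joint probability of an i.i.d. sample: product of the individual terms."""
    return math.prod(bernoulli_pmf(x, t) for x in xs)

def log_likelihood(xs, t):
    """Logarithm of the joint probability: sum of the individual log terms."""
    return sum(math.log(bernoulli_pmf(x, t)) for x in xs)

sample = [1, 0, 0, 1, 0]  # hypothetical observation
print(joint_likelihood(sample, 0.4))          # 0.4**2 * 0.6**3
print(math.exp(log_likelihood(sample, 0.4)))  # same value, up to rounding
```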

We will show that the MLE $\hat{\theta}$ of $\theta$ minimizes the KL divergence of the model distribution from the empirical distribution $\fp_\text{emp}$ .

Divergence from the empirical distribution

Let $\vx$ be an observation and $\fp_{\text{emp}}$ the corresponding empirical distribution. In this section we show that:

$\hat{\theta} = \argmin_{\st \in \Theta} \dkl{\fp_{\text{emp}}}{\fp_{\rx\mid\st}}$

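By definition, the MLE maximizes the likelihood $\fp_{\rvx\mid\theta}(\vx \mid \st)$ over $\st \in \Theta$. Since the logarithm is increasing and the sample is i.i.d., this is the same as maximizing the sum of the log-probabilities:

$\hat{\theta} = \argmax_{\st \in \Theta} \sum_{\si = 1}^{\sn} \ln \fp_{\rx\mid\theta}(\sx_{\si} \mid \st)$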

For $\sa > 0$ the function $\su \mapsto \sa\su-\sb$ is increasing, so as long as $\sa$ and $\sb$ do not depend on $\st$ the maximizer is unchanged:

$\hat{\theta} = \argmax \sa \sum_{\si = 1}^{\sn} \ln \fp_{\rx\mid\theta}(\sx_{\si} \mid \st) - \sb$

Define $\sn(\sx_{\si})$ as the number of occurrences of $\sx_{\si}$ among the components of $\vx$ . Take $\sa = \frac{1}{\sn}$ and $\sb = \frac{1}{\sn} \sum_{\si = 1}^{\sn} \ln\paren{\frac{\sn(\sx_{\si})}{\sn}}$ ; both depend only on the data, not on $\st$ . We have:

$\begin{align*} & \argmax \sum_{\si = 1}^{\sn} \sa \ln \fp_{\rx\mid\theta}(\sx_{\si} \mid \st) - \sb \\ = & \argmax \sum_{\si = 1}^{\sn} \frac{1}{\sn} \ln\paren{ \frac{ \fp_{\rx\mid\theta}(\sx_{\si} \mid \st) }{ \sn(\sx_{\si}) / \sn } } \end{align*}$

We now group together the indices $\si$ and $\sj$ for which $\sx_\si = \sx_\sj$ (each distinct value $\sx_{\si}$ accounts for $\sn(\sx_{\si})$ of them). Let $\sets = \{\sx_{\si} \mid \si \leq \sn\}$ be the set of distinct observed values:

$\begin{align*} \hat{\theta} = & \argmax \sum_{\sx \in \sets} \frac{ \sn(\sx) }{ \sn } \ln\paren{ \frac{ \fp_{\rx\mid\theta}(\sx \mid \st) }{ \sn(\sx) / \sn } } \\ = & \argmax \sum_{\sx \in \sets} \fp_\text{emp}(\sx) \ln\paren{ \frac{ \fp_{\rx\mid\theta}(\sx \mid \st) }{ \fp_\text{emp}(\sx) } } \\ = & \argmin \sum_{\sx \in \sets} \fp_\text{emp}(\sx) \ln\paren{ \frac{ \fp_\text{emp}(\sx) }{ \fp_{\rx\mid\theta}(\sx \mid \st) } } \\ = & \argmin_{\st \in \Theta} \dkl{ \fp_\text{emp} }{ \fp_{\rx\mid\st} } \end{align*}$
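The chain of equalities can be checked numerically. The sketch below assumes a hypothetical one-parameter family (a Binomial(2, t) distribution on the values 0, 1, 2) and a made-up sample; it searches a grid of candidate parameters and compares the maximizer of the log-likelihood with the minimizer of the KL divergence from the empirical distribution.

```python
import math
from collections import Counter

def model_pmf(t):
    """Hypothetical model family: Binomial(2, t) probabilities on {0, 1, 2}."""
    return {0: (1 - t) ** 2, 1: 2 * t * (1 - t), 2: t ** 2}

def log_likelihood(xs, t):
    """Sum of log-probabilities of the sample under parameter t."""
    p = model_pmf(t)
    return sum(math.log(p[x]) for x in xs)

def kl_from_empirical(xs, t):
    """KL divergence (in nats) of the model at t from the empirical distribution."""
    n = len(xs)
    p = model_pmf(t)
    return sum((c / n) * math.log((c / n) / p[x]) for x, c in Counter(xs).items())

sample = [0, 1, 1, 2, 1, 0, 1, 2, 2, 1]      # hypothetical observation
grid = [k / 1000 for k in range(1, 1000)]    # candidate parameters in (0, 1)

t_mle = max(grid, key=lambda t: log_likelihood(sample, t))
t_kl = min(grid, key=lambda t: kl_from_empirical(sample, t))
print(t_mle, t_kl)  # both grid searches select the same parameter
```

For this family the maximizer has the closed form `sum(sample) / (2 * len(sample))`, which gives a quick sanity check on the grid search.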

MLE from an information theory viewpoint

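The KL divergence $\dkl{p}{q}$ is the average number of extra bits (with base-2 logarithms; nats with the natural logarithm) needed to encode data drawn from the true distribution $p$ when the code is optimized for the distribution $q$ instead. In this reading, the MLE selects the model distribution that wastes the fewest extra bits when encoding the observed sample.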
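The coding interpretation can be made concrete with a small sketch. The two distributions below are made up for illustration: the data follow `p`, while the code is designed for `q`. The average code length is then the cross-entropy, and the overhead relative to the entropy of `p` is exactly the KL divergence.

```python
import math

def entropy_bits(p):
    """Average code length (bits) of an optimal code for p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    """Average code length (bits) when data follow p but the code is built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_bits(p, q):
    """KL divergence of q from p, in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]    # hypothetical true distribution
q = [1 / 3, 1 / 3, 1 / 3]  # hypothetical coding distribution

print(entropy_bits(p))                               # 1.5
print(cross_entropy_bits(p, q) - entropy_bits(p))    # extra bits per symbol
print(kl_bits(p, q))                                 # same quantity
```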

Divergence from empirical distribution

For some value $\sy$ , define $\sn(\sy)$ as the number of times that $\sy$ appears in $\vx$ :

$\sn(\sy) = \text{card}\{\si \mid \sx_\si = \sy\}$

The empirical distribution $\def\femp{f_\text{emp}} \femp$ is the probability distribution that assigns weight $\frac{1}{\sn}$ to each occurrence in $\vx$ :

$\femp(\sy) = \frac{\sn(\sy)}{\sn}$
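As a minimal sketch of this definition, the empirical distribution can be computed by counting occurrences and dividing by the sample size. The helper name and the sample are hypothetical.

```python
from collections import Counter

def empirical_distribution(xs):
    """Map each observed value y to n(y) / n, its relative frequency in xs."""
    n = len(xs)
    return {value: count / n for value, count in Counter(xs).items()}

sample = ["a", "b", "a", "c", "a", "b"]  # hypothetical observation
f_emp = empirical_distribution(sample)
print(f_emp)  # relative frequencies: a -> 3/6, b -> 2/6, c -> 1/6
```

Values that never appear in the sample receive probability zero, which is why the sums in the derivation only range over observed values.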

Define the joint empirical distribution of a vector $\vy = (\sy_1, \dotsc, \sy_{\sn})$ as the product:

$\femp(\vy) = \prod_{\si = 1}^{\sn} \femp(\sy_\si)$

Let’s use the following notation:

$f_{\vert\st}(\vx) = f_{\rvx}(\vx \mid \theta = \st)$

Over $\st \in \Theta$ , the Kullback-Leibler divergence of $f_{\vert\st}$ from $\femp$ is minimized at $\st = \hat{\theta}$ , where $\hat{\theta}$ is the MLE of $\theta$ . In other words:

$\hat{\theta} = \argmin_{\st \in \Theta} \dkl{\femp}{f_{\vert\st}}$
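The joint divergence and the per-observation divergence differ only by the factor $\sn$ , so they share the same minimizer. Here is a short derivation sketch, assuming $f_{\vert\st}$ factorizes over the i.i.d. components and writing $f_{\rx}(\sy \mid \theta = \st)$ for the single-observation density (a notation introduced here for illustration only):

$\begin{align*} \dkl{\femp}{f_{\vert\st}} = & \sum_{\vy} \femp(\vy) \sum_{\si = 1}^{\sn} \ln\paren{ \frac{ \femp(\sy_\si) }{ f_{\rx}(\sy_\si \mid \theta = \st) } } \\ = & \sum_{\si = 1}^{\sn} \sum_{\sy_\si} \femp(\sy_\si) \ln\paren{ \frac{ \femp(\sy_\si) }{ f_{\rx}(\sy_\si \mid \theta = \st) } } \\ = & \sn \sum_{\sy} \femp(\sy) \ln\paren{ \frac{ \femp(\sy) }{ f_{\rx}(\sy \mid \theta = \st) } } \end{align*}$

The second line uses the fact that each marginal of the joint empirical distribution sums to one, so the components other than $\sy_\si$ drop out of the $\si$ -th term.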

Divergence from the true distribution

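According to our model, $f_{\vert\theta}(\vx) = f_{\rvx}(\vx \mid \theta)$ is the true distribution that generated our data. Assuming the observations are i.i.d., we have another divergence result, this time an asymptotic one. By the law of large numbers, the normalized log-likelihood $\frac{1}{\sn} \ln f_{\vert\st}(\vx)$ converges to its expectation under the true distribution, which equals a term that does not depend on $\st$ minus the per-observation Kullback-Leibler divergence of $f_{\vert\st}$ from $f_{\vert\theta}$ . Under suitable regularity conditions, the MLE therefore converges to the minimizer of the divergence between the estimated and the true distribution:

$\hat{\theta} \pconv \argmin_{\st \in \Theta} \dkl{f_{\vert\theta}}{f_{\vert\st}}$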
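A small simulation can illustrate how the divergence between the fitted and the true distribution behaves as the sample grows. This is only a sketch under stated assumptions: the model is Bernoulli, the true parameter 0.3 and the seed are arbitrary choices, and the helper names are hypothetical.

```python
import math
import random

def kl_bernoulli(theta, t):
    """KL divergence (nats) of Bernoulli(t) from Bernoulli(theta), per observation."""
    return (theta * math.log(theta / t)
            + (1 - theta) * math.log((1 - theta) / (1 - t)))

def simulate_mle(n, theta_true=0.3, seed=0):
    """Draw n i.i.d. Bernoulli(theta_true) values and return the MLE (sample mean)."""
    rng = random.Random(seed)
    ones = sum(1 for _ in range(n) if rng.random() < theta_true)
    t_hat = ones / n
    # Clamp away from 0 and 1 so the logarithms stay finite (illustration only).
    return min(max(t_hat, 1e-9), 1 - 1e-9)

for n in (100, 10_000, 100_000):
    t_hat = simulate_mle(n)
    print(n, round(t_hat, 4), kl_bernoulli(0.3, t_hat))
```

With larger samples the estimate typically moves closer to the true parameter and the printed divergence shrinks toward zero, although any single run is subject to sampling noise.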