# MLE: an information theory viewpoint

We show that the MLE is obtained by minimizing the Kullback-Leibler divergence from an empirical distribution, and we interpret what this means.

The main tool for our information theory viewpoint is the Kullback-Leibler divergence.

Let $\rvx = (\rvx_1, \ldots, \rvx_{\sn})$ be a vector of $\sn$ i.i.d. random variables with common distribution $f(\cdot \mid \theta)$, where the parameter $\theta$ is unknown and assumed to lie in the set $\Theta$. Write

$$\vx = (\sx_1, \ldots, \sx_{\sn})$$

for an observation from these random variables. The joint distribution is:

$$f_{\rvx}(\vx \mid \theta) = \prod_{\si=1}^{\sn} f(\sx_{\si} \mid \theta)$$

We will show that the MLE $\hat{\theta}$ of $\theta$ minimizes the Kullback-Leibler divergence of our estimated distribution from the empirical distribution $\fp_\text{emp}$.

## Divergence from the empirical distribution

Let $\vx$ be an observation and $\fp_{\text{emp}}$ the corresponding empirical distribution. In this section we show that:

$$\hat{\theta} = \arg\max_{\theta \in \Theta} f_{\rvx}(\vx \mid \theta) = \arg\min_{\theta \in \Theta} D_{\mathrm{KL}}\paren{\fp_{\text{emp}} \,\Vert\, f(\cdot \mid \theta)}$$

Since the sample is i.i.d. and $\ln$ is increasing, we have:

$$\hat{\theta} = \arg\max_{\theta \in \Theta} \prod_{\si=1}^{\sn} f(\sx_{\si} \mid \theta) = \arg\max_{\theta \in \Theta} \sum_{\si=1}^{\sn} \ln f(\sx_{\si} \mid \theta)$$

For $\sa > 0$, the function $\sx \mapsto \sa\sx - \sb$ is increasing and independent of $\theta$, so applying it to the objective does not change the maximizer:

$$\hat{\theta} = \arg\max_{\theta \in \Theta} \paren{\sa \sum_{\si=1}^{\sn} \ln f(\sx_{\si} \mid \theta) - \sb}$$

Define $\sn(\sx_{\si})$ as the number of occurrences of $\sx_{\si}$ among the components of $\vx$. Take $\sa = \frac{1}{\sn}$ and $\sb = \frac{1}{\sn} \sum_{\si = 1}^{\sn} \ln\paren{\frac{\sn(\sx_{\si})}{\sn}}$, which does not depend on $\theta$. We have:

$$\hat{\theta} = \arg\max_{\theta \in \Theta} \frac{1}{\sn} \sum_{\si=1}^{\sn} \ln\paren{\frac{f(\sx_{\si} \mid \theta)}{\sn(\sx_{\si})/\sn}}$$

And now we regroup the indices $\si$ and $\sj$ whenever $\sx_\si = \sx_\sj$ (there are $\sn(\sx_{\si})$ of them). Let $\sets = \{\sx_{\si} \mid \si \leq \sn\}$:

$$\hat{\theta} = \arg\max_{\theta \in \Theta} \sum_{\sx \in \sets} \frac{\sn(\sx)}{\sn} \ln\paren{\frac{f(\sx \mid \theta)}{\sn(\sx)/\sn}} = \arg\min_{\theta \in \Theta} D_{\mathrm{KL}}\paren{\fp_{\text{emp}} \,\Vert\, f(\cdot \mid \theta)}$$

The last equality holds because $\fp_{\text{emp}}(\sx) = \frac{\sn(\sx)}{\sn}$, so the sum is exactly $-D_{\mathrm{KL}}\paren{\fp_{\text{emp}} \,\Vert\, f(\cdot \mid \theta)}$.
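As a sanity check, the equivalence can be verified numerically. The sketch below uses a toy Bernoulli model with a made-up sample (the model, sample, and parameter grid are illustrative assumptions, not from the text): it maximizes the average log-likelihood and minimizes the KL divergence from the empirical distribution over the same grid, and the two optimizers coincide.

```python
import math
from collections import Counter

# Hypothetical Bernoulli sample (illustrative, not from the text):
# the model is f(x | theta) = theta^x * (1 - theta)^(1 - x).
x = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
n = len(x)
counts = Counter(x)  # n(x_i): occurrence counts

def avg_log_likelihood(theta):
    """(1/n) * sum_i ln f(x_i | theta)."""
    return sum(math.log(theta if xi == 1 else 1 - theta) for xi in x) / n

def kl_from_empirical(theta):
    """D_KL(f_emp || f(. | theta)) = sum_x (n(x)/n) * ln[(n(x)/n) / f(x | theta)]."""
    total = 0.0
    for value, count in counts.items():
        p_emp = count / n
        p_model = theta if value == 1 else 1 - theta
        total += p_emp * math.log(p_emp / p_model)
    return total

grid = [i / 100 for i in range(1, 100)]  # theta in (0, 1)
theta_mle = max(grid, key=avg_log_likelihood)
theta_kl = min(grid, key=kl_from_empirical)
```

On this sample both searches return 0.7, the empirical frequency of ones, as the derivation predicts.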

## MLE from an information theory viewpoint

The KL divergence is the average number of extra bits needed to encode the data, because we used the distribution $q$ to encode the data instead of the true distribution $p$.
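To make the coding interpretation concrete, here is a small sketch (the two distributions are made-up illustrations) that measures the divergence in bits via the identity $D_{\mathrm{KL}}(p \,\Vert\, q) = H(p, q) - H(p)$, the gap between cross-entropy and entropy:

```python
import math

# Hypothetical distributions over four symbols (illustrative values).
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}  # true distribution
q = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}   # distribution used to build the code

# Cross-entropy H(p, q): average code length when the code is optimal
# for q but the symbols are actually drawn from p.
cross_entropy = -sum(p[s] * math.log2(q[s]) for s in p)

# Entropy H(p): the best achievable average code length for p.
entropy = -sum(p[s] * math.log2(p[s]) for s in p)

# KL(p || q): average number of extra bits per symbol.
kl_bits = cross_entropy - entropy
```

Here the optimal code for $p$ averages 1.75 bits per symbol while the code built for $q$ averages 2 bits, so the divergence is 0.25 extra bits per symbol.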

### Divergence from empirical distribution

For some value $\sy$, define $\sn(\sy)$ as the number of times that $\sy$ appears in $\vx$:

$$\sn(\sy) = \left|\{\si \leq \sn \mid \sx_{\si} = \sy\}\right|$$

The empirical distribution $\def\femp{f_\text{emp}} \femp$ is the probability distribution that puts weight $\frac{1}{\sn}$ on each occurrence in $\vx$:

$$\femp(\sy) = \frac{\sn(\sy)}{\sn}$$
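The counts $\sn(\sy)$ and the weights $\frac{\sn(\sy)}{\sn}$ are straightforward to compute; a minimal sketch, with a made-up sample for illustration:

```python
from collections import Counter

# Hypothetical observation vector x (illustrative).
x = ["a", "b", "a", "c", "a", "b"]
n = len(x)

# n(y): number of times y appears among the components of x.
counts = Counter(x)

# f_emp(y) = n(y) / n: each occurrence contributes weight 1/n.
f_emp = {y: c / n for y, c in counts.items()}
```

The weights sum to 1, so $\femp$ is indeed a probability distribution.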

And define the joint empirical distribution as the product of the marginals:

$$\femp(\vy) = \prod_{\si=1}^{\sn} \femp(\sy_{\si}), \qquad \vy = (\sy_1, \ldots, \sy_{\sn})$$

Let’s use the following notation:

$$f_{\vert\theta}(\vx) = f_{\rvx}(\vx \mid \theta)$$

The Kullback-Leibler divergence of $f_{\vert\theta}$ from $\femp$ is minimized by $f_{\vert\hat{\theta}}$, where $\hat{\theta}$ is the MLE of $\theta$. In other words:

$$\hat{\theta} = \arg\min_{\theta \in \Theta} D_{\mathrm{KL}}\paren{\femp \,\Vert\, f_{\vert\theta}}$$

### Divergence from the true distribution

According to our model, $f_{\vert\theta}(\vx) = f_{\rvx}(\vx \mid \theta)$ is the true distribution that generated our data. Assuming the observations are i.i.d., we have another divergence result: the MLE minimizes the Kullback-Leibler divergence between our estimated distribution $f_{\vert\hat{\theta}}$ and the true distribution $f_{\vert\theta}$:
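A population-level sanity check of this last point: when the empirical distribution is replaced by the true one (the large-sample limit), the divergence $D_{\mathrm{KL}}\paren{f_{\vert\theta} \,\Vert\, f_{\vert\theta'}}$ over candidate parameters $\theta'$ is minimized, and vanishes, at $\theta' = \theta$. The sketch below checks this on a grid for a Bernoulli model with an assumed true parameter (all values are illustrative):

```python
import math

theta_star = 0.3  # assumed true Bernoulli parameter (illustrative)

def kl_bernoulli(p, q):
    """D_KL(Bernoulli(p) || Bernoulli(q))."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Scan a grid of candidate parameters and keep the KL minimizer.
grid = [i / 100 for i in range(1, 100)]
best = min(grid, key=lambda t: kl_bernoulli(theta_star, t))
```

The minimizer is the true parameter itself, where the divergence is zero; with finitely many observations the MLE only approaches it as the empirical distribution concentrates around the truth.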