MLE: an information theory viewpoint

We show that the MLE is obtained by minimizing the Kullback-Leibler divergence from the empirical distribution, and we interpret what this means.

The main tool for our information theory viewpoint is the Kullback-Leibler divergence.

Let

$$X_1, \dots, X_n$$

be i.i.d. random variables with distribution $p_\theta$, where the parameter $\theta$ is unknown and assumed to lie in the set $\Theta$. Denote by

$$x = (x_1, \dots, x_n)$$

an observation from these random variables. The joint distribution is:

$$p_\theta(x) = \prod_{i=1}^{n} p_\theta(x_i)$$
We will show that the MLE $\hat\theta$ of $\theta$ minimizes the divergence $D(\hat p_x \,\|\, p_\theta)$ of our estimated distribution $p_\theta$ from the empirical distribution $\hat p_x$.
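
Since the observations are i.i.d., the joint likelihood factorizes into a product of marginals. A minimal sketch, assuming a hypothetical Bernoulli($\theta$) model:

```python
import math

def joint_likelihood(x, theta):
    """p_theta(x) = prod_i p_theta(x_i) for a Bernoulli(theta) model."""
    return math.prod(theta if xi == 1 else 1 - theta for xi in x)

def log_likelihood(x, theta):
    """log p_theta(x) = sum_i log p_theta(x_i)."""
    return sum(math.log(theta if xi == 1 else 1 - theta) for xi in x)

x = [1, 0, 1, 1, 0]                      # hypothetical sample
print(joint_likelihood(x, 0.6))          # product of the marginals
print(math.exp(log_likelihood(x, 0.6)))  # same value, computed via the log
```

In practice one works with the log-likelihood, since the raw product underflows quickly as $n$ grows.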

Divergence from the empirical distribution

Let $x$ be an observation and $\hat p_x$ the corresponding empirical distribution. In this section we show that:

$$\hat\theta = \arg\min_{\theta \in \Theta} D(\hat p_x \,\|\, p_\theta)$$
Since the sample is i.i.d., we have:

$$\hat\theta = \arg\max_{\theta \in \Theta} p_\theta(x) = \arg\max_{\theta \in \Theta} \prod_{i=1}^{n} p_\theta(x_i)$$
Since the function $t \mapsto \frac{1}{n} \log t$ is increasing and the factor $\frac{1}{n}$ is independent of $\theta$:

$$\hat\theta = \arg\max_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i)$$
Define $n_x(a)$ as the number of occurrences of the value $a$ among the components of $x$. Take two indices $i$ and $j$. We have:

$$x_i = x_j \implies \log p_\theta(x_i) = \log p_\theta(x_j)$$
And now we regroup the indices $i$ and $j$ such that $x_i = x_j$ (for each value $a$, there are $n_x(a)$ of them). Let $\hat p_x(a) = \frac{n_x(a)}{n}$:

$$\hat\theta = \arg\max_{\theta \in \Theta} \frac{1}{n} \sum_{a} n_x(a) \log p_\theta(a) = \arg\max_{\theta \in \Theta} \sum_{a} \hat p_x(a) \log p_\theta(a)$$

Subtracting $\sum_a \hat p_x(a) \log \hat p_x(a)$, which does not depend on $\theta$, and flipping the sign turns this maximization into a minimization:

$$\hat\theta = \arg\min_{\theta \in \Theta} \sum_{a} \hat p_x(a) \log \frac{\hat p_x(a)}{p_\theta(a)} = \arg\min_{\theta \in \Theta} D(\hat p_x \,\|\, p_\theta)$$
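
The regrouping step can be checked numerically: averaging the log-probabilities over the sample gives the same value as weighting each distinct value by its empirical frequency. A small sketch (the sample and model values below are hypothetical):

```python
from collections import Counter
import math

x = [2, 1, 2, 3, 2, 1]            # sample with repeated values
p = {1: 0.2, 2: 0.5, 3: 0.3}      # some candidate model p_theta

# Left-hand side: (1/n) * sum_i log p(x_i), one term per component
n = len(x)
lhs = sum(math.log(p[xi]) for xi in x) / n

# Right-hand side: sum_a p_hat(a) * log p(a), grouping equal components
counts = Counter(x)               # n_x(a)
rhs = sum((counts[a] / n) * math.log(p[a]) for a in counts)

print(abs(lhs - rhs) < 1e-12)     # the two sums agree
```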
MLE from an information theory viewpoint

The KL divergence $D(p \,\|\, q)$ is the average number of extra bits needed to encode the data, due to the fact that we used the distribution $q$ to encode the data instead of the true distribution $p$.

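This coding interpretation can be illustrated on a toy three-symbol alphabet (the distributions below are hypothetical): the cross-entropy $H(p, q)$ is the average code length when encoding with a code built for $q$, and the KL divergence is exactly the excess over the entropy $H(p)$.

```python
import math

p = {"a": 0.5, "b": 0.25, "c": 0.25}   # true distribution
q = {"a": 0.25, "b": 0.25, "c": 0.5}   # distribution used to build the code

H_p = -sum(p[s] * math.log2(p[s]) for s in p)       # optimal bits per symbol
cross = -sum(p[s] * math.log2(q[s]) for s in p)     # actual bits per symbol
kl = sum(p[s] * math.log2(p[s] / q[s]) for s in p)  # extra bits per symbol

print(H_p, cross, kl)   # D(p || q) = H(p, q) - H(p)
```
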
Divergence from empirical distribution

For some value $a$, define $n_x(a)$ as the number of times that $a$ appears in $x = (x_1, \dots, x_n)$:

$$n_x(a) = \#\{\, i : x_i = a \,\}$$
The empirical distribution $\hat p_x$ is the probability distribution that weights $\frac{1}{n}$ each occurrence in $x$:

$$\hat p_x(a) = \frac{n_x(a)}{n}$$
And define the joint empirical distribution of $y = (y_1, \dots, y_n)$ as:

$$\hat p_x(y) = \prod_{i=1}^{n} \hat p_x(y_i)$$
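
As a sketch, the counts $n_x(a)$, the empirical distribution $\hat p_x$, and the joint empirical distribution can be computed directly (the sample below is hypothetical):

```python
from collections import Counter
import math

x = ["a", "b", "a", "c", "a", "b"]    # a small sample
n = len(x)

n_x = Counter(x)                      # n_x(a): occurrences of each value a
p_hat = {a: n_x[a] / n for a in n_x}  # empirical distribution: weight 1/n per occurrence

def joint_empirical(y):
    """Joint empirical distribution: product of the marginal empirical weights."""
    return math.prod(p_hat[yi] for yi in y)

print(p_hat)                          # weights sum to 1
print(joint_empirical(["a", "b"]))
```
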
Let’s use the following notation for the MLE:

$$\hat\theta = \arg\max_{\theta \in \Theta} p_\theta(x)$$
The Kullback-Leibler divergence of $p_\theta$ from $\hat p_x$ is minimized at $\theta = \hat\theta$, where $\hat\theta$ is the MLE of $\theta$. In other words:

$$\hat\theta = \arg\min_{\theta \in \Theta} D(\hat p_x \,\|\, p_\theta)$$
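
We can verify this equivalence numerically on a hypothetical Bernoulli sample, comparing a grid search over the log-likelihood with a grid search over $D(\hat p_x \,\|\, p_\theta)$:

```python
from collections import Counter
import math

x = [1, 0, 1, 1, 0, 1, 1, 0]        # Bernoulli sample; the MLE is the sample mean 5/8
n = len(x)
counts = Counter(x)
p_hat = {a: counts[a] / n for a in counts}   # empirical distribution

def log_lik(theta):
    return sum(math.log(theta if xi == 1 else 1 - theta) for xi in x)

def kl_from_empirical(theta):
    p_theta = {1: theta, 0: 1 - theta}
    return sum(p_hat[a] * math.log(p_hat[a] / p_theta[a]) for a in p_hat)

grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=log_lik)           # maximize the likelihood
theta_kl = min(grid, key=kl_from_empirical)  # minimize D(p_hat || p_theta)
print(theta_mle, theta_kl)                   # both equal 0.625, the sample mean
```
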
Divergence from the true distribution

According to our model, $p_{\theta^\star}$ is the true distribution that generated our data, for some $\theta^\star \in \Theta$. Assuming the observations are i.i.d., we have another divergence result: asymptotically, the MLE minimizes the Kullback-Leibler divergence between our estimated distribution $p_{\hat\theta}$ and the true distribution $p_{\theta^\star}$:

$$\hat\theta \xrightarrow[n \to \infty]{} \arg\min_{\theta \in \Theta} D(p_{\theta^\star} \,\|\, p_\theta)$$

Indeed, by the law of large numbers $\hat p_x(a) \to p_{\theta^\star}(a)$ for every $a$, so minimizing $D(\hat p_x \,\|\, p_\theta)$ comes to resemble minimizing $D(p_{\theta^\star} \,\|\, p_\theta)$, which vanishes at $\theta = \theta^\star$.
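
A small simulation sketch of this asymptotic behaviour, assuming a Bernoulli model with a hypothetical true parameter: as $n$ grows, the divergence from the true distribution to the fitted one shrinks.

```python
import math
import random

random.seed(0)
theta_star = 0.3   # hypothetical true parameter

def kl_bernoulli(p, q):
    """D(Bernoulli(p) || Bernoulli(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

results = {}
for n in [10, 100, 10000]:
    x = [1 if random.random() < theta_star else 0 for _ in range(n)]
    # The Bernoulli MLE is the sample mean; clamp to avoid log(0) on tiny samples.
    theta_mle = min(max(sum(x) / n, 1e-6), 1 - 1e-6)
    results[n] = kl_bernoulli(theta_star, theta_mle)
    print(n, results[n])
```
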