The maximum likelihood estimator (MLE) is one of the most widely used estimators in statistics. In this article, we introduce this estimator and study its properties.
In a typical inference task, we have some data $(x_1, \dots, x_n)$ that we wish to understand better. The statistical approach is to model the source of these data as a random vector $\vec{X} = (X_1, \dots, X_n)$ whose outcomes are produced with joint probability $f_{\vec{X}}(\vec{x} \mid \theta)$, where $\theta \in \Theta$ is an unknown parameter.
Definitions
A maximum likelihood estimator for $\theta$ is an estimator that maximizes the probability of producing the sample we observed.
- Definition: likelihood
- The likelihood is the probability $f_{\vec{X}}$ seen as a function of $\theta$:
$$L_{\vec{x}}(\theta) = f_{\vec{X}}(\vec{x} \mid \theta)$$
- Definition: MLE
- When the likelihood admits a unique global maximum, the MLE $\hat{\theta}$ is:
$$\hat{\theta} = \operatorname*{arg\,max}_{\theta \in \Theta} L_{\vec{x}}(\theta)$$
In practice, we often maximize the log-likelihood instead of the likelihood. Since $\ln$ is a strictly increasing function, both have the same maximizer.
The log-likelihood is denoted $l$:
$$l_{\vec{x}}(\theta) = \ln L_{\vec{x}}(\theta) = \ln f_{\vec{X}}(\vec{x} \mid \theta)$$

Remarks:
- the likelihood is not the probability of $\theta$;
- maximizing the (posterior) probability of $\theta$ is called “maximum a posteriori estimation”.
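To make these definitions concrete, here is a minimal numerical sketch (our own illustration, not from the article: a normal model $\mathcal{N}(\mu, \sigma^2)$ fitted with NumPy and SciPy). We minimize the negative log-likelihood and compare the result with the known closed-form MLE of the normal model, the sample mean and the biased sample standard deviation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)  # i.i.d. sample, true (mu, sigma) = (2.0, 1.5)

def neg_log_likelihood(theta, x):
    """Negative log-likelihood of an i.i.d. N(mu, sigma^2) sample."""
    mu, sigma = theta
    if sigma <= 0:
        return np.inf  # keep the optimizer inside the parameter space
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2))

# Maximizing the likelihood = minimizing the negative log-likelihood.
result = minimize(neg_log_likelihood, x0=[0.0, 1.0], args=(x,), method="Nelder-Mead")
mu_hat, sigma_hat = result.x

# For the normal model the MLE is known in closed form: the sample mean
# and the biased (ddof=0) sample standard deviation.
print(mu_hat, x.mean())          # both approximately 2.0
print(sigma_hat, x.std(ddof=0))  # both approximately 1.5
```

Minimizing $-l_{\vec{x}}$ rather than maximizing $L_{\vec{x}}$ is the usual numerical route: optimizers minimize by convention, and the logarithm turns a product of densities into a sum, which is far more stable in floating point.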
Estimator performance
As explained in our primer on estimators, we first want to know if the MLE is consistent.
Consistency
Under some regularity conditions on the density $f_{\vec{X}}$, the MLE is a consistent estimator, for instance (a simulation sketch follows this list):
- when $\theta \in \mathbb{R}^d$ and $L_{\vec{x}}(\theta)$ is concave;
- when $\theta \in \mathbb{R}$ and $L_{\vec{x}}(\theta)$ is continuously differentiable;
- when $f_{\vec{X}}(\vec{x} \mid \theta)$ is from a $k$-parameter exponential family.
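As a quick empirical check of consistency (a sketch under our own choice of model: a Bernoulli($\theta$) sample, whose log-likelihood is concave and whose MLE is the sample mean):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.3  # true parameter of a Bernoulli(theta) model

# For i.i.d. Bernoulli data the log-likelihood is concave in theta and the
# MLE is simply the sample mean; consistency says it converges to theta.
for n in [10, 100, 1_000, 10_000, 100_000]:
    x = rng.binomial(1, theta, size=n)
    print(n, x.mean())  # the estimate settles toward 0.3 as n grows
```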
Asymptotic performance
Assuming an i.i.d. sample and under sufficient regularity of the distribution $f_{\vec{X}}$, the MLE has excellent asymptotic properties:
- Theorem
- For i.i.d. samples with sufficient regularity and assuming consistency, the asymptotic distribution of the MLE is:
$$\sqrt{n}\,(\hat{\theta} - \theta) \xrightarrow{d} \mathcal{N}\!\left(0, \frac{1}{I_1(\theta)}\right)$$
where
$$I_1(\theta) = \mathrm{E}\!\left[-\frac{\mathrm{d}^2}{\mathrm{d}\theta^2}\, l_{x_1}(\theta)\right]$$
is the Fisher information.
So, for large sample sizes $n$ (as the simulation below illustrates):
- the MLE is approximately normally distributed;
- it is approximately unbiased;
- it approximately achieves the Cramér-Rao lower bound.
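Here is a simulation sketch of these three points (our own example: an Exponential model with rate $\theta$, for which $I_1(\theta) = 1/\theta^2$ and the MLE is $1/\bar{x}$):

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 2.0, 1_000, 5_000

# Exponential(rate=theta): l_x(theta) = ln(theta) - theta*x, so
# I_1(theta) = 1/theta**2 and the MLE is 1 / sample mean.
mles = np.array(
    [1 / rng.exponential(scale=1 / theta, size=n).mean() for _ in range(reps)]
)

z = np.sqrt(n) * (mles - theta)
print(z.mean())           # approximately 0: asymptotically unbiased
print(z.var(), theta**2)  # variance approaches 1 / I_1(theta) = theta**2
```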
…What else?
What are those regularity conditions?
- $\Theta$ is an open subset of $\mathbb{R}$ (so that it always makes sense for an estimator to have a symmetric distribution around $\theta$).
- The support of $f_{\vec{X}}$ is independent of $\theta$ (so that we can interchange integration and differentiation).
- $L_{\vec{x}} \in C^3$.
- $\mathrm{E}[l'_{x_i}(\theta)] = 0$ and $\operatorname{var}[l'_{x_i}(\theta)] = I_1(\theta) > 0$.
- $-\mathrm{E}[l''_{x_i}(\theta)] = I_1(\theta) > 0$.
- $\exists\, m(x) > 0$ and $\delta > 0$ such that $\mathrm{E}_\theta[m(X_i)] < \infty$ and:
$$|l'''_{x_i}(\theta')| \le m(x_i) \quad \text{for all } \theta' \in (\theta - \delta, \theta + \delta).$$
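For instance, here is a worked check of these conditions (our own example) for the Exponential model $f(x \mid \theta) = \theta e^{-\theta x}$ on $x > 0$, with $\Theta = (0, \infty)$ open and a support that does not depend on $\theta$:
$$l_x(\theta) = \ln\theta - \theta x, \qquad l'_x(\theta) = \frac{1}{\theta} - x, \qquad l''_x(\theta) = -\frac{1}{\theta^2}, \qquad l'''_x(\theta) = \frac{2}{\theta^3}.$$
Since $\mathrm{E}[X] = 1/\theta$ and $\operatorname{var}[X] = 1/\theta^2$, we get $\mathrm{E}[l'_X(\theta)] = 0$, $\operatorname{var}[l'_X(\theta)] = 1/\theta^2 = I_1(\theta) > 0$ and $-\mathrm{E}[l''_X(\theta)] = 1/\theta^2 = I_1(\theta)$. Finally, $l'''_x$ does not depend on $x$ and is bounded on any $(\theta - \delta, \theta + \delta) \subset (0, \infty)$, so a constant $m$ works as the dominating function.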
Other properties
The MLE is equivariant, which is very convenient in practice.
- Proposition: Equivariance of the MLE
- MLEs are equivariant: let $g : \Theta \to \Theta'$ be a bijection. If $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$:
$$\widehat{g(\theta)} = g(\hat{\theta})$$
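A quick numerical check of equivariance (a sketch with our own choice of model: an Exponential distribution with rate $\theta$ and the bijection $g(\theta) = 1/\theta$, mapping the rate to the mean):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=0.5, size=2_000)  # i.i.d. Exponential with rate theta = 2

theta_hat = 1 / x.mean()  # MLE of the rate: 1 / sample mean
mu_hat = 1 / theta_hat    # equivariance: MLE of the mean g(theta) = 1/theta

# Maximizing the likelihood directly in the mean parametrization gives the
# sample mean as MLE: the same value, exactly as equivariance promises.
print(mu_hat, x.mean())
```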