
Introduction to statistical estimators

Nov 13, 2018

In this article we define what an estimator is. We focus on the theory used to compare and assess estimators, rather than on how to find one.

Note: estimators are statistics, so I suggest you read our dedicated article on statistics first.

Context

In a typical inference situation, we have at our disposal a sample of n observations:

$$x = (x_1, \dots, x_n)$$

We model this sample as observations of a random variable $X = (X_1, \dots, X_n)$ whose source is some probability distribution $F(X \mid \theta)$ that depends on some unknown parameter θ.
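
As a concrete instance of this setup (chosen here purely for illustration), the sample could consist of n independent coin flips:

$$X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} \text{Bernoulli}(\theta), \qquad \theta \in [0, 1],$$

where the unknown parameter θ is the probability of heads that we want to recover from the observed flips.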

Point estimators

The purpose of an estimator $\hat{\theta}(X)$ is to use the observed sample to estimate the true value of θ.

Since an estimator is a function of the sample, it is a statistic.

Definition: point estimator
Let Θ be the range of possible values for θ. A point estimator of θ is a statistic $\hat{\theta}$ taking values in Θ:
$$\forall x \in \mathcal{X}, \quad \hat{\theta}(x) \in \Theta$$
where $\mathcal{X}$ denotes the set of possible samples.

Don’t confuse the notations: θ is a fixed value, while $\hat{\theta} = \hat{\theta}(X)$ is a random variable and $\hat{\theta}(x)$ is an observation of this random variable.

Consistency

This definition is very broad, and clearly not every estimator is interesting. Let’s narrow it down.

Definition: consistent estimator
A point estimator $\hat{\theta}$ of θ is consistent if it converges (in probability) to θ as the sample size n increases:
$$\hat{\theta}(X_1, \dots, X_n) \xrightarrow[n \to \infty]{\mathbb{P}} \theta$$
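
As a quick illustration (a sketch added here, not from the original article), the following simulation draws Bernoulli samples of increasing size and shows the sample mean, a consistent estimator of the success probability, settling around the true value:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3  # true parameter, unknown in a real inference problem

# The sample mean is a consistent estimator of theta:
# its estimates concentrate around theta as n grows (law of large numbers).
for n in [10, 100, 10_000, 1_000_000]:
    x = rng.binomial(1, theta, size=n)   # a sample of n Bernoulli(theta) observations
    theta_hat = x.mean()                 # point estimate computed from the sample
    print(f"n = {n:>9,}: theta_hat = {theta_hat:.4f}")
```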

Precision of an estimator

To measure the precision of an estimator, we can use the mean squared error:

Definition: mean squared error
The mean squared error of an estimator is the expected squared distance between the estimator and the true value of the parameter:
$$\mathrm{MSE}(\hat{\theta}, \theta) = \mathbb{E}_X\left[\big\|\hat{\theta}(X) - \theta\big\|_2^2\right]$$

This quantity can be used to bound the concentration of $\hat{\theta}$ around the true value θ:

$$\mathbb{P}\left[\big\|\hat{\theta} - \theta\big\|_2 > \epsilon\right] \le \frac{\mathrm{MSE}(\hat{\theta}, \theta)}{\epsilon^2}$$
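
This bound is nothing more than Markov’s inequality applied to the non-negative random variable $\|\hat{\theta} - \theta\|_2^2$:

$$\mathbb{P}\left[\big\|\hat{\theta} - \theta\big\|_2 > \epsilon\right]
= \mathbb{P}\left[\big\|\hat{\theta} - \theta\big\|_2^2 > \epsilon^2\right]
\le \frac{\mathbb{E}\left[\big\|\hat{\theta} - \theta\big\|_2^2\right]}{\epsilon^2}
= \frac{\mathrm{MSE}(\hat{\theta}, \theta)}{\epsilon^2}$$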

If $\mathrm{MSE}(\hat{\theta}, \theta)$ converges to 0 as n increases, the estimator is consistent. The converse is false: there exist consistent estimators whose MSE does not converge to 0.

So, how small can we make the MSE? Before we answer this question, it will be useful to introduce the bias-variance decomposition.

Definition: bias-variance decomposition
The bias-variance decomposition expresses the MSE in terms of the bias and the variance of the estimator:
$$\mathrm{MSE}(\hat{\theta}, \theta) = \underbrace{\big\|\mathbb{E}[\hat{\theta}] - \theta\big\|_2^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\left[\big\|\hat{\theta} - \mathbb{E}[\hat{\theta}]\big\|_2^2\right]}_{\text{variance}}$$
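
To see where the decomposition comes from, write $\hat{\theta} - \theta = (\hat{\theta} - \mathbb{E}[\hat{\theta}]) + (\mathbb{E}[\hat{\theta}] - \theta)$ and expand the squared norm:

$$\mathbb{E}\left[\big\|\hat{\theta} - \theta\big\|_2^2\right]
= \mathbb{E}\left[\big\|\hat{\theta} - \mathbb{E}[\hat{\theta}]\big\|_2^2\right]
+ 2\,\mathbb{E}\left[\hat{\theta} - \mathbb{E}[\hat{\theta}]\right]^{\top}\left(\mathbb{E}[\hat{\theta}] - \theta\right)
+ \big\|\mathbb{E}[\hat{\theta}] - \theta\big\|_2^2$$

The cross term vanishes since $\mathbb{E}\left[\hat{\theta} - \mathbb{E}[\hat{\theta}]\right] = 0$, leaving exactly the variance and squared-bias terms.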

This decomposition explains why unbiased estimators are so popular. Let’s turn our attention to such estimators.

Bias

Definition: unbiased estimator
An estimator $\hat{\theta}(X)$ is unbiased when:
$$\mathbb{E}_X\left[\hat{\theta}(X)\right] = \theta$$

Although unbiased estimators are convenient, always remember that a biased low-variance estimator can be preferable to an unbiased high-variance one. Moreover, a biased estimator can still be consistent, provided its bias vanishes as n increases.
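
A classical illustration (added here as an example, not taken from the original) is the estimation of the variance σ² of a distribution from an i.i.d. sample: the empirical variance with a 1/n factor is biased, while the 1/(n−1) version is not:

$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2\right] = \frac{n-1}{n}\,\sigma^2,
\qquad
\mathbb{E}\left[\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2\right] = \sigma^2$$

The bias of the first estimator, $-\sigma^2/n$, vanishes as n grows, so it is nonetheless consistent.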

What about the variance term: can we make it as small as we want?

Variance

We do have a lower bound on the variance of unbiased estimators:

Cramér-Rao lower bound
Given some regularity conditions, any unbiased estimator $\hat{\theta}(X)$ of finite variance satisfies:
$$\mathrm{var}\left[\hat{\theta}\right] \ge \frac{1}{I_n(\theta)}$$

where $I_n(\theta)$ is the Fisher information of the sample.

Can we achieve this bound?

Proposition
$\mathrm{var}[\hat{\theta}(X)]$ attains the Cramér-Rao lower bound if and only if the density of X belongs to a one-parameter exponential family with sufficient statistic $\hat{\theta}$.
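
For instance (an example added here for illustration), for an i.i.d. sample $X_1, \dots, X_n \sim \mathcal{N}(\theta, \sigma^2)$ with known σ², a one-parameter exponential family, the Fisher information is $I_n(\theta) = n/\sigma^2$ and the sample mean attains the bound:

$$\mathrm{var}\left[\bar{X}\right] = \frac{\sigma^2}{n} = \frac{1}{I_n(\theta)}$$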

And if we can’t achieve it, how can we improve our estimator? The following theorem tells us that in order to reduce the variance of our estimator, we should throw away irrelevant aspects of the data.

Rao-Blackwell theorem
Let $\hat{\theta}$ be an unbiased estimator of θ with finite variance, and let T = T(X) be a sufficient statistic for θ. Then $\hat{\theta}^* = \mathbb{E}\left[\hat{\theta} \mid T\right]$ is also an unbiased estimator of θ, and:
$$\mathrm{var}\left[\hat{\theta}^*\right] \le \mathrm{var}\left[\hat{\theta}\right]$$

Equality is attained when $\mathbb{P}\left[\hat{\theta}^* = \hat{\theta}\right] = 1$.
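
To make this concrete (an example added here, not in the original), take $X_1, \dots, X_n$ i.i.d. Bernoulli(θ). The estimator $\hat{\theta} = X_1$ is unbiased but very noisy; conditioning on the sufficient statistic $T = \sum_i X_i$ turns it into the sample mean and divides its variance by n:

$$\hat{\theta}^* = \mathbb{E}\left[X_1 \,\middle|\, \sum_{i=1}^{n} X_i\right] = \bar{X},
\qquad
\mathrm{var}\left[\bar{X}\right] = \frac{\theta(1-\theta)}{n} \le \theta(1-\theta) = \mathrm{var}\left[X_1\right]$$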

Recall that a statistic T contains less information than a statistic S when there exists a function g such that $T = g(S)$.

The following theorem tells us that the more we throw away irrelevant information, the lower the variance of our estimator:

Let $\hat{\theta}$ be an unbiased estimator, and let T and S be two sufficient statistics for θ. If there exists a function g such that $T = g(S)$, then:

$$\mathrm{var}\left[\mathbb{E}\left[\hat{\theta} \mid T\right]\right] \le \mathrm{var}\left[\mathbb{E}\left[\hat{\theta} \mid S\right]\right]$$

So the best we can do is use a minimal sufficient statistic.

Estimators in practice

Common estimators are:

  • the maximum likelihood estimator, which maximizes $f_X(x \mid \hat{\theta})$ (see the sketch after this list);
  • the maximum a posteriori estimator, which maximizes $f_{\theta}(\hat{\theta} \mid X = x)$;
  • the method of moments estimator, which equates the theoretical mean $\mathbb{E}[X]$ with the empirical mean $\bar{X}$ and solves for θ.
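
As a small end-to-end sketch (the exponential model below is a hypothetical choice made only for illustration), the maximum likelihood estimate of a rate parameter can be computed numerically or, for this particular model, in closed form, where it coincides with the method of moments estimator:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
lam = 2.0                                        # true rate parameter (the role of theta)
x = rng.exponential(scale=1 / lam, size=5_000)   # observed sample

# Negative log-likelihood of an Exponential(rate) model for the sample x.
def neg_log_likelihood(rate):
    return -np.sum(np.log(rate) - rate * x)

# Numerical maximum likelihood estimate.
res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
print("numerical MLE:  ", res.x)

# For this model the MLE has a closed form, 1 / sample mean, which coincides
# with the method of moments estimator since E[X] = 1 / rate.
print("closed-form MLE:", 1 / x.mean())
```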