
Introduction to statistical estimators

Nov 13, 2018

In this article we define what an estimator is. We focus on the theory used to compare and assess estimators, rather than on how to find one.

Note: estimators are statistics, so I suggest you read our dedicated article on statistics first.

Context

In a typical inference situation, we have at our disposal a sample of n observations:

$$x = (x_1, \dots, x_n)$$

We model this sample as observations of a random variable $X = (X_1, \dots, X_n)$ whose source is some probability distribution $F(X \mid \theta)$ that depends on some unknown parameter θ.
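
As a concrete instance of this setup (chosen here purely for illustration), the sample could consist of n independent coin flips:

$$X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} \text{Bernoulli}(\theta), \qquad \theta \in [0, 1],$$

where the unknown parameter θ is the probability of heads that we want to recover from the observed flips.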

Point estimators

The purpose of an estimator $\hat{\theta}(X)$ is to use the observed sample to estimate the true value of θ.

Since an estimator is a function of the sample, it is a statistic.

Definition: point estimator
Let Θ be the range of possible values for θ. A point estimator of θ is a statistic $\hat{\theta}$ taking values in Θ:
$$\forall x \in \mathcal{X}, \quad \hat{\theta}(x) \in \Theta$$
where $\mathcal{X}$ denotes the set of possible samples.

Don’t confuse the notations: θ is a fixed value, while $\hat{\theta} = \hat{\theta}(X)$ is a random variable and $\hat{\theta}(x)$ is an observation of this random variable.

Consistency

This definition is very broad, and clearly not every estimator is interesting. Let’s narrow it down.

Definition: consistent estimator
A point estimator $\hat{\theta}$ of θ is consistent if it converges (in probability) to θ as the sample size n increases:
$$\hat{\theta}(X_1, \dots, X_n) \xrightarrow[n \to \infty]{\mathbb{P}} \theta$$
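
As a quick illustration (a sketch added here, not from the original article), the following simulation draws Bernoulli samples of increasing size and shows the sample mean, a consistent estimator of the success probability, settling around the true value:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3  # true parameter, unknown in a real inference problem

# The sample mean is a consistent estimator of theta:
# its estimates concentrate around theta as n grows (law of large numbers).
for n in [10, 100, 10_000, 1_000_000]:
    x = rng.binomial(1, theta, size=n)   # a sample of n Bernoulli(theta) observations
    theta_hat = x.mean()                 # point estimate computed from the sample
    print(f"n = {n:>9,}: theta_hat = {theta_hat:.4f}")
```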

Precision of an estimator

To measure the precision of an estimator, we can use the mean squared error:

Definition: mean squared error
The mean squared error of an estimator is the expected squared distance between the estimator and the true value of the parameter:
$$\mathrm{MSE}(\hat{\theta}, \theta) = \mathbb{E}_X\left[\big\|\hat{\theta}(X) - \theta\big\|_2^2\right]$$

This quantity can be used to bound the concentration of $\hat{\theta}$ around the true value θ:

$$\mathbb{P}\left[\big\|\hat{\theta} - \theta\big\|_2 > \epsilon\right] \le \frac{\mathrm{MSE}(\hat{\theta}, \theta)}{\epsilon^2}$$
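
This bound is nothing more than Markov’s inequality applied to the non-negative random variable $\|\hat{\theta} - \theta\|_2^2$:

$$\mathbb{P}\left[\big\|\hat{\theta} - \theta\big\|_2 > \epsilon\right]
= \mathbb{P}\left[\big\|\hat{\theta} - \theta\big\|_2^2 > \epsilon^2\right]
\le \frac{\mathbb{E}\left[\big\|\hat{\theta} - \theta\big\|_2^2\right]}{\epsilon^2}
= \frac{\mathrm{MSE}(\hat{\theta}, \theta)}{\epsilon^2}$$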

If $\mathrm{MSE}(\hat{\theta}, \theta)$ converges to 0 as n increases, the estimator is consistent. The converse is false: there exist consistent estimators whose MSE does not converge to 0.

So, how small can we make the MSE? Before we answer this question, it will be useful to introduce the bias-variance decomposition.

Definition: bias-variance decomposition
The bias-variance decomposition expresses the MSE in terms of the bias and the variance of the estimator:
$$\mathrm{MSE}(\hat{\theta}, \theta) = \underbrace{\big\|\mathbb{E}[\hat{\theta}] - \theta\big\|_2^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\left[\big\|\hat{\theta} - \mathbb{E}[\hat{\theta}]\big\|_2^2\right]}_{\text{variance}}$$
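
To see where the decomposition comes from, write $\hat{\theta} - \theta = (\hat{\theta} - \mathbb{E}[\hat{\theta}]) + (\mathbb{E}[\hat{\theta}] - \theta)$ and expand the squared norm:

$$\mathbb{E}\left[\big\|\hat{\theta} - \theta\big\|_2^2\right]
= \mathbb{E}\left[\big\|\hat{\theta} - \mathbb{E}[\hat{\theta}]\big\|_2^2\right]
+ 2\,\mathbb{E}\left[\hat{\theta} - \mathbb{E}[\hat{\theta}]\right]^{\top}\left(\mathbb{E}[\hat{\theta}] - \theta\right)
+ \big\|\mathbb{E}[\hat{\theta}] - \theta\big\|_2^2$$

The cross term vanishes since $\mathbb{E}\left[\hat{\theta} - \mathbb{E}[\hat{\theta}]\right] = 0$, leaving exactly the variance and squared-bias terms.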

This decomposition explains why unbiased estimators are so popular. Let’s turn our attention to such estimators.

Bias

Definition: unbiased estimator
An estimator $\hat{\theta}(X)$ is unbiased when:
$$\mathbb{E}_X\left[\hat{\theta}(X)\right] = \theta$$

Although unbiased estimators are convenient, always remember that a biased low-variance estimator can be preferable to an unbiased high-variance one. Moreover, a biased estimator can still be consistent, provided its bias vanishes as n increases.
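
A classical illustration (added here as an example, not taken from the original) is the estimation of the variance σ² of a distribution from an i.i.d. sample: the empirical variance with a 1/n factor is biased, while the 1/(n−1) version is not:

$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2\right] = \frac{n-1}{n}\,\sigma^2,
\qquad
\mathbb{E}\left[\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2\right] = \sigma^2$$

The bias of the first estimator, $-\sigma^2/n$, vanishes as n grows, so it is nonetheless consistent.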

What about the variance term: can we make it as small as we want?

Variance

We do have a lower bound on the variance of unbiased estimators:

Cramér-Rao lower bound
Given some regularity conditions, any unbiased estimator $\hat{\theta}(X)$ of finite variance satisfies:
$$\mathrm{var}\left[\hat{\theta}\right] \ge \frac{1}{I_n(\theta)}$$

where $I_n(\theta)$ is the Fisher information of the sample.

Can we achieve this bound?

Proposition
$\mathrm{var}[\hat{\theta}(X)]$ attains the Cramér-Rao lower bound if and only if the density of X belongs to a one-parameter exponential family with sufficient statistic $\hat{\theta}$.
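
For instance (an example added here for illustration), for an i.i.d. sample $X_1, \dots, X_n \sim \mathcal{N}(\theta, \sigma^2)$ with known σ², a one-parameter exponential family, the Fisher information is $I_n(\theta) = n/\sigma^2$ and the sample mean attains the bound:

$$\mathrm{var}\left[\bar{X}\right] = \frac{\sigma^2}{n} = \frac{1}{I_n(\theta)}$$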

And if we can’t achieve it, how can we improve our estimator? The following theorem tells us that in order to reduce the variance of our estimator, we should throw away irrelevant aspects of the data.

Rao-Blackwell theorem
Let $\hat{\theta}$ be an unbiased estimator of θ with finite variance, and let T = T(X) be a sufficient statistic for θ. Then $\hat{\theta}^* = \mathbb{E}\left[\hat{\theta} \mid T\right]$ is also an unbiased estimator of θ, and:
$$\mathrm{var}\left[\hat{\theta}^*\right] \le \mathrm{var}\left[\hat{\theta}\right]$$

Equality is attained when $\mathbb{P}\left[\hat{\theta}^* = \hat{\theta}\right] = 1$.
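
To make this concrete (an example added here, not in the original), take $X_1, \dots, X_n$ i.i.d. Bernoulli(θ). The estimator $\hat{\theta} = X_1$ is unbiased but very noisy; conditioning on the sufficient statistic $T = \sum_i X_i$ turns it into the sample mean and divides its variance by n:

$$\hat{\theta}^* = \mathbb{E}\left[X_1 \,\middle|\, \sum_{i=1}^{n} X_i\right] = \bar{X},
\qquad
\mathrm{var}\left[\bar{X}\right] = \frac{\theta(1-\theta)}{n} \le \theta(1-\theta) = \mathrm{var}\left[X_1\right]$$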

Recall that a statistic T contains less information than a statistic S when there exists a function g such that $T = g(S)$.

The following theorem tells us that the more we throw away irrelevant information, the lower the variance of our estimator:

Let $\hat{\theta}$ be an unbiased estimator, and let T and S be two sufficient statistics for θ. If there exists a function g such that $T = g(S)$, then:

$$\mathrm{var}\left[\mathbb{E}\left[\hat{\theta} \mid T\right]\right] \le \mathrm{var}\left[\mathbb{E}\left[\hat{\theta} \mid S\right]\right]$$

So the best we can do is use a minimal sufficient statistic.

Estimators in practice

Common estimators are:

  • the maximum likelihood estimator, which maximizes $f_X(x \mid \hat{\theta})$ (see the sketch after this list);
  • the maximum a posteriori estimator, which maximizes $f_{\theta}(\hat{\theta} \mid X = x)$;
  • the method of moments estimator, which equates the theoretical mean $\mathbb{E}[X]$ with the empirical mean $\bar{X}$ and solves for θ.
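
As a small end-to-end sketch (the exponential model below is a hypothetical choice made only for illustration), the maximum likelihood estimate of a rate parameter can be computed numerically or, for this particular model, in closed form, where it coincides with the method of moments estimator:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
lam = 2.0                                        # true rate parameter (the role of theta)
x = rng.exponential(scale=1 / lam, size=5_000)   # observed sample

# Negative log-likelihood of an Exponential(rate) model for the sample x.
def neg_log_likelihood(rate):
    return -np.sum(np.log(rate) - rate * x)

# Numerical maximum likelihood estimate.
res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
print("numerical MLE:  ", res.x)

# For this model the MLE has a closed form, 1 / sample mean, which coincides
# with the method of moments estimator since E[X] = 1 / rate.
print("closed-form MLE:", 1 / x.mean())
```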