In this article, we explain that a statistic is a way of compressing information contained in the data, and we show how it can be used for inference.
Let $\vec{Y}=(Y_1,\dots,Y_n)$ be a random vector. Suppose the joint distribution of $\vec{Y}$ is $F(\vec{y};\theta)$ for some unknown parameter $\theta\in\Theta$.
We observe a sample $\vec{y}=(y_1,\dots,y_n)$ drawn from $\vec{Y}$. What conclusions about $\theta$ can we make on the sole basis of our observations $\vec{y}$? And what is the uncertainty associated with these conclusions?
We will study the sample $\vec{y}$ through numerical summaries $T(\vec{y})$. Such a summary is called a statistic.
- Definition: statistic
- A statistic is any function $T$ of the sample that does not depend on the unknown parameters $\theta$. For example, the sample average $\operatorname{avg}(\vec{y})=\frac{1}{n}\sum_{i=1}^n y_i$ is a statistic.
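As a small illustration (a Python sketch added here; the sample values are arbitrary), a statistic is simply a deterministic function of the observed sample:

```python
import numpy as np

def sample_average(y):
    """A statistic T: a function of the sample only, with no reference to theta."""
    y = np.asarray(y, dtype=float)
    return y.sum() / len(y)

y_obs = [0.2, 1.5, -0.3, 0.9]   # a fixed observed sample y
print(sample_average(y_obs))    # T(y), a fixed number: 0.575
```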
To understand how useful a given statistic $T$ is, we need to understand its behavior when the parameter $\theta$ changes. While $T(\vec{y})$ is a fixed number associated with the fixed observation $\vec{y}$, $T(\vec{Y})$ is a random variable. To understand how $T$ behaves when $\theta$ changes, we therefore need to study this random variable.
- Definition: sampling distribution
- The sampling distribution of $T$ under the distribution $F(\vec{y};\theta)$ of $\vec{Y}$ is the distribution of the random variable $T(\vec{Y})$.
The key observation here is that the sampling distribution of T depends on the unknown parameter θ. The more it depends on θ, the more information T conveys about it.
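To make this concrete, here is a small simulation sketch (the Bernoulli model, sample size, and number of replications are illustrative choices, not prescribed by the text): for each value of $\theta$ we repeatedly draw a sample and compute $T(\vec{Y})=\sum_i Y_i$; the empirical sampling distribution clearly shifts with $\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_rep = 20, 10_000

def empirical_sampling_distribution(theta):
    """Draw n_rep samples of size n from Bernoulli(theta) and apply T = sum to each."""
    samples = rng.binomial(1, theta, size=(n_rep, n))
    return samples.sum(axis=1)

for theta in (0.3, 0.7):
    t = empirical_sampling_distribution(theta)
    print(f"theta={theta}: T has mean {t.mean():.2f} and std {t.std():.2f}")
```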
The result $T(\vec{Y})$ of a deterministic transformation $T$ applied to $\vec{Y}$ cannot convey more information than $\vec{Y}$ itself, so it is a form of compression. How much can we compress the sample without losing interesting information about $\theta$?
Let’s define a name for statistics that carry no information about the parameter.
- Definition: ancillary statistic
- A statistic T is ancillary for the parameter θ if its sampling distribution does not functionally depend on θ. Consequence: such statistics carry no information about θ.
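For example (an illustrative model that is not part of the original text): if $Y_1,\dots,Y_n$ are i.i.d. $\mathcal{N}(\theta,1)$, then $T(\vec{Y})=Y_1-Y_2\sim\mathcal{N}(0,2)$ whatever $\theta$ is, so $T$ is ancillary for $\theta$. A quick simulation sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n_rep = 100_000

def difference_statistic(theta):
    """T(Y) = Y1 - Y2 for Y1, Y2 i.i.d. N(theta, 1); its distribution does not involve theta."""
    y = rng.normal(theta, 1.0, size=(n_rep, 2))
    return y[:, 0] - y[:, 1]

for theta in (-5.0, 0.0, 5.0):
    t = difference_statistic(theta)
    # mean ~ 0 and std ~ sqrt(2) ~ 1.414 for every theta
    print(f"theta={theta:+.1f}: mean={t.mean():+.3f}, std={t.std():.3f}")
```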
So, what information is lost when we use $T$ to compress the sample? To answer this question, we need to understand when different samples $\vec{y}_1$ and $\vec{y}_2$ are compressed into the same value $t=T(\vec{y}_1)=T(\vec{y}_2)$.
- Definition: level set
- The level sets of $T$ are the sets $L_t=\{\vec{y}: T(\vec{y})=t\}$.
These sets are of interest because all the observations of $\vec{Y}$ that fall in a given level set $L_t$ are equivalent as far as $T$ is concerned: they all reduce to the same value $t$.
Let’s look at the distribution $F_{\vec{Y}\mid T=t}$ of $\vec{Y}$ conditional on a given level set $L_t$ of $T$.
- When $F_{\vec{Y}\mid T=t}$ changes depending on $\theta$, we are losing the information conveyed by this dependence.
- When $F_{\vec{Y}\mid T=t}$ is functionally independent of $\theta$, then $\vec{Y}$ contains no information about $\theta$ on the set $L_t$, and we are not losing any information on this set.
- If this is true for all possible values $t$ of $T(\vec{y})$, then our statistic contains the same information about $\theta$ as $\vec{y}$ itself does. In other words, knowing the exact value of $\vec{y}$ does not convey more information than knowing $T(\vec{y})$. Let’s define a name for this.
- Definition: sufficient statistic
- A statistic $T$ is said to be sufficient for the parameter $\theta$ if $F_{\vec{Y}\mid T(\vec{Y})=t}$ does not depend on $\theta$ for any value $t$.
- Example: coin tossing
- We model $n$ tosses of a biased coin using an i.i.d. sample from the $\mathrm{Bernoulli}(\theta)$ distribution, where the probability $\theta$ of obtaining heads is unknown. Let $T(\vec{y})=\sum_{i=1}^n y_i$ be the number of heads among the $n$ tosses.
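To check the definition directly (a short derivation spelled out here for completeness), take any binary sequence $\vec{y}$ with exactly $t$ heads; then

$$
P_\theta\bigl(\vec{Y}=\vec{y}\mid T(\vec{Y})=t\bigr)
=\frac{P_\theta(\vec{Y}=\vec{y})}{P_\theta(T(\vec{Y})=t)}
=\frac{\theta^{t}(1-\theta)^{n-t}}{\binom{n}{t}\,\theta^{t}(1-\theta)^{n-t}}
=\frac{1}{\binom{n}{t}},
$$

which does not depend on $\theta$.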
And we see that $T$ is sufficient for $\theta$: knowing which tosses came up heads is irrelevant when inferring the probability of heads; only the number of observed heads matters.
While sufficient statistics are incredibly useful, the definition is hard to verify in practice. The Fisher-Neyman factorization theorem provides an easier way to identify sufficient statistics.
- Fisher-Neyman factorization theorem
- Let $\vec{Y}$ be a random vector with joint density function $f(\vec{y};\theta)$. A statistic $T$ is sufficient for $\theta$ if and only if there exist functions $g$ and $h$ such that $f(\vec{y};\theta)=g(T(\vec{y});\theta)\,h(\vec{y})$.
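As an illustration (a standard application of the theorem, worked out here; it also anticipates the Gaussian example below), for an i.i.d. $\mathcal{N}(\mu,\sigma^2)$ sample the joint density can be written as

$$
f(\vec{y};\mu,\sigma^2)
=(2\pi\sigma^2)^{-n/2}
\exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n y_i^2+\frac{\mu}{\sigma^2}\sum_{i=1}^n y_i-\frac{n\mu^2}{2\sigma^2}\right),
$$

so with $T(\vec{y})=\bigl(\sum_i y_i,\ \sum_i y_i^2\bigr)$ we can take the whole right-hand side as $g(T(\vec{y});\mu,\sigma^2)$ and $h(\vec{y})=1$. This pair, and hence $(\operatorname{avg}(\vec{y}),S^2(\vec{y}))$, which is a one-to-one function of it, is sufficient for $(\mu,\sigma^2)$.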
So, sufficient statistics compress the data without losing information about the parameter $\theta$ of interest. Still, a sufficient statistic might retain more detail than necessary. How much can we compress?
- Definition: minimally sufficient statistic
- A statistic $T$ is said to be minimally sufficient for the parameter $\theta$ if it is sufficient for $\theta$ and, for any other sufficient statistic $S$, there exists a function $g(\cdot)$ such that $T(\vec{y})=g(S(\vec{y}))$.
Since the deterministic function $g$ can only reduce the amount of conveyed information, never increase it, we see that $T$ is the sufficient statistic that contains the least information.
So, statistics compress the sample and contain information about the unknown parameter. How do we retrieve this parameter? We use a point estimator.
Let’s see an example.
Gaussian Sufficient Statistics
Let $\vec{Y}\overset{\text{i.i.d.}}{\sim}\mathcal{N}(\mu,\sigma^2)$ be a sample of size $n$. Define the following statistics:
$$
\operatorname{avg}(\vec{Y})=\frac{1}{n}\sum_{i=1}^n Y_i,
\qquad
S^2(\vec{Y})=\frac{1}{n-1}\sum_{i=1}^n \bigl(Y_i-\operatorname{avg}(\vec{Y})\bigr)^2
$$

The pair $(\operatorname{avg}(\vec{Y}),S^2(\vec{Y}))$ is minimally sufficient for $(\mu,\sigma^2)$ and we have:

$$
\operatorname{avg}(\vec{Y})\sim\mathcal{N}(\mu,\sigma^2/n),
\qquad
\frac{n-1}{\sigma^2}\,S^2(\vec{Y})\sim\chi^2_{n-1}
$$

Using convergence results, we can conclude that as the sample size $n$ increases, $\operatorname{avg}(\vec{y})$ converges to $\mu$ at the rate $O(1/\sqrt{n})$. Likewise, $S^2(\vec{y})$ converges to $\sigma^2$:

$$
\operatorname{avg}(\vec{Y})\to\mu,
\qquad
S^2(\vec{Y})\to\sigma^2
$$
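A quick way to check these sampling distributions numerically (a simulation sketch; the values of $\mu$, $\sigma$, and $n$ below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n, n_rep = 2.0, 3.0, 25, 50_000

samples = rng.normal(mu, sigma, size=(n_rep, n))
avg = samples.mean(axis=1)            # avg(Y) for each replication
s2 = samples.var(axis=1, ddof=1)      # S^2(Y) with the 1/(n-1) convention

# avg(Y) ~ N(mu, sigma^2/n): empirical mean and variance vs. theory
print(avg.mean(), mu)                 # ~ 2.0
print(avg.var(), sigma**2 / n)        # ~ 0.36
# (n-1)/sigma^2 * S^2 ~ chi^2_{n-1}: mean n-1 and variance 2(n-1)
scaled = (n - 1) / sigma**2 * s2
print(scaled.mean(), n - 1)           # ~ 24
print(scaled.var(), 2 * (n - 1))      # ~ 48
```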