In this article, we explain that a statistic is a way of compressing information contained in the data, and we show how it can be used for inference.
Let $X = (X_1, \dots, X_n)$ be a random vector. Suppose the joint distribution of $X$ is $P_\theta$ for some unknown parameter $\theta$.
We observe a sample $x = (x_1, \dots, x_n)$ drawn from $P_\theta$. What conclusions about $\theta$ can we make on the sole basis of our observations $x$? And what is the uncertainty associated with these conclusions?
We will study the sample through numerical summaries $T(x)$. Such a summary is called a statistic.
- Definition: statistic
- A statistic is any function $T$ of the sample that does not depend on the unknown parameter $\theta$. For example, the sample average $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is a statistic.
To judge how good a given statistic is, we need to understand its behavior when the parameter $\theta$ changes. While $T(x)$ is a fixed number associated with the fixed observation $x$, $T(X)$ is a random variable. To understand how the statistic behaves when $\theta$ changes, we need to study this random variable.
- Definition: sampling distribution
- The sampling distribution of $T$ under the distribution $P_\theta$ of $X$ is the distribution of the random variable $T(X)$, that is, the map $B \mapsto P_\theta(T(X) \in B)$.
The key observation here is that the sampling distribution of $T$ depends on the unknown parameter $\theta$. The more it depends on $\theta$, the more information $T$ conveys about it.
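To make this dependence concrete, here is a minimal simulation sketch. It assumes a model that is not part of the text above: i.i.d. observations from $\mathcal{N}(\theta, 1)$ with the sample average as the statistic. The function name `sample_mean_draws`, the sample size, and the seed are illustrative choices only.

```python
import random
import statistics


def sample_mean_draws(theta, n=20, reps=5000, seed=0):
    """Simulate `reps` realizations of the sample mean of n draws from N(theta, 1)."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(theta, 1.0) for _ in range(n))
        for _ in range(reps)
    ]


# Approximate the sampling distribution under two parameter values.
draws_0 = sample_mean_draws(theta=0.0)
draws_2 = sample_mean_draws(theta=2.0)

center_0 = statistics.fmean(draws_0)
center_2 = statistics.fmean(draws_2)
spread_0 = statistics.stdev(draws_0)

print("center under theta=0:", center_0)
print("center under theta=2:", center_2)
print("spread under theta=0:", spread_0)
```

In this assumed model the simulated sampling distribution is centered near the parameter value used to generate the data, which is why the sample average is informative about it.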
The result $T(X)$ of a deterministic transformation applied to $X$ cannot convey more information than $X$ itself. So it is a form of compression. How much can we compress the sample without losing interesting information about $\theta$?
Let’s define a name for statistics that carry no information about the parameter.
- Definition: ancillary statistic
- A statistic $T$ is ancillary for the parameter $\theta$ if its sampling distribution does not functionally depend on $\theta$. Consequently, such statistics carry no information about $\theta$.
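As an illustration, consider a location model that is an assumption of this sketch and not taken from the text: each observation is $\theta$ plus standard Gaussian noise. The sample range (maximum minus minimum) is then ancillary, because shifting every observation by $\theta$ leaves it unchanged. The sketch below reuses the same noise for two values of $\theta$ and checks that the simulated ranges coincide.

```python
import random


def sample_ranges(theta, n=10, reps=2000, seed=1):
    """Simulate the range (max - min) of n observations theta + Z_i, Z_i ~ N(0, 1)."""
    rng = random.Random(seed)
    ranges = []
    for _ in range(reps):
        xs = [theta + rng.gauss(0.0, 1.0) for _ in range(n)]
        ranges.append(max(xs) - min(xs))
    return ranges


# Same seed, hence same noise: only the location parameter differs.
ranges_a = sample_ranges(theta=0.0)
ranges_b = sample_ranges(theta=5.0)

# Largest discrepancy between the two simulations (floating-point rounding only).
max_gap = max(abs(a - b) for a, b in zip(ranges_a, ranges_b))
print("largest difference between the two sets of ranges:", max_gap)
```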
So, what information is lost when we use $T$ to compress the sample? To answer this question, we need to understand which different samples $x$ and $x'$ are compressed into the same value $T(x) = T(x')$.
- Definition: level set
- The level sets of $T$ are the sets $A_t = \{x : T(x) = t\}$, one for each value $t$ that $T$ can take.
These sets are of interest because all the observations of $X$ that fall in a given level set are equivalent as far as $T$ is concerned. They all reduce to the same value $t$.
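A small enumeration can make level sets tangible. The sketch below assumes a toy sample space of three binary observations and takes the sum as the statistic; the helper `level_sets` is a hypothetical name introduced for illustration.

```python
from collections import defaultdict
from itertools import product


def level_sets(statistic, sample_space):
    """Group the points of the sample space by the value the statistic assigns them."""
    groups = defaultdict(list)
    for x in sample_space:
        groups[statistic(x)].append(x)
    return dict(groups)


# All 8 possible outcomes of three binary observations.
sample_space = list(product([0, 1], repeat=3))
levels = level_sets(sum, sample_space)

for t, members in sorted(levels.items()):
    print(t, members)
```

The eight possible samples collapse into four level sets, one per possible value of the sum.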
Let’s look at the distribution of $X$ conditional on a given level set of $T$, that is, the distribution of $X$ given $T(X) = t$.
- When this conditional distribution changes depending on $\theta$, we are losing the information conveyed by this dependence.
- When it is functionally independent of $\theta$, then $X$ contains no further information about $\theta$ on the set $\{x : T(x) = t\}$ and we are not losing any information on this set.
- If this is true for all possible values $t$ of $T$, then our statistic $T(X)$ contains the same information about $\theta$ as $X$ itself does. In other words, knowing the exact value of $X$ does not convey more information than knowing $T(X)$. Let’s define a name for this.
- Definition: sufficient statistic
- A statistic $T$ is said to be sufficient for the parameter $\theta$ if the conditional distribution of $X$ given $T(X) = t$ does not depend on $\theta$.
- Example: coin tossing
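- We model $n$ tosses of a biased coin using an i.i.d. sample $X = (X_1, \dots, X_n)$ from the $\mathrm{Bernoulli}(p)$ distribution, where the probability $p$ of obtaining heads is unknown. Let $T(X) = \sum_{i=1}^{n} X_i$ be the number of heads among the $n$ tosses. For any sequence $x$ containing $t$ heads, $P_p(X = x \mid T(X) = t) = \frac{p^t (1-p)^{n-t}}{\binom{n}{t} p^t (1-p)^{n-t}} = \frac{1}{\binom{n}{t}}$, which does not involve $p$.
So $T$ is sufficient for $p$: knowing which tosses came up heads is irrelevant for learning the probability of heads. Only the number of observed heads matters.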
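The coin-tossing claim can be checked by brute force for a small case. The sketch below enumerates every sequence of 4 tosses with exactly 2 heads and computes the conditional probability of each sequence given the number of heads, using exact fractions. The function name and the two values of the heads probability are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def conditional_given_heads(p, n, t):
    """Conditional probability of each toss sequence given that it contains t heads."""
    joint = {}
    for x in product([0, 1], repeat=n):
        if sum(x) == t:
            # Probability of this exact sequence under independent tosses.
            joint[x] = p ** t * (1 - p) ** (n - t)
    total = sum(joint.values())  # probability of observing t heads
    return {x: prob / total for x, prob in joint.items()}


cond_a = conditional_given_heads(Fraction(1, 3), n=4, t=2)
cond_b = conditional_given_heads(Fraction(9, 10), n=4, t=2)

print(cond_a == cond_b)      # same conditional distribution for both coins
print(set(cond_a.values()))  # every sequence is equally likely
```

Both coins give the same uniform conditional distribution over the 6 sequences, so once the number of heads is known, the order of the tosses says nothing more about the coin.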
While sufficient statistics are incredibly useful, the definition is hard to verify in practice. The Fisher-Neyman factorization theorem provides an easier way to identify sufficient statistics.
- Fisher-Neyman factorization theorem
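- Let $X$ be a random vector with joint density function $f(x \mid \theta)$. A statistic $T$ is sufficient for $\theta$ if and only if there exist functions $g$ and $h$ such that $f(x \mid \theta) = g(T(x), \theta) \, h(x)$ for all $x$ and all $\theta$.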
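As a worked illustration, the factorization can be written out for the coin-tossing model, with $p$ the probability of heads and $T(x)$ the number of heads. Here the joint probability mass function plays the role of the density, and the labels $g$ and $h$ denote the two factors of the theorem; the choice $h(x) = 1$ is one valid option among others.

```latex
f(x \mid p)
  = \prod_{i=1}^{n} p^{x_i} (1 - p)^{1 - x_i}
  = \underbrace{p^{T(x)} (1 - p)^{\,n - T(x)}}_{g(T(x),\, p)}
    \cdot \underbrace{1}_{h(x)},
\qquad T(x) = \sum_{i=1}^{n} x_i .
```

The first factor depends on the sample only through $T(x)$ and the second does not involve $p$, so the theorem gives sufficiency of the number of heads without computing any conditional distribution.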
So, sufficient statistics compress data without information loss about the parameter of interest. Still, some sufficient statistic might contain more data than necessary. How much can we compress?
- Definition: minimally sufficient statistic
- A statistic $T$ is said to be minimally sufficient for the parameter $\theta$ if it is sufficient for $\theta$ and for any other sufficient statistic $S$ there exists a function $\varphi$ such that $T(x) = \varphi(S(x))$ for all $x$.
Since the deterministic function $\varphi$ can only reduce the amount of conveyed information and not increase it, we see that $T$ is the sufficient statistic that contains the least information.
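A toy comparison in the coin-tossing setting shows the idea. The sketch assumes two statistics on 4 tosses: the total number of heads, and the pair of head counts in each half of the sample. The pair is also sufficient by the factorization theorem, but it is a finer summary, and the total is recovered from it by addition. The function names are illustrative, and the fact that the total is minimally sufficient in this model is a standard result that the code does not prove.

```python
from itertools import product


def total_heads(x):
    """Coarse statistic: number of heads in the whole sample."""
    return sum(x)


def heads_per_half(x):
    """Finer statistic: heads in the first half and heads in the second half."""
    half = len(x) // 2
    return (sum(x[:half]), sum(x[half:]))


def phi(s):
    """Deterministic map that recovers the total from the per-half counts."""
    return s[0] + s[1]


space = list(product([0, 1], repeat=4))
total_values = {total_heads(x) for x in space}
half_values = {heads_per_half(x) for x in space}
is_function_of = all(total_heads(x) == phi(heads_per_half(x)) for x in space)

print(len(total_values), len(half_values), is_function_of)
```

The per-half statistic distinguishes 9 values where the total distinguishes only 5, so it retains detail that is not needed for learning about the coin.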
So, statistics compress the sample and contain information about the unknown parameter. How do we retrieve this parameter? We use a point estimator.
Let’s see an example.
Gaussian Sufficient Statistics
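Let $X_1, \dots, X_n$ be a sample of size $n$ from the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$. Define the following statistics:
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2.$$
The pair $(\bar{X}, S^2)$ is minimally sufficient for $(\mu, \sigma^2)$ and we have:
$$\bar{X} \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right), \qquad \frac{(n-1) S^2}{\sigma^2} \sim \chi^2_{n-1}.$$
Using convergence results, we can conclude that as the sample size $n$ increases, $\bar{X}$ converges to $\mu$ at the speed of $1/\sqrt{n}$. Likewise, $S^2$ converges to $\sigma^2$:
$$\sqrt{n}\,(\bar{X} - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2), \qquad \sqrt{n}\,(S^2 - \sigma^2) \xrightarrow{d} \mathcal{N}(0, 2\sigma^4).$$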
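These rates can be checked empirically. The following sketch simulates Gaussian samples of increasing size and records the standard deviation of the sample mean across replications together with the average of the unbiased sample variance. The parameter values (mean 1, standard deviation 2), the number of replications, and the seed are arbitrary assumptions made for the illustration.

```python
import random
import statistics

MU, SIGMA = 1.0, 2.0  # assumed true parameters for the simulation


def mean_and_variance_draws(n, reps=1000, seed=2):
    """Simulate `reps` Gaussian samples of size n; return sample means and variances."""
    rng = random.Random(seed)
    means, variances = [], []
    for _ in range(reps):
        xs = [rng.gauss(MU, SIGMA) for _ in range(n)]
        m = sum(xs) / n
        means.append(m)
        # Unbiased sample variance (divides by n - 1).
        variances.append(sum((x - m) ** 2 for x in xs) / (n - 1))
    return means, variances


results = {}
for n in (25, 100, 400):
    means, variances = mean_and_variance_draws(n)
    results[n] = (statistics.stdev(means), statistics.fmean(variances))
    # Compare the observed spread of the mean with the theoretical SIGMA / sqrt(n).
    print(n, results[n][0], SIGMA / n ** 0.5, results[n][1])
```

Multiplying the sample size by 4 should roughly halve the spread of the sample mean, while the average sample variance stays close to the true variance.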