This article explains in simple terms the purpose of statistical theory and gives an overview of how it is used.

Statistics is a branch of mathematics with two main parts: descriptive statistics and statistical inference.

Descriptive statistics can be thought of as a language used to summarize data. For instance, *mean* and *average* are words from the descriptive statistics lexicon.

Statistical inference is the clever part of statistics: when we are interested in some feature of a large population we usually can’t examine every member of the population, so we take a random sample and use this incomplete information to make reasonable guesses about the population. The whole purpose of the theory is to quantify what reasonable guesses are and under what circumstances they hold.

Statistical inference: **sample data ⟹ probability model**

Statistics is often confused with probability theory or machine learning. See my article on the topic: Differences between statistics, probability and machine learning

We may at once admit that any inference from the particular to the general must be attended with some degree of uncertainty, but this is not the same as to admit that such inference cannot be absolutely rigorous, for the nature and degree of the uncertainty may itself be capable of rigorous expression. – Ronald A. Fisher

## Non-technical overview

I will now illustrate how we can use statistical inference to gain knowledge about a population. I have removed most of the technical details from the exposition. I have also deliberately used some statistical terms, such as *sample* and *parameter*, so that this example serves as a first introduction to the statistical lexicon. You should pay attention to the mathematical symbols such as \(\bar{X}\) and \(\bar{P}\). There are only a few of them, and they are repeated often in the text to make them easier to memorize. They come in handy for more involved explanations.

To illustrate the process, suppose that you have a general population *P*. For instance, this population can be the heights of all US citizens, which is a set of numbers. Suppose we want to gain knowledge about a *population parameter*, for instance the *mean* \(\bar{P}\) of those heights.

Since we can’t measure everyone in America, we will restrict ourselves to a *random sample*, which means we randomly choose a given number of people (call this number \(n\)) and measure them. This gives us \(n\) numbers that we call the *sample*. As a shorthand, I will use the notation:

\[X = (X_1, X_2, \dots, X_n)\]

Statistics uses several approaches and tools to learn about the *population parameter* from our *sample* \(X\). There are three approaches:

- point estimates;
- confidence intervals;
- hypothesis testing.

Actually, the first two are very similar, so there are really only two approaches.

A distinction without a difference has been introduced by certain writers who distinguish “Point estimation”, meaning some process of arriving at an estimate without regard to its precision, from “Interval estimation” in which the precision of the estimate is to some extent taken into account. – R. A. Fisher (1956)

### 1. Point estimates

We can estimate the *parameter* by its statistical counterpart in the *sample*. Thus, for instance, we would estimate the *population mean* \(\bar{P}\) by the *sample mean* \(\bar{X}\). We don’t know what the real *population mean* \(\bar{P}\) is because we can’t measure everyone, but we can compute the mean \(\bar{X}\) of our *sample* \(X\).

The question is: how are the *population mean* and the *sample mean* related? *A priori*, nothing tells us that they should be equal. Maybe our *sample mean* will be very different from the total *population mean*. If you have trouble visualising this, imagine that our *random sample* happens to yield the heights of 100 US citizens aged between 2 and 5 years old. The *sample mean* would then very likely be well under the *population mean*.


Notice that if we took a number of different *samples*, each of size \(n\), from the *population*, we would get a different *sample mean* each time. So, if we collect different samples and they usually give us different results, how can we make any inference about the *population*?

The key insight is that even though the values of the statistic are likely to differ from sample to sample, they will follow a pattern. This pattern is called the **sampling distribution**. In formal terms, the *sampling distribution* of a statistic is the probability distribution of the values that the statistic can take across all possible samples. If you don’t know what a distribution is, read my dedicated article: Introduction to data distributions (to be written).
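To make the idea of a sampling distribution concrete, here is a minimal simulation sketch. It assumes a synthetic population of heights (drawn from a normal distribution with mean 170 cm and standard deviation 10 cm, values chosen purely for illustration, not real data), and the helper name `sample_mean` is hypothetical. It repeatedly draws random samples of size `n` and records the mean of each one.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population P: 100,000 synthetic heights in cm (not real data).
population = [random.gauss(170, 10) for _ in range(100_000)]

def sample_mean(population, n):
    """Draw a random sample of size n (without replacement) and return its mean."""
    return statistics.fmean(random.sample(population, n))

n = 100
# Each repetition is a different random sample, hence a different sample mean.
sample_means = [sample_mean(population, n) for _ in range(2_000)]

print("population mean:     ", round(statistics.fmean(population), 2))
print("smallest sample mean:", round(min(sample_means), 2))
print("largest sample mean: ", round(max(sample_means), 2))
```

The list `sample_means` is an empirical picture of the sampling distribution: the individual values differ from sample to sample, but they cluster around the population mean instead of landing anywhere.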

So, in order to know whether our *estimator* is a good one, we study the distribution of the *estimator* across different samples. If we took many *samples*, would their *means* average out to the true *population mean* \(\bar{P}\)? And if so, how close to the *population mean* will our *sample mean* typically be?

In statistical terms, both questions are about the *sampling distribution* of our *estimator* \(\bar{X}\):

- Where is the *sampling distribution* of \(\bar{X}\) centered?
- How does the variability of \(\bar{X}\) across samples compare to the variability in the *population*?

In the well-studied case of *means*, statistical theory answers *yes* to the first question. The *sample means* average out to the *population mean* \(\bar{P}\), or in other words: our *sampling distribution* is centered at the *population mean* \(\bar{P}\). We say that our *estimator* \(\bar{X}\) is *unbiased*.

As a side note, we could very well use a single element from the sample as our estimator, instead of the whole sample’s mean. We can show that this new estimator is *unbiased* too. But it is less interesting, because the confidence interval that we can attach to it is less precise than that of the mean estimator: it fails to take additional information into account as the size of the sample grows. See below to learn about confidence intervals.
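The comparison between the two estimators can be sketched in a short simulation. This is only an illustration under stated assumptions: the population is synthetic (normal, mean 170 cm, standard deviation 10 cm, not real data), and the sample size and number of repetitions are arbitrary choices.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population of heights in cm (synthetic, not real data).
population = [random.gauss(170, 10) for _ in range(100_000)]
true_mean = statistics.fmean(population)

n, repetitions = 50, 5_000
mean_estimates, single_estimates = [], []
for _ in range(repetitions):
    sample = random.sample(population, n)
    mean_estimates.append(statistics.fmean(sample))  # estimator 1: the sample mean
    single_estimates.append(sample[0])               # estimator 2: one element of the sample

for name, estimates in [("sample mean", mean_estimates),
                        ("single element", single_estimates)]:
    centre = statistics.fmean(estimates)   # where the sampling distribution is centered
    spread = statistics.stdev(estimates)   # how much the estimator varies across samples
    print(f"{name}: centre={centre:.2f}, spread={spread:.2f}, true mean={true_mean:.2f}")
```

Both estimators should be centered close to the true mean, which is what *unbiased* means, but the single-element estimator varies far more from sample to sample.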

Regarding the second question, theory also tells us that the bigger our *sample size*, the closer to the *population mean* our *sample mean* will tend to be. This makes sense, since as the *sample size* grows we incorporate more and more information from the *population*.
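The effect of the sample size can be checked numerically. The sketch below again assumes a synthetic population (normal, mean 170 cm, standard deviation 10 cm, not real data). For several sample sizes it measures how much the sample mean varies across samples, and prints it next to \(\sigma/\sqrt{n}\), the standard textbook value for the spread of a sample mean when the draws are (nearly) independent, where \(\sigma\) is the population standard deviation.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical population of heights in cm (synthetic, not real data).
population = [random.gauss(170, 10) for _ in range(100_000)]
sigma = statistics.pstdev(population)  # variability in the population

spreads = {}
for n in (10, 100, 1000):
    means = [statistics.fmean(random.sample(population, n)) for _ in range(1_000)]
    spreads[n] = statistics.stdev(means)  # variability of the sample mean across samples
    print(f"n={n:>4}: spread of sample means={spreads[n]:.3f}, "
          f"sigma/sqrt(n)={sigma / math.sqrt(n):.3f}")
```

The spread of the sample means should shrink as `n` grows, roughly by a factor of \(\sqrt{10}\) each time the sample size is multiplied by 10.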

So far, we know that we can estimate the *population mean* using our *sample mean*, and we know that the bigger the sample, the more precise our estimation will be. But how precise exactly?

### 2. Confidence intervals

We can use the *sampling distribution* once again to establish *confidence intervals*. A confidence interval is a statement such as:

The average height of males in the US is 175cm +/- 6.2cm

What does it mean? And how do we compute the interval?

#### What is a confidence interval?

Since *confidence intervals* are used everywhere (on the news, on the internet, etc.), we will take the time to clearly define what they mean.

If we have a *95% confidence interval* for a *population parameter*, this means that 95% of all possible *random samples* will yield data for which the interval contains the *population parameter*. The remaining 5% of the *random samples* will yield data for which it doesn’t. So, once the *random sample* is chosen, there are no probabilities: either the *sample* yields an interval that contains the parameter, or it doesn’t.

*Confidence intervals* qualify the *sampling process*. You have a 95% chance of choosing a *sample* yielding a *confidence interval* that actually contains the *parameter*, and a 5% chance of choosing a *sample* that doesn’t.

For instance, if the *95% confidence interval* is +/-6.2cm, then we have a 95% chance of choosing a *sample* such that the *mean height* \(\bar{P}\) of the total *population* satisfies:

\[\bar{X} - 6.2\,\text{cm} \leq \bar{P} \leq \bar{X} + 6.2\,\text{cm}\]

where \(\bar{X}\) is our *sample mean* as previously defined.
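This interpretation can be checked by simulation. The sketch below is an illustration under several assumptions: the population is synthetic (normal, mean 170 cm, standard deviation 10 cm, not real data), and the interval is computed with the common normal-approximation recipe, sample mean plus or minus 1.96 times the sample standard deviation divided by \(\sqrt{n}\). That recipe is one standard choice, not necessarily the one this article will end up using, and the function name `confidence_interval_95` is hypothetical.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical population of heights in cm (synthetic, not real data).
population = [random.gauss(170, 10) for _ in range(100_000)]
true_mean = statistics.fmean(population)

def confidence_interval_95(sample):
    """Normal-approximation 95% interval: mean +/- 1.96 * s / sqrt(n) (assumed recipe)."""
    centre = statistics.fmean(sample)
    half_width = 1.96 * statistics.stdev(sample) / math.sqrt(len(sample))
    return centre - half_width, centre + half_width

n, repetitions = 100, 2_000
hits = 0
for _ in range(repetitions):
    low, high = confidence_interval_95(random.sample(population, n))
    # For a given sample, the interval either contains the parameter or it does not.
    hits += low <= true_mean <= high

coverage = hits / repetitions
print(f"fraction of samples whose interval contains the true mean: {coverage:.3f}")
```

The printed fraction should be close to 0.95. The 95% describes how often the sampling process produces an interval that contains the population mean, not a probability attached to any single interval.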

#### How do we compute confidence intervals?

to be continued