Introduction to hypothesis testing

We introduce the basic vocabulary required to understand hypothesis testing and define the p-value.


Scientists accept a theory as long as a better theory hasn’t been found. Each time a theory is recognized, we have no way to determine whether it is true for sure, but at least we know it’s better than the previous theory we had.

For instance, Newton’s laws were widely accepted and used with success. It turned out that they were only approximately true, and a better theory was found in relativity. Has relativity found the true equations of nature? We don’t know, but it’s the model in use until we find a better one.

Hypothesis testing provides a tool to reject an existing theory in favor of a new candidate theory.

Rejection vs acceptance

Before diving into hypothesis testing, it’s important to understand why probability theory can be used to reject a hypothesis, but not to accept one.

Hypotheses are modeled by probability distributions. Given an observation $x$, we can ask “how probable is it that model $H$ has generated $x$?”

The answer is:

$$P(x \mid H)$$

If this probability is high, does it mean that we should accept $H$? Not necessarily, because it could be high by coincidence. Also, another hypothesis might yield a higher probability.

But if this probability is very small, we don’t need a second hypothesis to suspect that $H$ is a bad model.
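
To make this asymmetry concrete, here is a minimal sketch under an assumed model: $H$ is taken to be “the coin is fair”, and the observation is the number of heads in 100 tosses. The coin setting and the helper name `binomial_pmf` are illustrative choices, not part of the discussion above.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical model H: the coin is fair (p = 0.5).
n_tosses = 100
prob_typical = binomial_pmf(50, n_tosses, 0.5)  # the most likely count under H
prob_extreme = binomial_pmf(95, n_tosses, 0.5)  # 95 heads out of 100

print(f"P(50 heads | H) = {prob_typical:.3g}")
print(f"P(95 heads | H) = {prob_extreme:.3g}")
```

Under this assumed model, 95 heads is so improbable that we would suspect the fair-coin model without having any alternative in hand.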

The hypotheses

As always in statistics, we model all this with samples and distributions.

Let $X = (X_1, \dots, X_n)$ be a sample of random variables. Model the source of $X$ as the distribution $P_\theta$ where $\theta \in \Theta$ is an unknown parameter.

We model the existing theory with a subset $\Theta_0 \subset \Theta$ and the candidate theory with another disjoint subset $\Theta_1 \subset \Theta$. The hypotheses are:

  $H_0: \theta \in \Theta_0$  We keep the current theory
  $H_1: \theta \in \Theta_1$  The new theory is better

Given an observed sample $x$ from $X$, which region between $\Theta_0$ and $\Theta_1$ is more plausible to contain the true value of the parameter?

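How to decide between $H_0$ and $H_1$?

To decide whether we reject the old theory, we use a test function:

$$\psi : (x_1, \dots, x_n) \mapsto \psi(x) \in \{0, 1\}$$

And we keep $H_0$ when $\psi(x) = 0$ or we reject $H_0$ and prefer $H_1$ when $\psi(x) = 1$.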
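
As an illustration, a test function can be as simple as a threshold on the sample mean. The sketch below assumes a hypothetical coin-tossing setup (1 for heads, 0 for tails) in which the old theory says the coin is fair and the new theory says it favors heads. The names `make_threshold_test` and `psi`, and the threshold 0.6, are arbitrary choices made for the example.

```python
from typing import Callable, Sequence


def make_threshold_test(threshold: float) -> Callable[[Sequence[float]], int]:
    """Build a test function that returns 1 (reject the old theory) when the
    sample mean exceeds `threshold`, and 0 (keep it) otherwise."""

    def psi(sample: Sequence[float]) -> int:
        mean = sum(sample) / len(sample)
        return 1 if mean > threshold else 0

    return psi


# Hypothetical threshold, chosen only for illustration.
psi = make_threshold_test(threshold=0.6)

print(psi([1, 0, 1, 0, 0, 1, 0, 1, 0, 0]))  # mean 0.4 -> 0, keep the old theory
print(psi([1, 1, 1, 0, 1, 1, 1, 1, 0, 1]))  # mean 0.8 -> 1, reject the old theory
```

Changing the threshold gives a different test function, which is why a criterion for choosing among them is needed.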

There exist numerous such test functions, just as there exist numerous estimators. Rather than diving into the details now, let’s discuss how to choose one.

Quantifying errors

Since we don’t have all the possible observations from the source but only a sample, we might make mistakes in deciding between $H_0$ and $H_1$. And our decision might change if we collect more data.

There are two types of mistakes:

  • type I: decide in favor of $H_1$ when $H_0$ is better;
  • type II: decide in favor of $H_0$ when $H_1$ is better.

                 $H_0$ better    $H_1$ better
  Choose $H_0$   no error        Type II error
  Choose $H_1$   Type I error    no error
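
The two kinds of error can be computed exactly in a toy setting. The sketch below assumes 20 coin tosses, an old theory with heads probability 0.5, a new theory with heads probability 0.7, and the decision rule “choose the new theory when the number of heads is at least `c`”. All of these numbers and names are hypothetical and serve only to illustrate the two error types.

```python
from math import comb


def prob_at_least(n: int, c: int, p: float) -> float:
    """P(S >= c) for S ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))


def error_probabilities(n: int, c: int, p_old: float, p_new: float) -> tuple[float, float]:
    """Error probabilities of the rule 'choose the new theory when S >= c'.

    p_old and p_new are hypothetical heads probabilities standing for the
    old and the new theory.
    """
    type_one = prob_at_least(n, c, p_old)        # new theory chosen, old one was right
    type_two = 1.0 - prob_at_least(n, c, p_new)  # old theory kept, new one was right
    return type_one, type_two


for c in (12, 15, 18):
    t1, t2 = error_probabilities(n=20, c=c, p_old=0.5, p_new=0.7)
    print(f"c = {c}: type I = {t1:.4f}, type II = {t2:.4f}")
```

In this setup, raising the cutoff `c` lowers the first kind of error and raises the second, so the two cannot both be made small with a fixed sample size.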

In practice, one type of error is more costly than the other.

For instance, if we decide in favor of $H_1$ when in fact $H_0$ is better (type I error), this means we choose the new theory when we should have kept the old one.

  • This is very costly because every textbook will be updated with the new theory, only to discover a few years later that we should switch back to the old one.
  • On the other hand, if we decide to keep the old theory when $H_1$ is better (type II error), then there is no immediate cost and we can always re-evaluate the new theory when we have more data.

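So we fix a significance level $\alpha$ to bound the probability of type I errors:

$$P(\text{type I error}) \le \alpha$$

And we only consider the test functions that can guarantee the above threshold is respected.

In terms of the test function $\psi$, the probability of type I error is written:

$$P_\theta(\psi(X) = 1) \quad \text{for } \theta \in \Theta_0$$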
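
For a concrete case, the sketch below computes this probability exactly and picks a test that respects a given significance level. It assumes a hypothetical setup where the old theory is a fair coin (a single parameter value, 0.5), the sample is 20 tosses, and the test rejects when the number of heads reaches a critical count. The function names are illustrative.

```python
from math import comb


def type_one_error(n: int, critical_count: int, p0: float = 0.5) -> float:
    """P(S >= critical_count) for S ~ Binomial(n, p0): the probability of
    rejecting the old theory although it holds."""
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(critical_count, n + 1)
    )


def smallest_valid_critical_count(n: int, alpha: float, p0: float = 0.5) -> int:
    """Smallest c such that 'reject when S >= c' has type I error at most alpha."""
    c = 0
    # The error probability decreases with c and reaches 0 at c = n + 1,
    # so the loop always terminates for alpha >= 0.
    while type_one_error(n, c, p0) > alpha:
        c += 1
    return c


n, alpha = 20, 0.05
c = smallest_valid_critical_count(n, alpha)
print(f"reject when heads >= {c}, type I error = {type_one_error(n, c):.4f}")
```

With these assumed numbers the cutoff is 15 heads: a cutoff of 14 would give a type I error of about 0.058, above the 0.05 level, while 15 gives about 0.021.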

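The p-value

Let’s take a family of test functions $(\psi_\alpha)_{\alpha \in (0, 1)}$ such that $\psi_\alpha$ has significance level $\alpha$:

$$P_\theta(\psi_\alpha(X) = 1) \le \alpha \quad \text{for all } \theta \in \Theta_0$$

Given a sample $x$, each test function $\psi_\alpha$ will decide between keeping or rejecting $H_0$.

Recall that for a test function $\psi_\alpha$:

  • $H_0$ is rejected when $\psi_\alpha(x) = 1$;
  • And this is an error with probability at most $\alpha$.

The p-value is the smallest $\alpha$ such that $H_0$ is rejected:

$$p(x) = \inf \{ \alpha : \psi_\alpha(x) = 1 \}$$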

In other words, it is the smallest bound on the type I error probability at which the observed sample still leads to rejecting $H_0$.

  • When $p$ is small, $H_0$ is rejected even by tests whose probability of wrongly rejecting is tightly bounded, so we can be confident in the rejection.
  • When $p$ is large, $H_0$ is rejected only by tests that tolerate a high probability of wrongly rejecting, so we shouldn’t trust the rejection.

It is used as a measure of evidence against $H_0$:

  • a small p-value provides evidence against $H_0$;
  • a large p-value provides no evidence against $H_0$.
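
The definition of the p-value as the smallest significance level that rejects can be checked numerically. The sketch below assumes a hypothetical one-sided coin test: the old theory is a fair coin, the sample is 20 tosses, and the test at level `alpha` rejects when the number of heads reaches the smallest cutoff whose type I error is at most `alpha`. For this particular family, the p-value coincides with the probability, under the fair coin, of seeing at least as many heads as observed. The function names are illustrative.

```python
from math import comb


def tail_probability(n: int, count: int, p0: float = 0.5) -> float:
    """P(S >= count) for S ~ Binomial(n, p0)."""
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(count, n + 1))


def psi(alpha: float, n: int, observed_count: int, p0: float = 0.5) -> int:
    """Level-alpha test: return 1 (reject) when the observed count reaches the
    smallest cutoff whose type I error is at most alpha, else 0 (keep)."""
    cutoff = 0
    while tail_probability(n, cutoff, p0) > alpha:
        cutoff += 1
    return 1 if observed_count >= cutoff else 0


n, observed = 20, 16  # hypothetical data: 16 heads in 20 tosses
p_value = tail_probability(n, observed)
print(f"p-value = {p_value:.4f}")

# Levels below the p-value keep the old theory, levels above it reject.
for alpha in (0.001, 0.005, 0.01, 0.05):
    print(f"alpha = {alpha}: psi = {psi(alpha, n, observed)}")
```

Here the p-value is about 0.0059, so the tests at levels 0.001 and 0.005 keep the old theory while those at 0.01 and 0.05 reject it.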