Understanding p-values

$\def\sa{a} \def\sb{b} \def\sc{c} \def\sd{d} \def\se{e} \def\sf{f} \def\sg{g} \def\sh{h} \def\si{i} \def\sj{j} \def\sk{k} \def\sl{l} \def\sm{m} \def\sn{n} \def\so{o} \def\sp{p} \def\sq{q} \def\sr{r} \def\ss{s} \def\st{t} \def\su{u} \def\sv{v} \def\sw{w} \def\sx{x} \def\sy{y} \def\sz{z} \def\va{\vec{a}} \def\vb{\vec{b}} \def\vc{\vec{c}} \def\vd{\vec{d}} \def\ve{\vec{e}} \def\vf{\vec{f}} \def\vg{\vec{g}} \def\vh{\vec{h}} \def\vi{\vec{i}} \def\vj{\vec{j}} \def\vk{\vec{k}} \def\vl{\vec{l}} \def\vm{\vec{m}} \def\vn{\vec{n}} \def\vo{\vec{o}} \def\vp{\vec{p}} \def\vq{\vec{q}} \def\vr{\vec{r}} \def\vs{\vec{s}} \def\vt{\vec{t}} \def\vu{\vec{u}} \def\vv{\vec{v}} \def\vw{\vec{w}} \def\vx{\vec{x}} \def\vy{\vec{y}} \def\vz{\vec{z}} \def\ga{\mathfrak{A}} \def\gb{\mathfrak{B}} \def\gc{\mathfrak{C}} \def\gd{\mathfrak{D}} \def\ge{\mathfrak{E}} \def\gf{\mathfrak{F}} \def\gg{\mathfrak{G}} \def\gh{\mathfrak{H}} \def\gi{\mathfrak{I}} \def\gj{\mathfrak{J}} \def\gk{\mathfrak{K}} \def\gl{\mathfrak{L}} \def\gm{\mathfrak{M}} \def\gn{\mathfrak{N}} \def\go{\mathfrak{O}} \def\gp{\mathfrak{P}} \def\gq{\mathfrak{Q}} \def\gr{\mathfrak{R}} \def\gs{\mathfrak{S}} \def\gt{\mathfrak{T}} \def\gu{\mathfrak{U}} \def\gv{\mathfrak{V}} \def\gw{\mathfrak{W}} \def\gx{\mathfrak{X}} \def\gy{\mathfrak{Y}} \def\gz{\mathfrak{Z}} \def\ra{A} \def\rb{B} \def\rc{C} \def\rd{D} \def\re{E} \def\rf{F} \def\rg{G} \def\rh{H} \def\ri{I} \def\rj{J} \def\rk{K} \def\rl{L} \def\rm{M} \def\rn{N} \def\ro{O} \def\rp{P} \def\rq{Q} \def\rr{R} \def\rs{S} \def\rt{T} \def\ru{U} \def\rv{V} \def\rw{W} \def\rx{X} \def\ry{Y} \def\rz{Z} \def\rva{\vec{A}} \def\rvb{\vec{B}} \def\rvc{\vec{C}} \def\rvd{\vec{D}} \def\rve{\vec{E}} \def\rvf{\vec{F}} \def\rvg{\vec{G}} \def\rvh{\vec{H}} \def\rvi{\vec{I}} \def\rvj{\vec{J}} \def\rvk{\vec{K}} \def\rvl{\vec{L}} \def\rvm{\vec{M}} \def\rvn{\vec{N}} \def\rvo{\vec{O}} \def\rvp{\vec{P}} \def\rvq{\vec{Q}} \def\rvr{\vec{R}} \def\rvs{\vec{S}} \def\rvt{\vec{T}} \def\rvu{\vec{U}} \def\rvv{\vec{V}} 
\def\rvw{\vec{W}} \def\rvx{\vec{X}} \def\rvy{\vec{Y}} \def\rvz{\vec{Z}} \def\seta{A} \def\setb{B} \def\setc{C} \def\setd{D} \def\sete{E} \def\setf{F} \def\setg{G} \def\seth{H} \def\seti{I} \def\setj{J} \def\setk{K} \def\setl{L} \def\setm{M} \def\setn{N} \def\seto{O} \def\setp{P} \def\setq{Q} \def\setr{R} \def\sets{S} \def\sett{T} \def\setu{U} \def\setv{V} \def\setw{W} \def\setx{X} \def\sety{Y} \def\setz{Z} \def\fa{a} \def\fb{b} \def\fc{c} \def\fd{d} \def\fe{e} \def\ff{f} \def\fg{g} \def\fh{h} \def\fi{i} \def\fj{j} \def\fk{k} \def\fl{l} \def\fm{m} \def\fn{n} \def\fo{o} \def\fp{p} \def\fq{q} \def\fr{r} \def\fs{s} \def\ft{t} \def\fu{u} \def\fv{v} \def\fw{w} \def\fx{x} \def\fy{y} \def\fz{z} \def\fA{A} \def\fB{B} \def\fC{C} \def\fD{D} \def\fE{E} \def\fF{F} \def\fG{G} \def\fH{H} \def\fI{I} \def\fJ{J} \def\fK{K} \def\fL{L} \def\fM{M} \def\fN{N} \def\fO{O} \def\fP{P} \def\fQ{Q} \def\fR{R} \def\fS{S} \def\fT{T} \def\fU{U} \def\fV{V} \def\fW{W} \def\fX{X} \def\fY{Y} \def\fZ{Z} \def\ma{A} \def\mb{B} \def\mc{C} \def\md{D} \def\me{E} \def\mf{F} \def\mg{G} \def\mh{H} \def\mi{I} \def\mj{J} \def\mk{K} \def\ml{L} \def\mm{M} \def\mn{N} \def\mo{O} \def\mp{P} \def\mq{Q} \def\mr{R} \def\ms{S} \def\mt{T} \def\matu{U} \def\mv{V} \def\mw{W} \def\mx{X} \def\my{Y} \def\mz{Z} \def\loss{\mathcal{L}} \newcommand{\dkl}[2]{D_{\text{KL}}\mathopen{}\paren{#1\,||\,#2}} \newcommand{\dataset}{S} \newcommand{\ndataset}{N} \newcommand{\idataset}{n} \newcommand{\inputRV}{\mathcal{X}} \newcommand{\inputvec}{\vec{x}} \newcommand{\ninputvec}[1]{\vec{x}_{#1}} \newcommand{\iinputvec}[1]{x_{#1}} \newcommand{\niinputvec}[2]{x_{#1, #2}} \newcommand{\icpnt}{i} \newcommand{\inputmatrix}{X} \newcommand{\inputdim}{D} \newcommand{\outputval}{y} \newcommand{\ioutputval}[1]{y_{#1}} \newcommand{\outputvec}{\vec{y}} \newcommand{\trainset}{S_{\text{train}}} \newcommand{\testset}{S_{\text{test}}} \newcommand{\truemodel}{f_{\text{true}}} \newcommand{\trainedmodel}{f_{\trainset}} \newcommand{\linmodel}[1]{f_{#1}} 
\newcommand{\bestmodel}{f^{*}} \newcommand{\model}{f} \newcommand{\hyperparam}{\lambda} \newcommand{\linparamv}{\vec{w}} \newcommand{\ilinparam}[1]{w_{#1}} \newcommand{\indivloss}{l} \newcommand{\modelclass}{\mathcal{F}} \newcommand{\linclass}{\modelclass_{\text{lin}}} \newcommand{\g}{\mathcal{G}} \newcommand{\gmse}{\g_{\text{MSE}}} \newcommand{\glasso}{\g_{\text{lasso}}} \newcommand{\gridge}{\g_{\text{ridge}}} \newcommand{\glogit}{\g_{\logit}} \newcommand{\l}{\mathcal{L}} \newcommand{\lmse}{\l_{\text{MSE}}} \newcommand{\lmae}{\l_{\text{MAE}}} \newcommand{\llasso}{\l_{\text{lasso}}} \newcommand{\lridge}{\l_{\text{ridge}}} \newcommand{\llogit}{\l_{\logit}} \newcommand{\logit}{\sigma} \newcommand{\reg}{\mathcal{R}} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\mean}{mean} \DeclareMathOperator*{\avg}{avg} \DeclareMathOperator*{\span}{span} \DeclareMathOperator*{\var}{var} \DeclareMathOperator*{\bias}{bias} \newcommand{\expectation}{\mathbb{E}} \newcommand{\brak}[1]{\left[#1\right]} \newcommand{\paren}[1]{\left(#1\right)} \newcommand{\realset}{\mathbb{R}} \newcommand{\realvset}[1]{\realset^{#1}} \newcommand{\prob}{\mathbb{P}} \newcommand{\gaussian}{\mathcal{N}} \newcommand{\iid}{\stackrel{\text{i.i.d.}}{\sim}} \newcommand{\abs}[1]{\left\lvert#1\right\rvert} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\normtwo}[1]{\norm{#1}_{2}} \newcommand{\normone}[1]{\norm{#1}_{1}} \newcommand{\card}[1]{\left\lvert#1\right\rvert} \newcommand{\grad}{\nabla} \newcommand{\dconv}{\stackrel{d}{\to}} \newcommand{\pconv}{\stackrel{p}{\to}} \newcommand{\rva}[1]{#1} \newcommand{\rve}[1]{\vec{#1}} \newcommand{\obs}[1]{#1} \newcommand{\vobs}[1]{\vec{#1}} \newcommand{\distrib}[1]{#1} \newcommand{\distribof}[2]{#1_{#2}} \newcommand{\density}[1]{#1} \newcommand{\densityof}[2]{#1_{#2}} \newcommand{\distributed}{\sim} \newcommand{\const}[1]{#1} \newcommand{\fun}[1]{#1}$

Hypothesis testing and p-values are often misused and misunderstood. In this article, I explain what a p-value is, and how to use it.

First, we must understand in which situations it is appropriate to use p-values.

When to use p-valued hypothesis testing?

Hypothesis testing with p-values is appropriate when you must decide between two courses of action and one of them has significantly lower cost than the other.

For instance, a company must decide between:

$H_0$ : keep the current number of waiters in my restaurants;
$H_1$ : increase the number of waiters in my restaurants.

As another example, a scientist must decide between:

$H_0$ : the current theory is valid;
$H_1$ : the current theory is invalid and my new theory is better.

If we think of $H_0$ as being the current accepted model for the laws of physics and $H_1$ a new set of laws, the cost of switching to $H_1$ is huge. Every textbook must be updated, scientists must all learn the new theory, etc.

The cost of $H_0$ is null while the cost of $H_1$ is significantly higher. This is a convenient way to remember why $H_0$ is called the null hypothesis.

Our preferred course of action is $H_0$ and it is the course of action that we will follow by default.

What is p-valued hypothesis testing?

Hypothesis testing is a tool that relies on data. It can tell us if the data is a counter-example to our hypothesis $H_0$ , in which case we say that $H_0$ is rejected. And when the data is not a counter-example, $H_0$ is neither rejected nor accepted.

The same process can be observed in abstract mathematics. To prove a theorem, we need a formal proof. But to reject it, we only need a counter-example.

Since in real life we don’t know the exact rules of “nature”, we can’t prove formally that $H_0$ is true. But we can try to find counter-examples in the available data.

So, by design, the statistical test attempts to reject $H_0$ using the data. But contrary to abstract mathematics, in statistics we must deal with uncertainty. In particular, a mismatch between the statistical test and the type of data provided can happen, in which case we can’t be 100% confident in the test’s output.

This is why a $p$ -valued test will tell us how confident we can be in its answer. The output of such a test is:

“You should reject $H_0$ . And here is the probability that I’m mistaken: $p$ ”

When the $p$ -value is small, data like ours would rarely be observed if $H_0$ were true. There is then little probability that the test is mistaken and we can be confident in rejecting $H_0$ (i.e. saying that the data is a counter-example to $H_0$ ).

When the $p$ -value is large, however, the data could easily have been produced under $H_0$ , so there is a high probability that rejecting $H_0$ would be a mistake and we shouldn’t do it. So what can we do? We can use a different test, gather more data, or stay with $H_0$ until the next time we attempt to reject it.

About statistical significance

So, when the $p$ -value is small, we can trust the tool we used. When it is big, we can’t trust the tool because it’s likely to produce bogus results.

But how small is small enough?

The common convention is to set the threshold at $p < 0.05$ . This means that we want a probability smaller than $0.05$ that the tool is bogus.

A good way to interpret this is in terms of frequency. Out of $100$ tests run in situations where $H_0$ actually holds, the tool wrongly rejects $H_0$ about $5$ times. In other words, the tool can be trusted only $95\%$ of the time.

Depending on the cost to implement $H_1$ , we might want to require that the tool be trusted $99\%$ of the time, in which case we will set the threshold at $p < 0.01$ .
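The frequency interpretation can be checked with a small simulation. The sketch below is only an illustration under assumptions of my own: the data is drawn from a gaussian distribution with a made-up mean of 20 and variance 1, so that $H_0$ is true by construction, and the test is a one-sided z-test. The function names, sample size and number of trials are hypothetical choices.

```python
import random
from statistics import NormalDist, fmean

def one_sided_p_value(sample, m0):
    """p-value of a one-sided z-test of mean m0 against a larger mean.

    Assumes the data is gaussian with known variance 1.
    """
    n = len(sample)
    z = (fmean(sample) - m0) * n ** 0.5
    return 1.0 - NormalDist().cdf(z)

def false_rejection_rate(threshold, m0=20.0, n=30, trials=10_000, seed=0):
    """Fraction of trials in which H0 is rejected although H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # The sample really comes from the H0 distribution.
        sample = [rng.gauss(m0, 1.0) for _ in range(n)]
        if one_sided_p_value(sample, m0) < threshold:
            rejections += 1
    return rejections / trials

rate_05 = false_rejection_rate(0.05)
rate_01 = false_rejection_rate(0.01)
print(f"threshold 0.05: H0 wrongly rejected in {rate_05:.1%} of trials")
print(f"threshold 0.01: H0 wrongly rejected in {rate_01:.1%} of trials")
```

The two rates should land close to the chosen thresholds, which is what lowering the threshold buys: fewer wrong rejections of $H_0$ .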

How to use it?

To use a statistical test, we must translate our situation into a statistical formulation. It is precisely because of this modeling step that there can be a mismatch between our data and the test we use, and that we must quantify how trustworthy the test’s results are.

Data

Concretely, we start by gathering some numerical data $(\sx_1, \cdots, \sx_\sn)$ under the conditions of the alternative hypothesis.

For the company example, the data could be:

$\sx_\si = \text{amount of money spent by client } \si$

collected in some restaurants where the staff was actually increased for the purpose of testing.

We can’t directly compare the average amount of money spent (AAM) in the normal restaurants to AAM in the staffed restaurants because of uncertainty: if we collect more data, those averages might slightly change. If we see that one average is bigger than the other, does it mean that there is really a difference, or is it simply a random effect? In statistics, we model uncertainty using probability distributions. So, instead of comparing the averages, we use a statistical test to compare the underlying distributions.
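This uncertainty is easy to see in a simulation. The sketch below draws two batches of client spending from the same hypothetical gaussian distribution and compares their averages. The mean of 20, the spread of 5 and the batch size of 50 are made-up values, not figures from the company example.

```python
import random
from statistics import fmean

rng = random.Random(42)

def average_spend(n_clients, true_mean=20.0, spread=5.0):
    """Average amount spent by n_clients simulated clients (hypothetical values)."""
    amounts = [rng.gauss(true_mean, spread) for _ in range(n_clients)]
    return fmean(amounts)

# Both batches come from the SAME distribution: nothing changed between them.
batch_a = average_spend(50)
batch_b = average_spend(50)
print(f"average of batch A: {batch_a:.2f}")
print(f"average of batch B: {batch_b:.2f}")
print(f"difference due to chance alone: {abs(batch_a - batch_b):.2f}")
```

Even though nothing distinguishes the two batches, their averages differ, so a raw difference between two averages is not evidence by itself.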

Model

Then, we use statistical modeling to model the null hypothesis $H_0$ by some probability distribution and the alternative $H_1$ by another probability distribution.

For instance, the company wants to know if the average amount of money spent increased under $H_1$ .

It already knows the average amount ( $\sm_0$ ) of money spent by its customers in regular restaurants. So it can choose a gaussian distribution with mean $\sm_0$ for $H_0$ . In statistical terms, $H_0$ is modeled by:

$H_0: (\sx_1, \cdots, \sx_\sn) \iid \gaussian(\sm_0, 1)$

For $H_1$ , it wants to test if the average amount increased, so it can take a gaussian distribution with mean $\sm_1 > \sm_0$ :

$H_1: (\sx_1, \cdots, \sx_\sn) \iid \gaussian(\sm_1, 1)$

The choice of a gaussian model requires statistical knowledge and is a potential source of mismatch between the statistical tools we will use and the actual data. Choosing a gaussian model means that we will use a test for gaussian distributions. Had we chosen a different model, we would have used a different test.
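
For the gaussian models above, one matching test is the one-sided z-test. As a sketch, assuming the unit variance used in both models (the symbols $\bar{\sx}$ , $z$ , $Z$ and $\Phi$ are introduced here for illustration), the test summarizes the data by:

$z = \sqrt{\sn} \, (\bar{\sx} - \sm_0), \quad \text{with} \quad \bar{\sx} = \frac{1}{\sn} \sum_{\si=1}^{\sn} \sx_\si$

If $H_0$ holds, $z$ follows $\gaussian(0, 1)$ , so the $p$ -value is the probability that a standard gaussian variable $Z$ is at least as large as the observed $z$ :

$p = \prob(Z \geq z) = 1 - \Phi(z)$

where $\Phi$ is the cumulative distribution function of $\gaussian(0, 1)$ . A large positive $z$ , i.e. an average well above $\sm_0$ , gives a small $p$ .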

Test

Then, we use the statistical test on the data and the models.

Here is what the test outputs:

“Your data is incompatible with the distribution of $H_0$ and there is probability $p$ that I don’t know what I’m talking about”

When $p$ is small, we can be confident that the data has not been produced by $H_0$ and thus reject it. When $p$ is large, the test is inconclusive: $H_0$ is neither rejected nor accepted.
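
To make the whole procedure concrete, here is a minimal sketch for the company example, assuming a one-sided z-test with variance 1 as in the gaussian models above. The value of `m0` and the amounts in `spend` are invented for illustration and are not real data.

```python
from statistics import NormalDist, fmean

def z_test_p_value(sample, m0):
    """One-sided z-test: H0 mean is m0, H1 mean is larger (variance assumed 1)."""
    n = len(sample)
    z = (fmean(sample) - m0) * n ** 0.5
    # Probability, under H0, of a z at least as large as the observed one.
    return 1.0 - NormalDist().cdf(z)

m0 = 20.0  # hypothetical known average spend in regular restaurants
# Hypothetical amounts spent by 10 clients in the restaurants with more waiters.
spend = [20.4, 21.1, 19.8, 20.9, 21.5, 20.2, 19.6, 21.0, 20.7, 20.8]

p = z_test_p_value(spend, m0)
print(f"p = {p:.4f}")
print("reject H0" if p < 0.05 else "do not reject H0")
```

With real data, the variance would rarely be exactly 1, which is one of the mismatches between model and data that the modeling step can introduce.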