Introduction to hypothesis testing

$\def\sa{a} \def\sb{b} \def\sc{c} \def\sd{d} \def\se{e} \def\sf{f} \def\sg{g} \def\sh{h} \def\si{i} \def\sj{j} \def\sk{k} \def\sl{l} \def\sm{m} \def\sn{n} \def\so{o} \def\sp{p} \def\sq{q} \def\sr{r} \def\ss{s} \def\st{t} \def\su{u} \def\sv{v} \def\sw{w} \def\sx{x} \def\sy{y} \def\sz{z} \def\va{\vec{a}} \def\vb{\vec{b}} \def\vc{\vec{c}} \def\vd{\vec{d}} \def\ve{\vec{e}} \def\vf{\vec{f}} \def\vg{\vec{g}} \def\vh{\vec{h}} \def\vi{\vec{i}} \def\vj{\vec{j}} \def\vk{\vec{k}} \def\vl{\vec{l}} \def\vm{\vec{m}} \def\vn{\vec{n}} \def\vo{\vec{o}} \def\vp{\vec{p}} \def\vq{\vec{q}} \def\vr{\vec{r}} \def\vs{\vec{s}} \def\vt{\vec{t}} \def\vu{\vec{u}} \def\vv{\vec{v}} \def\vw{\vec{w}} \def\vx{\vec{x}} \def\vy{\vec{y}} \def\vz{\vec{z}} \def\ga{\mathfrak{A}} \def\gb{\mathfrak{B}} \def\gc{\mathfrak{C}} \def\gd{\mathfrak{D}} \def\ge{\mathfrak{E}} \def\gf{\mathfrak{F}} \def\gg{\mathfrak{G}} \def\gh{\mathfrak{H}} \def\gi{\mathfrak{I}} \def\gj{\mathfrak{J}} \def\gk{\mathfrak{K}} \def\gl{\mathfrak{L}} \def\gm{\mathfrak{M}} \def\gn{\mathfrak{N}} \def\go{\mathfrak{O}} \def\gp{\mathfrak{P}} \def\gq{\mathfrak{Q}} \def\gr{\mathfrak{R}} \def\gs{\mathfrak{S}} \def\gt{\mathfrak{T}} \def\gu{\mathfrak{U}} \def\gv{\mathfrak{V}} \def\gw{\mathfrak{W}} \def\gx{\mathfrak{X}} \def\gy{\mathfrak{Y}} \def\gz{\mathfrak{Z}} \def\ra{A} \def\rb{B} \def\rc{C} \def\rd{D} \def\re{E} \def\rf{F} \def\rg{G} \def\rh{H} \def\ri{I} \def\rj{J} \def\rk{K} \def\rl{L} \def\rm{M} \def\rn{N} \def\ro{O} \def\rp{P} \def\rq{Q} \def\rr{R} \def\rs{S} \def\rt{T} \def\ru{U} \def\rv{V} \def\rw{W} \def\rx{X} \def\ry{Y} \def\rz{Z} \def\rva{\vec{A}} \def\rvb{\vec{B}} \def\rvc{\vec{C}} \def\rvd{\vec{D}} \def\rve{\vec{E}} \def\rvf{\vec{F}} \def\rvg{\vec{G}} \def\rvh{\vec{H}} \def\rvi{\vec{I}} \def\rvj{\vec{J}} \def\rvk{\vec{K}} \def\rvl{\vec{L}} \def\rvm{\vec{M}} \def\rvn{\vec{N}} \def\rvo{\vec{O}} \def\rvp{\vec{P}} \def\rvq{\vec{Q}} \def\rvr{\vec{R}} \def\rvs{\vec{S}} \def\rvt{\vec{T}} \def\rvu{\vec{U}} \def\rvv{\vec{V}} 
\def\rvw{\vec{W}} \def\rvx{\vec{X}} \def\rvy{\vec{Y}} \def\rvz{\vec{Z}} \def\seta{A} \def\setb{B} \def\setc{C} \def\setd{D} \def\sete{E} \def\setf{F} \def\setg{G} \def\seth{H} \def\seti{I} \def\setj{J} \def\setk{K} \def\setl{L} \def\setm{M} \def\setn{N} \def\seto{O} \def\setp{P} \def\setq{Q} \def\setr{R} \def\sets{S} \def\sett{T} \def\setu{U} \def\setv{V} \def\setw{W} \def\setx{X} \def\sety{Y} \def\setz{Z} \def\fa{a} \def\fb{b} \def\fc{c} \def\fd{d} \def\fe{e} \def\ff{f} \def\fg{g} \def\fh{h} \def\fi{i} \def\fj{j} \def\fk{k} \def\fl{l} \def\fm{m} \def\fn{n} \def\fo{o} \def\fp{p} \def\fq{q} \def\fr{r} \def\fs{s} \def\ft{t} \def\fu{u} \def\fv{v} \def\fw{w} \def\fx{x} \def\fy{y} \def\fz{z} \def\fA{A} \def\fB{B} \def\fC{C} \def\fD{D} \def\fE{E} \def\fF{F} \def\fG{G} \def\fH{H} \def\fI{I} \def\fJ{J} \def\fK{K} \def\fL{L} \def\fM{M} \def\fN{N} \def\fO{O} \def\fP{P} \def\fQ{Q} \def\fR{R} \def\fS{S} \def\fT{T} \def\fU{U} \def\fV{V} \def\fW{W} \def\fX{X} \def\fY{Y} \def\fZ{Z} \def\ma{A} \def\mb{B} \def\mc{C} \def\md{D} \def\me{E} \def\mf{F} \def\mg{G} \def\mh{H} \def\mi{I} \def\mj{J} \def\mk{K} \def\ml{L} \def\mm{M} \def\mn{N} \def\mo{O} \def\mp{P} \def\mq{Q} \def\mr{R} \def\ms{S} \def\mt{T} \def\matu{U} \def\mv{V} \def\mw{W} \def\mx{X} \def\my{Y} \def\mz{Z} \def\loss{\mathcal{L}} \newcommand{\dkl}[2]{D_{\text{KL}}\mathopen{}\paren{#1\,||\,#2}} \newcommand{\dataset}{S} \newcommand{\ndataset}{N} \newcommand{\idataset}{n} \newcommand{\inputRV}{\mathcal{X}} \newcommand{\inputvec}{\vec{x}} \newcommand{\ninputvec}[1]{\vec{x}_{#1}} \newcommand{\iinputvec}[1]{x_{#1}} \newcommand{\niinputvec}[2]{x_{#1, #2}} \newcommand{\icpnt}{i} \newcommand{\inputmatrix}{X} \newcommand{\inputdim}{D} \newcommand{\outputval}{y} \newcommand{\ioutputval}[1]{y_{#1}} \newcommand{\outputvec}{\vec{y}} \newcommand{\trainset}{S_{\text{train}}} \newcommand{\testset}{S_{\text{test}}} \newcommand{\truemodel}{f_{\text{true}}} \newcommand{\trainedmodel}{f_{\trainset}} \newcommand{\linmodel}[1]{f_{#1}} 
\newcommand{\bestmodel}{f^{*}} \newcommand{\model}{f} \newcommand{\hyperparam}{\lambda} \newcommand{\linparamv}{\vec{w}} \newcommand{\ilinparam}[1]{w_{#1}} \newcommand{\indivloss}{l} \newcommand{\modelclass}{\mathcal{F}} \newcommand{\linclass}{\modelclass_{\text{lin}}} \newcommand{\g}{\mathcal{G}} \newcommand{\gmse}{\g_{\text{MSE}}} \newcommand{\glasso}{\g_{\text{lasso}}} \newcommand{\gridge}{\g_{\text{ridge}}} \newcommand{\glogit}{\g_{\logit}} \newcommand{\l}{\mathcal{L}} \newcommand{\lmse}{\l_{\text{MSE}}} \newcommand{\lmae}{\l_{\text{MAE}}} \newcommand{\llasso}{\l_{\text{lasso}}} \newcommand{\lridge}{\l_{\text{ridge}}} \newcommand{\llogit}{\l_{\logit}} \newcommand{\logit}{\sigma} \newcommand{\reg}{\mathcal{R}} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\mean}{mean} \DeclareMathOperator*{\avg}{avg} \DeclareMathOperator*{\span}{span} \DeclareMathOperator*{\var}{var} \DeclareMathOperator*{\bias}{bias} \newcommand{\expectation}{\mathbb{E}} \newcommand{\brak}[1]{\left[#1\right]} \newcommand{\paren}[1]{\left(#1\right)} \newcommand{\realset}{\mathbb{R}} \newcommand{\realvset}[1]{\realset^{#1}} \newcommand{\prob}{\mathbb{P}} \newcommand{\gaussian}{\mathcal{N}} \newcommand{\iid}{\stackrel{\text{i.i.d.}}{\sim}} \newcommand{\abs}[1]{\left\lvert#1\right\rvert} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\normtwo}[1]{\norm{#1}_{2}} \newcommand{\normone}[1]{\norm{#1}_{1}} \newcommand{\card}[1]{\left\lvert#1\right\rvert} \newcommand{\grad}{\nabla} \newcommand{\dconv}{\stackrel{d}{\to}} \newcommand{\pconv}{\stackrel{p}{\to}} \newcommand{\rva}[1]{#1} \newcommand{\rve}[1]{\vec{#1}} \newcommand{\obs}[1]{#1} \newcommand{\vobs}[1]{\vec{#1}} \newcommand{\distrib}[1]{#1} \newcommand{\distribof}[2]{#1_{#2}} \newcommand{\density}[1]{#1} \newcommand{\densityof}[2]{#1_{#2}} \newcommand{\distributed}{\sim} \newcommand{\const}[1]{#1} \newcommand{\fun}[1]{#1}$

We introduce the basic vocabulary required to understand hypothesis testing and define the p-value.

Introduction

Scientists accept a theory as long as a better theory hasn’t been found. Each time a theory is recognized, we have no way to determine whether it is true for sure, but at least we know it’s better than the previous theory we had.

For instance, the laws of Newton ( $\vf = \sm\va$ ) were widely accepted and used with success. It turned out that they were only approximately true and a better theory was found in relativity. Has relativity theory found the true equations of nature? We don’t know, but it’s the model in use until we find a better one.

Hypothesis testing provides a tool for deciding whether to reject an existing theory in favor of a new candidate theory.

Rejection vs acceptance

Before diving into hypothesis testing, it’s important to understand why probability theory can be used to reject a hypothesis, but not to accept one.

Hypotheses are modeled by probability distributions. Given an observation $\sy$, we can ask “how probable is $\sy$ if it was generated by model $\theta_1$?”

The answer is:

$\prob[\sy \mid \theta_1]$

If this probability is high, does it mean that we should accept $\theta_1$? Not necessarily, because it could be high by coincidence. Also, another hypothesis $\theta_2$ might yield an even higher probability.

But if this probability is very small, we don’t need a second hypothesis to suspect that $\theta_1$ is a bad model.
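As an illustration, here is a minimal sketch that evaluates $\prob[\sy \mid \theta_1]$ for a toy model. The setup is an assumption made for the example and does not come from the text above: $\theta_1$ is "the coin is fair", and the observation $\sy$ is the number of heads in 10 tosses. The helper `binomial_pmf` is a hypothetical name.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of k heads in n independent tosses with head probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n = 10
theta_1 = 0.5  # assumed model: a fair coin

# Probability of two possible observations under the fair-coin model.
prob_ten_heads = binomial_pmf(10, n, theta_1)
prob_five_heads = binomial_pmf(5, n, theta_1)

print(prob_ten_heads)   # 0.0009765625
print(prob_five_heads)  # 0.24609375
```

Observing 10 heads has probability below 0.1% under the fair-coin model, which is enough to be suspicious of it without naming an alternative. Observing 5 heads is the most probable outcome, yet that alone does not prove the coin is fair.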

The hypotheses

As always in statistics, we model all this with samples and distributions.

Let $\rvy = (\ry_1, \dotsc, \ry_\sn)$ be a sample of $\sn$ random variables. Model the source of $\rvy$ as the distribution $\ff_{\rvy}(\vy \mid \theta)$ where $\theta \in \Theta$ is an unknown parameter.

We model the existing theory with a subset $\Theta_0 \subset \Theta$ and the candidate theory with another disjoint subset $\Theta_1 \subset \Theta$ . The hypotheses are:

	$H_0: \theta \in \Theta_0$	We keep the current theory
	$H_1: \theta \in \Theta_1$	The new theory is better

Given an observed sample $\vy = (\sy_1, \dotsc, \sy_\sn)$ from $\rvy$, which of the two regions $\Theta_0$ and $\Theta_1$ is more likely to contain the true value $\theta$ of the parameter?

How to decide between $H_0$ and $H_1$ ?

To decide whether we reject the old theory, we use a test function:

$\delta(\vy) \in \{0, 1\}$

We keep $H_0$ when $\delta(\vy) = 0$, and we reject $H_0$ in favor of $H_1$ when $\delta(\vy) = 1$.
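To fix ideas, here is a sketch of one possible test function. Everything specific in it is an assumption made for the example: the sample is taken to be Gaussian with known standard deviation `sigma`, with $\Theta_0 = \{0\}$ and $\Theta_1 = (0, \infty)$, and the threshold `1.645` is an arbitrary cut-off at this stage.

```python
from math import sqrt

def delta(y, threshold=1.645, sigma=1.0):
    """Toy test function: return 1 to reject H0 (theta = 0), 0 to keep it.

    Assumes y is a Gaussian sample with known standard deviation sigma.
    """
    sample_mean = sum(y) / len(y)
    # Standardized distance between the sample mean and the H0 value 0.
    z = sqrt(len(y)) * sample_mean / sigma
    return 1 if z > threshold else 0

print(delta([0.1, -0.3, 0.2, 0.0]))  # 0: the sample mean is close to 0, keep H0
print(delta([1.2, 0.9, 1.5, 1.1]))   # 1: the sample mean is far above 0, reject H0
```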

There exist numerous such test functions, just like there exist numerous estimators. Rather than diving into the details now, let’s discuss how to choose one.

Quantifying errors

Since we don’t have all the possible observations from the source $\ff_{\rvy}$ but only a sample $\vy$, we might make mistakes in deciding between $H_0$ and $H_1$. Our decision might also change if we collect more data.

There are two types of mistakes:

type $1$ : decide in favor of $H_1$ when $H_0$ is better;
type $2$ : decide in favor of $H_0$ when $H_1$ is better.

	$H_0$ better	$H_1$ better
Choose $H_0$	no error	Type $2$ error
Choose $H_1$	Type $1$ error	no error

In practice, one type of error is more costly than the other.

For instance, if we decide in favor of $H_1$ when in fact $H_0$ is better, this means we choose the new theory when we should have kept the old one.

This is very costly because every textbook will be updated with the new theory, only to discover a few years later that we should switch back to the old one.
On the other hand, if we decide to keep the old theory when $H_1$ is better (type $2$ error), then there is no immediate cost and we can always re-evaluate the new theory when we have more data.

So we fix a significance level $\alpha$ to bound the probability of type $1$ errors:

$\prob[\text{type 1 error}] \leq \alpha$

And we only consider the test functions $\delta$ that can guarantee that the above threshold is respected.

In terms of the test function $\delta$ , the probability of type $1$ error is written:

$\prob[\delta(\rvy) = 1 \mid \theta\in\Theta_{0}] \leq \alpha$
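The bound above can be checked empirically with a small simulation. The sketch below rests on assumptions made for the example only: a Gaussian sample with known standard deviation 1, $\Theta_0 = \{0\}$, and a one-sided test that rejects when the standardized sample mean exceeds the $(1 - \alpha)$ quantile of the standard normal distribution. The function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

def make_delta(alpha, sigma=1.0):
    """Build a one-sided test function with rejection threshold set by alpha."""
    threshold = NormalDist().inv_cdf(1 - alpha)

    def delta(y):
        z = sqrt(len(y)) * (sum(y) / len(y)) / sigma
        return 1 if z > threshold else 0

    return delta

def estimate_type1_error(alpha, n=20, trials=10_000, seed=0):
    """Fraction of samples drawn under H0 (theta = 0) for which H0 is rejected."""
    rng = random.Random(seed)
    delta = make_delta(alpha)
    rejections = 0
    for _ in range(trials):
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]  # data generated under H0
        rejections += delta(y)
    return rejections / trials

rate = estimate_type1_error(alpha=0.05)
print(f"estimated probability of type 1 error: {rate:.3f}")
```

For this particular test the estimated frequency should land close to the chosen $\alpha$, up to simulation noise.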

The $p$-value

Let’s take a family of test functions $\{\delta_\alpha \mid \alpha \in (0, 1)\}$ such that $\delta_\alpha$ has significance level $\alpha$:

$\prob[\delta_\alpha(\rvy) = 1 \mid \theta\in\Theta_{0}] \leq \alpha$

Given a sample $\vy$ , each test function will decide between keeping $H_0$ or rejecting $H_0$ .

Recall that for a test function $\delta_\alpha$ :

$H_0$ is rejected when $\delta_\alpha(\vy) = 1$ ;
And this is an error with probability at most $\alpha$ .

The $p$-value is the smallest $\alpha$ such that $H_0$ is rejected:

$p(\vy) = \inf\{\alpha \mid \delta_\alpha(\vy) = 1\}$

In other words, it is the strictest bound on the probability of a type $1$ error under which the observed sample still leads to rejecting $H_0$.

When $p(\vy)$ is small, $H_0$ is rejected even by test functions that tolerate very few type $1$ errors, so we can be confident in the rejection.
When $p(\vy)$ is large, $H_0$ is only rejected by test functions that tolerate a high probability of type $1$ error, so we shouldn’t trust the rejection.

It is used as a measure of evidence against $H_0$ :

a small $p$-value provides evidence against $H_0$;
a large $p$-value provides no evidence against $H_0$.
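The definition $p(\vy) = \inf\{\alpha \mid \delta_\alpha(\vy) = 1\}$ can be evaluated numerically. The sketch below is only an illustration under assumptions that are not part of the text: a Gaussian sample with known standard deviation 1, $\Theta_0 = \{0\}$, and a family of one-sided tests $\delta_\alpha$ that reject when the standardized sample mean exceeds the $(1 - \alpha)$ standard normal quantile. It scans a grid of values of $\alpha$ and compares the result with the closed form that holds for this specific family.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def z_score(y, sigma=1.0):
    """Standardized sample mean under the assumed H0 value theta = 0."""
    return sqrt(len(y)) * (sum(y) / len(y)) / sigma

def delta_alpha(y, alpha):
    """Assumed test of level alpha: reject when z exceeds the (1 - alpha) quantile."""
    return 1 if z_score(y) > STD_NORMAL.inv_cdf(1 - alpha) else 0

def p_value_grid(y, grid_size=10_000):
    """Approximate inf{alpha : delta_alpha(y) = 1} on a regular grid in (0, 1)."""
    alphas = (k / grid_size for k in range(1, grid_size))
    rejecting = [a for a in alphas if delta_alpha(y, a) == 1]
    return min(rejecting) if rejecting else 1.0

def p_value_closed_form(y):
    """For this family of tests, the infimum equals 1 - Phi(z)."""
    return 1 - STD_NORMAL.cdf(z_score(y))

y = [0.8, 1.1, 0.3, 1.4]  # made-up observations, for illustration only
p_grid = p_value_grid(y)
p_exact = p_value_closed_form(y)
print(f"grid: {p_grid:.4f}, closed form: {p_exact:.4f}")
```

The two values agree up to the grid resolution, which shows that the infimum over the family of tests and the usual tail-probability formula describe the same quantity in this setting.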