Introduction to PAC Learning

$\def\sa{a} \def\sb{b} \def\sc{c} \def\sd{d} \def\se{e} \def\sf{f} \def\sg{g} \def\sh{h} \def\si{i} \def\sj{j} \def\sk{k} \def\sl{l} \def\sm{m} \def\sn{n} \def\so{o} \def\sp{p} \def\sq{q} \def\sr{r} \def\ss{s} \def\st{t} \def\su{u} \def\sv{v} \def\sw{w} \def\sx{x} \def\sy{y} \def\sz{z} \def\va{\vec{a}} \def\vb{\vec{b}} \def\vc{\vec{c}} \def\vd{\vec{d}} \def\ve{\vec{e}} \def\vf{\vec{f}} \def\vg{\vec{g}} \def\vh{\vec{h}} \def\vi{\vec{i}} \def\vj{\vec{j}} \def\vk{\vec{k}} \def\vl{\vec{l}} \def\vm{\vec{m}} \def\vn{\vec{n}} \def\vo{\vec{o}} \def\vp{\vec{p}} \def\vq{\vec{q}} \def\vr{\vec{r}} \def\vs{\vec{s}} \def\vt{\vec{t}} \def\vu{\vec{u}} \def\vv{\vec{v}} \def\vw{\vec{w}} \def\vx{\vec{x}} \def\vy{\vec{y}} \def\vz{\vec{z}} \def\ga{\mathfrak{A}} \def\gb{\mathfrak{B}} \def\gc{\mathfrak{C}} \def\gd{\mathfrak{D}} \def\ge{\mathfrak{E}} \def\gf{\mathfrak{F}} \def\gg{\mathfrak{G}} \def\gh{\mathfrak{H}} \def\gi{\mathfrak{I}} \def\gj{\mathfrak{J}} \def\gk{\mathfrak{K}} \def\gl{\mathfrak{L}} \def\gm{\mathfrak{M}} \def\gn{\mathfrak{N}} \def\go{\mathfrak{O}} \def\gp{\mathfrak{P}} \def\gq{\mathfrak{Q}} \def\gr{\mathfrak{R}} \def\gs{\mathfrak{S}} \def\gt{\mathfrak{T}} \def\gu{\mathfrak{U}} \def\gv{\mathfrak{V}} \def\gw{\mathfrak{W}} \def\gx{\mathfrak{X}} \def\gy{\mathfrak{Y}} \def\gz{\mathfrak{Z}} \def\ra{A} \def\rb{B} \def\rc{C} \def\rd{D} \def\re{E} \def\rf{F} \def\rg{G} \def\rh{H} \def\ri{I} \def\rj{J} \def\rk{K} \def\rl{L} \def\rm{M} \def\rn{N} \def\ro{O} \def\rp{P} \def\rq{Q} \def\rr{R} \def\rs{S} \def\rt{T} \def\ru{U} \def\rv{V} \def\rw{W} \def\rx{X} \def\ry{Y} \def\rz{Z} \def\rva{\vec{A}} \def\rvb{\vec{B}} \def\rvc{\vec{C}} \def\rvd{\vec{D}} \def\rve{\vec{E}} \def\rvf{\vec{F}} \def\rvg{\vec{G}} \def\rvh{\vec{H}} \def\rvi{\vec{I}} \def\rvj{\vec{J}} \def\rvk{\vec{K}} \def\rvl{\vec{L}} \def\rvm{\vec{M}} \def\rvn{\vec{N}} \def\rvo{\vec{O}} \def\rvp{\vec{P}} \def\rvq{\vec{Q}} \def\rvr{\vec{R}} \def\rvs{\vec{S}} \def\rvt{\vec{T}} \def\rvu{\vec{U}} \def\rvv{\vec{V}} 
\def\rvw{\vec{W}} \def\rvx{\vec{X}} \def\rvy{\vec{Y}} \def\rvz{\vec{Z}} \def\seta{A} \def\setb{B} \def\setc{C} \def\setd{D} \def\sete{E} \def\setf{F} \def\setg{G} \def\seth{H} \def\seti{I} \def\setj{J} \def\setk{K} \def\setl{L} \def\setm{M} \def\setn{N} \def\seto{O} \def\setp{P} \def\setq{Q} \def\setr{R} \def\sets{S} \def\sett{T} \def\setu{U} \def\setv{V} \def\setw{W} \def\setx{X} \def\sety{Y} \def\setz{Z} \def\fa{a} \def\fb{b} \def\fc{c} \def\fd{d} \def\fe{e} \def\ff{f} \def\fg{g} \def\fh{h} \def\fi{i} \def\fj{j} \def\fk{k} \def\fl{l} \def\fm{m} \def\fn{n} \def\fo{o} \def\fp{p} \def\fq{q} \def\fr{r} \def\fs{s} \def\ft{t} \def\fu{u} \def\fv{v} \def\fw{w} \def\fx{x} \def\fy{y} \def\fz{z} \def\fA{A} \def\fB{B} \def\fC{C} \def\fD{D} \def\fE{E} \def\fF{F} \def\fG{G} \def\fH{H} \def\fI{I} \def\fJ{J} \def\fK{K} \def\fL{L} \def\fM{M} \def\fN{N} \def\fO{O} \def\fP{P} \def\fQ{Q} \def\fR{R} \def\fS{S} \def\fT{T} \def\fU{U} \def\fV{V} \def\fW{W} \def\fX{X} \def\fY{Y} \def\fZ{Z} \def\ma{A} \def\mb{B} \def\mc{C} \def\md{D} \def\me{E} \def\mf{F} \def\mg{G} \def\mh{H} \def\mi{I} \def\mj{J} \def\mk{K} \def\ml{L} \def\mm{M} \def\mn{N} \def\mo{O} \def\mp{P} \def\mq{Q} \def\mr{R} \def\ms{S} \def\mt{T} \def\matu{U} \def\mv{V} \def\mw{W} \def\mx{X} \def\my{Y} \def\mz{Z} \def\loss{\mathcal{L}} \newcommand{\dkl}[2]{D_{\text{KL}}\mathopen{}\paren{#1\,||\,#2}} \newcommand{\dataset}{S} \newcommand{\ndataset}{N} \newcommand{\idataset}{n} \newcommand{\inputRV}{\mathcal{X}} \newcommand{\inputvec}{\vec{x}} \newcommand{\ninputvec}[1]{\vec{x}_{#1}} \newcommand{\iinputvec}[1]{x_{#1}} \newcommand{\niinputvec}[2]{x_{#1, #2}} \newcommand{\icpnt}{i} \newcommand{\inputmatrix}{X} \newcommand{\inputdim}{D} \newcommand{\outputval}{y} \newcommand{\ioutputval}[1]{y_{#1}} \newcommand{\outputvec}{\vec{y}} \newcommand{\trainset}{S_{\text{train}}} \newcommand{\testset}{S_{\text{test}}} \newcommand{\truemodel}{f_{\text{true}}} \newcommand{\trainedmodel}{f_{\trainset}} \newcommand{\linmodel}[1]{f_{#1}} 
\newcommand{\bestmodel}{f^{*}} \newcommand{\model}{f} \newcommand{\hyperparam}{\lambda} \newcommand{\linparamv}{\vec{w}} \newcommand{\ilinparam}[1]{w_{#1}} \newcommand{\indivloss}{l} \newcommand{\modelclass}{\mathcal{F}} \newcommand{\linclass}{\modelclass_{\text{lin}}} \newcommand{\g}{\mathcal{G}} \newcommand{\gmse}{\g_{\text{MSE}}} \newcommand{\glasso}{\g_{\text{lasso}}} \newcommand{\gridge}{\g_{\text{ridge}}} \newcommand{\glogit}{\g_{\logit}} \newcommand{\l}{\mathcal{L}} \newcommand{\lmse}{\l_{\text{MSE}}} \newcommand{\lmae}{\l_{\text{MAE}}} \newcommand{\llasso}{\l_{\text{lasso}}} \newcommand{\lridge}{\l_{\text{ridge}}} \newcommand{\llogit}{\l_{\logit}} \newcommand{\logit}{\sigma} \newcommand{\reg}{\mathcal{R}} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\mean}{mean} \DeclareMathOperator*{\avg}{avg} \DeclareMathOperator*{\span}{span} \DeclareMathOperator*{\var}{var} \DeclareMathOperator*{\bias}{bias} \newcommand{\expectation}{\mathbb{E}} \newcommand{\brak}[1]{\left[#1\right]} \newcommand{\paren}[1]{\left(#1\right)} \newcommand{\realset}{\mathbb{R}} \newcommand{\realvset}[1]{\realset^{#1}} \newcommand{\prob}{\mathbb{P}} \newcommand{\gaussian}{\mathcal{N}} \newcommand{\iid}{\stackrel{\text{i.i.d.}}{\sim}} \newcommand{\abs}[1]{\left\lvert#1\right\rvert} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\normtwo}[1]{\norm{#1}_{2}} \newcommand{\normone}[1]{\norm{#1}_{1}} \newcommand{\card}[1]{\left\lvert#1\right\rvert} \newcommand{\grad}{\nabla} \newcommand{\dconv}{\stackrel{d}{\to}} \newcommand{\pconv}{\stackrel{p}{\to}} \newcommand{\rva}[1]{#1} \newcommand{\rve}[1]{\vec{#1}} \newcommand{\obs}[1]{#1} \newcommand{\vobs}[1]{\vec{#1}} \newcommand{\distrib}[1]{#1} \newcommand{\distribof}[2]{#1_{#2}} \newcommand{\density}[1]{#1} \newcommand{\densityof}[2]{#1_{#2}} \newcommand{\distributed}{\sim} \newcommand{\const}[1]{#1} \newcommand{\fun}[1]{#1}$

What is “learning”, and do we have a formal model for it? I’ve decided to dive into the theoretical underpinnings of machine learning, so here’s a quick introduction to the theory of Probably Approximately Correct (PAC) learning.

If you want to read more, there is a very accessible book on the topic, Understanding Machine Learning: From Theory to Algorithms, which we use in the first part of the learning theory class at EPFL.

Notations

Let’s start by defining a bunch of very standard notations.

We have a domain set $X$ of objects (say, some measurements on apples) and a label set $Y$ of labels (say, “good”, “not good”). Let’s note $Z = X \times Y$ and take $m$ labeled samples $S = (z_1, ..., z_m)$ .

We assume that the objects in $X$ are of the same nature (say, all apples and no t-shirt). This is modeled by assuming that the samples $z_i$ are all drawn independently from the same probability distribution (i.i.d.) $\newcommand{\pD}{\mathcal{D}} \pD$ on $Z$. The distribution $\pD$ is unknown to the learner and must be inferred from the data.

Since the distribution $\pD$ is on $Z$, we can have two identical-looking objects in $X$ with different labels. For instance, two apples with the same size and color (same $x$) might taste different (different $y$).

The goal of a learner is to use the training samples $S$ to learn how to label an object in $X$ with the correct label $y \in Y$ .

More precisely, we want to use $S$ to induce a prediction rule $h: X \to Y$ which assigns a label $y = h(x)$ to an object $x$. A learning algorithm $A: S \mapsto h \in H$ takes the sample set $S$ and chooses a prediction rule $h$, ideally the best one, among a class $H$ of available rules. $H$ is also called the hypothesis class.

So, how do we define which prediction rule is the best? If we measure the cost of an error made by $h$ on $z$ through a loss function $l(h, z)$ , it would be a good idea to minimize the expected risk over all possible samples drawn from $\pD$ :

$\l_{\pD}(h) = \expectation_{z\sim\pD}\brak{l(h, z)}$

Since $\pD$ is unknown, we cannot compute this risk, let alone minimize it exactly. Enter the PAC framework.

(Agnostic) PAC Learning

In the framework of PAC Learning, we are interested in learning a good enough model $h_{PAC}$ with $\epsilon$ accuracy such that:

$\l_{\pD}(h_{PAC}) \leq \min_{h \in H} \l_{\pD}(h) + \epsilon$

And further, we’re okay if the result is only probably correct, with a probability of at most $\delta$ of making a mistake. We say that $\delta$ is the confidence level (the guarantee holds with probability at least $1 - \delta$ over the draw of $S$). In other words, we want:

$\prob\brak{\l_{\pD}(h_{PAC}) \leq \min_{h \in H} \l_{\pD}(h) + \epsilon} \geq 1 - \delta$

We say that a hypothesis class $H$ is PAC-Learnable if, given any accuracy $\epsilon$ and confidence level $\delta$, we can always find a sample size $m_{H}$ such that every sample set $S$ at least that large achieves them.

Definition: (A)PAC Learnable: We say that the hypothesis class $H$ is PAC-Learnable with respect to $Z$ and $l$ when there exists a function (called sample complexity):

$m_{H}(\epsilon, \delta):\quad ]0;1[^2 \to \mathbb{N}$

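and a learning algorithm $A$ such that, for all $\epsilon, \delta \in ]0;1[$ and every distribution $\pD$ on $Z$, writing $h_{PAC} = A(S)$ for a sample set $S$ drawn i.i.d. from $\pD$: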

$\card{S}\geq m_{H}(\epsilon, \delta) \implies \prob\brak{\l_{\pD}(h_{PAC}) \leq \min_{h \in H} \l_{\pD}(h) + \epsilon} \geq 1 - \delta$

Now, let’s see what results we can get for the following simple learning rule.

Empirical risk minimizer

Since $\pD$ is unknown, we can try to find an estimator for $\l_{\pD}(h)$ . The most common estimator for an expectation is the sample mean. By the (strong) law of large numbers and the central limit theorem, this is a very reasonable estimator:

$\l_{S}(h) = \frac{1}{m}\sum_{i = 1}^{m}l(h, z_i)$
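As a minimal sketch of this estimator, assuming the $0/1$-loss and a toy one-dimensional sample (the names `zero_one_loss` and `empirical_risk` and the data are illustrative, not taken from any library):

```python
from typing import Callable, Sequence, Tuple

Sample = Tuple[float, int]           # z = (x, y)
Hypothesis = Callable[[float], int]  # h: X -> Y


def zero_one_loss(h: Hypothesis, z: Sample) -> float:
    """l(h, z): 1 if h mislabels the sample z = (x, y), 0 otherwise."""
    x, y = z
    return 0.0 if h(x) == y else 1.0


def empirical_risk(h: Hypothesis, S: Sequence[Sample]) -> float:
    """L_S(h): the mean loss of h over the m samples of S."""
    if not S:
        raise ValueError("S must contain at least one sample")
    return sum(zero_one_loss(h, z) for z in S) / len(S)


def h(x: float) -> int:
    """Hypothetical rule: label 1 when x is at least 0.5."""
    return int(x >= 0.5)


# Hand-made sample: h is wrong only on the last point (0.5 is labeled 0).
S = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.5, 0)]
print(empirical_risk(h, S))  # one mistake out of five samples: 0.2
```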

The decision rule $h$ that minimizes this estimator is the ERM:

$ERM(S) = \argmin_{h \in H} \l_{S}(h)$
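When $H$ is finite, the $\argmin$ can be computed by brute force. Here is a small sketch, assuming a hypothetical class of eleven threshold rules on $[0, 1]$ and the $0/1$-loss (the `Threshold` class and the data are made up for the illustration):

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Sample = Tuple[float, int]  # z = (x, y) with y in {0, 1}


@dataclass(frozen=True)
class Threshold:
    """Hypothesis h_t(x) = 1 if x >= t else 0 (an illustrative choice)."""
    t: float

    def __call__(self, x: float) -> int:
        return int(x >= self.t)


def empirical_risk(h: Threshold, S: Sequence[Sample]) -> float:
    """L_S(h) under the 0/1-loss."""
    return sum(h(x) != y for x, y in S) / len(S)


def erm(H: Sequence[Threshold], S: Sequence[Sample]) -> Threshold:
    """Return a hypothesis of H with minimal empirical risk on S.

    Ties are broken in favour of the first minimizer in H.
    """
    return min(H, key=lambda h: empirical_risk(h, S))


H = [Threshold(k / 10) for k in range(11)]  # t = 0.0, 0.1, ..., 1.0
S = [(0.1, 0), (0.3, 0), (0.45, 0), (0.55, 1), (0.7, 1), (0.95, 1)]
best = erm(H, S)
print(best, empirical_risk(best, S))  # t = 0.5 separates the sample perfectly
```

Enumerating $H$ only works for small finite classes; it is meant to mirror the definition, not to be efficient.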

But how does minimizing the empirical risk relate to minimizing the expected risk? Without any restriction, the ERM is prone to overfitting and yields poor results. This is seen easily using a kNN classifier with $k=1$, for which the empirical loss is always $0$ (as long as no two training samples share the same $x$ with different labels) but the expected loss can be much larger.
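To make the 1-NN example concrete, here is a small simulation under an assumed toy distribution where the label is a fair coin flip independent of $x$, so every rule has an expected $0/1$-loss of exactly $0.5$. The function names are illustrative.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, int]


def one_nn(S: Sequence[Sample]) -> Callable[[float], int]:
    """1-nearest-neighbour rule: predict the label of the closest training x."""
    def h(x: float) -> int:
        _, label = min(S, key=lambda z: abs(z[0] - x))
        return label
    return h


def zero_one_risk(h: Callable[[float], int], S: Sequence[Sample]) -> float:
    """Mean 0/1-loss of h over the samples of S."""
    return sum(h(x) != y for x, y in S) / len(S)


def draw(m: int, rng: random.Random) -> List[Sample]:
    """Toy distribution (an assumption): x uniform on [0, 1], y a fair coin
    flip independent of x, so there is nothing to learn."""
    return [(rng.random(), rng.randint(0, 1)) for _ in range(m)]


rng = random.Random(0)
train, test = draw(200, rng), draw(2000, rng)
h = one_nn(train)
print("empirical risk:", zero_one_risk(h, train))  # each point is its own neighbour
print("held-out risk: ", zero_one_risk(h, test))   # estimates the expected risk
```

The held-out risk is only a Monte Carlo estimate of the expected risk, but it should land close to $0.5$ while the empirical risk stays at $0$.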

So what constraint can we impose on the ERM to try and fix the overfitting problem?

There are two kinds of problems that can occur with ERM:

the sample set $S$ is not representative of $\pD$;
even with a well-behaved sample set, the class $H$ might be so flexible that we overfit.

Constraints on the training set

Since the ERM is defined through $S$ , it makes sense that restricting the kind of sample set we consider would allow us to control the performance of ERM. For instance, when $S$ is $\frac{\epsilon}{2}$ -representative, we get good results.

Definition: $\epsilon$ -representative: $S$ is $\epsilon$ -representative when:

$\forall h \in H, \quad \abs{\l_{\pD}(h)-\l_{S}(h)}\leq\epsilon$
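In practice $\pD$ is unknown, so this property cannot be checked directly. On a toy problem where the true risk is known in closed form, though, the definition can be evaluated. The sketch below assumes $x$ uniform on $[0, 1]$, the label $y = 1$ exactly when $x \geq 0.5$, and a hypothetical finite class of threshold rules $h_t(x) = 1$ if $x \geq t$:

```python
from typing import Sequence, Tuple

Sample = Tuple[float, int]


def empirical_risk(t: float, S: Sequence[Sample]) -> float:
    """L_S(h_t) under the 0/1-loss, for the rule h_t(x) = 1 if x >= t else 0."""
    return sum(int(x >= t) != y for x, y in S) / len(S)


def true_risk(t: float) -> float:
    """L_D(h_t) for the assumed toy distribution: h_t errs exactly when x
    falls between t and 0.5, which has probability |t - 0.5|."""
    return abs(t - 0.5)


def representativeness_gap(thresholds: Sequence[float], S: Sequence[Sample]) -> float:
    """Largest |L_D(h) - L_S(h)| over the finite class."""
    return max(abs(true_risk(t) - empirical_risk(t, S)) for t in thresholds)


def is_representative(thresholds: Sequence[float], S: Sequence[Sample], eps: float) -> bool:
    """True when S is eps-representative for the class of thresholds."""
    return representativeness_gap(thresholds, S) <= eps


H = [k / 10 for k in range(11)]  # t = 0.0, 0.1, ..., 1.0
S = [(0.1, 0), (0.3, 0), (0.45, 0), (0.55, 1), (0.7, 1), (0.95, 1)]
print(representativeness_gap(H, S))  # worst case is t = 0.3: |0.2 - 2/6|
print(is_representative(H, S, 0.15), is_representative(H, S, 0.1))
```

With only six samples the gap is about $0.13$, so this $S$ is $0.15$-representative but not $0.1$-representative.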

Lemma: ERM on $\epsilon$ -representative $S$: If $S$ is $\frac{\epsilon}{2}$ -representative, then:

$\l_{\pD}(ERM(S)) \leq \min_{h \in H} \l_{\pD}(h) + \epsilon$

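Proof: let $h = ERM(S)$. We chain inequalities, using at each step either the fact that $S$ is $\frac{\epsilon}{2}$-representative or the definition of the ERM.

$\forall h' \in H, \quad \l_{\pD}(h) \underbrace{\leq}_{\frac{\epsilon}{2}\text{-rep.}} \l_{S}(h) + \frac{\epsilon}{2} \underbrace{\leq}_{ERM} \l_{S}(h') + \frac{\epsilon}{2} \underbrace{\leq}_{\frac{\epsilon}{2}\text{-rep.}} \l_{\pD}(h') + \epsilon$

Since this holds for every $h' \in H$, it holds in particular for the minimizer of $\l_{\pD}$ over $H$.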

Constraints on the hypothesis class

Another way to ensure the performance of the ERM is to introduce inductive bias by restricting the class $H$ of allowed hypotheses. Limiting the class of rules $h$ we consider is a way to introduce prior knowledge into the learning process.

Roughly speaking, the stronger the prior knowledge that one starts the learning process with, the easier it is to learn from further examples. However, the stronger these prior assumptions are, the less flexible the learning is – it is bound, a priori, by the commitment to these assumptions. (Shalev-Shwartz S., Ben-David S.)

Intuitively, the smaller the class, the less opportunity for overfitting, but the higher the risk of introducing bias. This is related to the famous bias-variance tradeoff.

Overfitting happens when the class of models is too flexible to be constrained appropriately by the number of samples available. For a sufficiently restricted class, it can be mitigated by using a bigger training set. Thus, it makes sense that when the hypothesis class is finite, we can get a lower bound on the number of training samples required.

Lemma: Let $H$ be a finite hypothesis class and $l$ the $0/1$ -loss (actually it is enough that $l$ is bounded). Then $H$ is PAC-Learnable with sample complexity:

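$m_{H}(\epsilon, \delta) = \left\lceil\frac{2}{\epsilon^2}\log\paren{\frac{2\card{H}}{\delta}}\right\rceil$

Proof: The proof uses Hoeffding’s bound together with a union bound over $H$. Let’s note $m = \card{S}$. The probability that $S$ fails to be $\epsilon$-representative is bounded by:

$\begin{align*} \prob\brak{\exists h\in H : \abs{\l_{\pD}(h) - \l_{S}(h)} > \epsilon} &\leq \sum_{h \in H} \prob\brak{\abs{\l_{\pD}(h) - \l_{S}(h)} > \epsilon}\\ &\leq \card{H} \times 2e^{-2m\epsilon^2} \end{align*}$

To conclude, invert the bound: the right-hand side is at most $\delta$ as soon as $m \geq \frac{1}{2\epsilon^2}\log\paren{\frac{2\card{H}}{\delta}}$. The previous result requires an $\frac{\epsilon}{2}$-representative sample set, so we replace $\epsilon$ with $\frac{\epsilon}{2}$ in this condition, which gives the stated sample complexity.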
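To get a feel for the numbers, the Hoeffding bound above can be inverted numerically. The sketch below assumes a loss bounded in $[0, 1]$. The helper `representative_sample_size` returns the smallest $m$ with $2\card{H}e^{-2m\epsilon^2} \leq \delta$, and `erm_sample_size` calls it with $\frac{\epsilon}{2}$, since the ERM lemma needs an $\frac{\epsilon}{2}$-representative sample set. Both names are made up for the illustration.

```python
import math


def representative_sample_size(h_size: int, eps: float, delta: float) -> int:
    """Smallest m such that 2 * |H| * exp(-2 * m * eps**2) <= delta, i.e. a
    sample size for which S is eps-representative with probability >= 1 - delta."""
    if h_size < 1 or not (0 < eps < 1) or not (0 < delta < 1):
        raise ValueError("need |H| >= 1 and eps, delta in ]0;1[")
    return math.ceil(math.log(2 * h_size / delta) / (2 * eps ** 2))


def erm_sample_size(h_size: int, eps: float, delta: float) -> int:
    """Sample size for which the ERM is eps-accurate with probability
    >= 1 - delta: the ERM lemma needs an (eps / 2)-representative sample."""
    return representative_sample_size(h_size, eps / 2, delta)


# Example: 1000 hypotheses, accuracy 0.1, confidence parameter 0.05.
print(representative_sample_size(1000, 0.1, 0.05))  # 530
print(erm_sample_size(1000, 0.1, 0.05))             # 2120
```

The dependence on $\card{H}$ is only logarithmic, while halving $\epsilon$ multiplies the required sample size by four.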