What is “learning” and do we have a formal model for it? I’ve decided to dive into the theoretical underpinnings of machine-learning, so here’s a quick introduction to the theory of Probably Approximately Correct (PAC) learning.

If you want to read more on the topic, there is a very accessible book on the topic, Understanding Machine Learning: From Theory to to Algorithms which we use in the first part of the *learning theory* class at EPFL.

## Notations

Let’s start by defining a bunch of very standard notations.

We have a *domain set* of objects (say, some measurements on apples) and a *label set* of labels (say, “good”, “not good”). Let’s note and take labeled samples .

We assume that the objects in are of the same nature (say, all apples and no t-shirt). This is modeled by assuming that they are all drawn independently (i.i.d.) from a probability distribution on . The distribution is unknown to the learner and must be infered from the data.

Since the distribution is on , we can have two identically looking objects in with different labels. For instance, two apples with same size and color (same ) might taste differently (different ).

The goal of a learner is to use the training samples to learn how to label an object in with the correct label .

More precisely, we want to use to induce a *prediction rule* which assigns a label to an object . A *learning algorithm* is an algorithm that chooses the best prediction rule among a class of available rules. is also called the hypothesis class.

So, how do we define which prediction rule is the *best*? If we measure the cost of an error made by on through a loss function , it would be a good idea to minimize the expected risk over all possible samples drawn from :

Since is unknown, we can’t always achieve its minimum in practice. Enters the PAC framework.

## (Agnostic) PAC Learning

In the framework of PAC Learning, we are interested in learning a good enough model with accuracy such that:

And further, we’re okay if the result is only probably correct with less than probability to make a mistake. We say that is the confidence level. In other words, we want:

We say that an hypothesis class is PAC-Learnable if given the accuracy and the confidence level , we can always find a lower bound on the size of the sample set that makes them achievable.

- Definition: (A)PAC Learnable
- We say that the hypothesis class is PAC-Learnable with respect to and when there exists a function (called
*sample complexity*):

and a learning algorithm such that: :

Now, let’s see what result we can get on the following simple learning criterion.

## Empirical risk minimizer

Since is unknown, we can try to find an estimator for . The most common estimator for an expectation is the sample mean. By the (strong) law of large numbers and the central limit theorem, this is a very reasonable estimator:

The decision rule that minimizes this estimator is the ERM:

But how does minimizing the empirical risk relate to minizing the expected risk? Without any restriction, the ERM is prone to overfitting and yields poor results. This is seen easily using a kNN classifier with , for which the empirical loss is always but the expected loss can be arbitrarily large.

So what can constraint can we impose on the ERM to try and fix the overfitting problem?

There are two kinds of problems that can occur with ERM:

- the sample set is not representative for ;
- even with well behaved sample set, the class might be so flexible that we overfit.

## Constraints on the training set

Since the ERM is defined through , it makes sense that restricting the kind of sample set we consider would allow us to control the performance of ERM. For instance, when is -representative, we get good results.

- Definition: -representative
- is -representative when:

- Lemma: ERM on -representative
- If is -representative, then:

Proof: we chain inequalities using the definition of -representative or of the ERM at each step.

## Constraints on the hypothesis class

Another way to ensure the performance of the ERM is to introduce *inductive bias* by restricting the class of allowed hypotheses. Limiting the class of rules we consider is a way to introduce prior knowledge into the learning process.

Roughly speaking, the stronger the prior knowledge that one starts the learning process with, the easier it is to learn from further examples. However, the stronger these prior assumptions are, the less flexible the learning is – it is bound, a priori, by the commitment to these assumptions. (Shalev-Shwartz S., Ben-David S.)

Intuitively, the smaller the class, the less opportunity for overfitting, but the higher the risk of introducing bias. This is related to the famous bias-variance tradeoff.

Overfitting happens when the class of models is too flexible to be constrained appropriately by the number of samples available. It can always be solved by using a bigger training set. Thus, it makes sense that when the hypothesis class is finite, we can get a lower bound of the number of training samples required.

- Lemma
- Let be a
*finite*hypothesis class and the -loss (actually it is enough that is bounded). Then is PAC-Learnable with sample complexity:

- Proof.
- The proof uses Hoeffding’s bound. Let’s note . The confidence level to have -representative is:

To conclude, inverse the bound and use the previous result on -representative sample set.