General formulation
Let $(X, Y)$ be a pair of random variables on $\mathbb{R}^d \times \mathbb{R}$.
A linear regression model assumes that $Y$ is a linear function of $X$:
$$Y = \theta^\top X$$
where the parameter $\theta \in \mathbb{R}^d$ is unknown.
A common use-case for this hypothesis is when $Y$ is the sum of a linear signal $\theta^\top X$ and zero-mean random noise $\varepsilon$:
$$Y = \theta^\top X + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0$$
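This noisy linear model is easy to simulate; the dimension, sample size, $\theta$, and noise scale below are illustrative choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters: dimension d, sample size n, and a true theta.
d, n = 3, 1000
theta = np.array([1.0, -2.0, 0.5])

X = rng.normal(size=(n, d))          # n independent observations of X
eps = rng.normal(scale=0.1, size=n)  # zero-mean random noise
Y = X @ theta + eps                  # Y = theta^T X + noise
```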
Learning
Learning consists in finding an estimate $\hat\theta$ for $\theta$ based on a sample made of $n$ independent observations of $(X, Y)$:
$$\mathcal{D} = \left((X_1, Y_1), \dots, (X_n, Y_n)\right)$$
This can be done in two (sometimes equivalent) ways:
- use an appropriate estimator tailored to the distribution of $(X, Y)$; or
- define a loss function and minimize the expected risk.
Let’s detail the second approach.
Loss function
The “best” value for the parameter $\theta$ can be defined through a loss function:
$$\ell : (\hat{y}, y) \mapsto \ell(\hat{y}, y) \in \mathbb{R}_+$$
which measures the error between $\hat{y}$ and $y$. Thus, $\ell(\hat{y}, y)$ increases with how badly $\hat{y}$ estimates $y$.
Risk
For a fixed value of $\theta$, the expected error in prediction is:
$$R(\theta) = \mathbb{E}\left[\ell(\theta^\top X, Y)\right]$$
which we will name the risk.
Ideally, we would want to minimize it:
$$\theta^\star = \operatorname*{arg\,min}_{\theta} R(\theta)$$
But since we don’t know the distribution of $(X, Y)$, the true risk is unknown, so we need a proxy: we will first estimate the risk, then estimate the parameters.
Empirical risk
For a fixed $\theta$, the best estimate for the true risk given a sample is the sample average of the loss:
$$\hat{R}_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta^\top X_i, Y_i)$$
Using the usual convergence theorems (law of large numbers, CLT, Chernoff bound) we know that:
- the empirical risk is an unbiased estimator of the true risk; and
- the error between them decreases as $O(1/\sqrt{n})$.
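A minimal sketch of the empirical risk, assuming the squared loss $\ell(\hat{y}, y) = (\hat{y} - y)^2$ (one common choice; the text leaves the loss generic):

```python
import numpy as np

def empirical_risk(theta, X, Y):
    """Sample average of the squared loss over the n observations."""
    predictions = X @ theta
    return np.mean((predictions - Y) ** 2)

# Synthetic data with illustrative parameters.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
Y = X @ np.array([1.0, -1.0]) + rng.normal(scale=0.1, size=500)

# At the true parameter, the empirical risk is close to the noise variance (0.01);
# at a wrong parameter, it is much larger.
print(empirical_risk(np.array([1.0, -1.0]), X, Y))
print(empirical_risk(np.zeros(2), X, Y))
```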
Learning using the empirical risk
Given a sample dedicated to training, we can compute our estimate for $\theta$:
$$\hat\theta_{\text{train}} = \operatorname*{arg\,min}_{\theta} \hat{R}_{\text{train}}(\theta)$$
We use “train” as a subscript to remind us that $\hat\theta_{\text{train}}$ greatly depends on the training sample.
The training error is the empirical risk computed on the training sample:
$$\hat{R}_{\text{train}}(\hat\theta_{\text{train}})$$
It can’t reliably be used to estimate $R(\hat\theta_{\text{train}})$ because of the way $\hat\theta_{\text{train}}$ is computed. For instance, with extreme overfitting it can happen that the training error is $0$ while the true risk is way above $0$.
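This overfitting scenario can be reproduced with a hypothetical over-parameterized least-squares fit (more parameters than training points, so the training sample is interpolated exactly):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: d = 50 parameters but only n = 10 training points,
# so least squares can interpolate the training sample exactly.
d, n_train, n_test = 50, 10, 1000
theta = rng.normal(size=d)

X_train = rng.normal(size=(n_train, d))
Y_train = X_train @ theta + rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, d))
Y_test = X_test @ theta + rng.normal(size=n_test)

# The minimum-norm least-squares fit interpolates: training error is (numerically) 0...
theta_hat = np.linalg.lstsq(X_train, Y_train, rcond=None)[0]
train_error = np.mean((X_train @ theta_hat - Y_train) ** 2)

# ...while the error on fresh data stays far above 0.
test_error = np.mean((X_test @ theta_hat - Y_test) ** 2)
print(train_error, test_error)
```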
How does the model perform on unseen data?
To assess how our model performs on unseen data, we estimate the risk $R(\hat\theta_{\text{train}})$, which is also called the generalization error.
As explained previously, we can do so using a sample on which $\hat\theta_{\text{train}}$ does not depend, i.e.
$$\mathcal{D}_{\text{test}} \perp \mathcal{D}_{\text{train}}$$
Under this condition, the test error:
$$\hat{R}_{\text{test}}(\hat\theta_{\text{train}})$$
is an unbiased estimate of the generalization error and converges at the speed of $O(1/\sqrt{n_{\text{test}}})$.
How big should the testing sample be?
This can be answered using a Chernoff bound:
$$\mathbb{P}\left(\left|\hat{R}_{\text{test}}(\hat\theta_{\text{train}}) - R(\hat\theta_{\text{train}})\right| > \varepsilon\right) \le 2 e^{-2 n_{\text{test}} \varepsilon^2}$$
If we assume for instance that the loss is bounded in $[0, 1]$, the probability that the true risk and the empirical risk differ by more than $\varepsilon$ is less than $2 e^{-2 n_{\text{test}} \varepsilon^2}$. When the test sample contains … samples, this evaluates to …
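A sketch of how such a bound can be evaluated numerically, assuming a loss bounded in $[0, 1]$ so that the bound reads $2e^{-2n\varepsilon^2}$; the values of $n$ and $\varepsilon$ below are illustrative, since the text's own numbers are elided:

```python
import math

def hoeffding_bound(n, eps):
    """Upper bound on P(|empirical risk - true risk| > eps) for a [0, 1]-bounded loss."""
    return 2 * math.exp(-2 * n * eps ** 2)

# Illustrative values: the bound shrinks fast as the test sample grows.
for n in (100, 1000, 10000):
    print(n, hoeffding_bound(n, eps=0.05))
# e.g. hoeffding_bound(1000, 0.05) = 2 * exp(-5), about 0.0135
```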