The MSE loss is attractive because the expected prediction error can be decomposed into the bias and variance of the model plus the variance of the noise. This is called the bias-variance-noise decomposition. In this article, we will introduce this decomposition using the tools of probability theory.
In short, when $Y = f(X) + \varepsilon$ for a $0$-mean noise $\varepsilon$ independent of $X$ and of the estimate $\hat{f}$, the bias-variance-noise decomposition is:

$$\mathbb{E}\big[(Y - \hat{f}(X))^2\big] = \underbrace{\mathbb{E}\big[f(X) - \hat{f}(X)\big]^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}\big[f(X) - \hat{f}(X)\big]}_{\text{variance}} + \underbrace{\operatorname{Var}[\varepsilon]}_{\text{noise}}$$
Notations
Let $(X, Y)$ be a pair of random variables on $\mathbb{R}^d \times \mathbb{R}$.
Assume there exists a $0$-mean random noise $\varepsilon$ and a function $f : \mathbb{R}^d \to \mathbb{R}$ such that:

$$Y = f(X) + \varepsilon$$
The goal of a regression is to use a sample $(x_1, y_1), \dots, (x_n, y_n)$ to estimate this function:

$$\hat{f} \approx f$$
For instance, in a linear regression the function is a linear function with parameter $\beta \in \mathbb{R}^d$:

$$f(x) = \beta^\top x$$
And the regression aims at estimating $\beta$ from the training set:

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^d} \; \sum_{i=1}^{n} \big(y_i - \beta^\top x_i\big)^2$$
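To make this concrete, here is a minimal NumPy sketch of this estimation step. The true parameter, noise level, and sample size below are arbitrary choices for the illustration, not part of the article's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground truth: y = beta^T x + eps, with a 0-mean Gaussian noise.
beta_true = np.array([2.0, -1.0, 0.5])   # illustrative parameter
n, d = 200, 3

X = rng.normal(size=(n, d))
eps = rng.normal(scale=0.3, size=n)      # 0-mean noise
y = X @ beta_true + eps

# Least-squares estimate of beta from the training set.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to beta_true
```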
Once the function is estimated, we can measure the error between a prediction $\hat{f}(x)$ and the true value $y$:

$$\big(y - \hat{f}(x)\big)^2$$
The expected error in prediction is:

$$\mathbb{E}\Big[\big(Y - \hat{f}(X)\big)^2\Big]$$
Define $\hat{Y} = \hat{f}(X)$ as a shorthand.
The noise $\varepsilon$ does not depend on $X$ and does not depend on $\hat{Y}$, so:

$$\mathbb{E}\big[\varepsilon \, (f(X) - \hat{Y})\big] = \mathbb{E}[\varepsilon] \; \mathbb{E}\big[f(X) - \hat{Y}\big]$$
Recall that $Y = f(X) + \varepsilon$:

$$\mathbb{E}\big[(Y - \hat{Y})^2\big] = \mathbb{E}\big[(f(X) - \hat{Y})^2\big] + 2\,\mathbb{E}\big[\varepsilon\,(f(X) - \hat{Y})\big] + \mathbb{E}\big[\varepsilon^2\big]$$
Since $\varepsilon$ is a $0$-mean noise we have:

$$\mathbb{E}\big[\varepsilon\,(f(X) - \hat{Y})\big] = \mathbb{E}[\varepsilon]\;\mathbb{E}\big[f(X) - \hat{Y}\big] = 0 \qquad \text{and} \qquad \mathbb{E}\big[\varepsilon^2\big] = \operatorname{Var}[\varepsilon]$$
Hence:

$$\mathbb{E}\big[(Y - \hat{Y})^2\big] = \mathbb{E}\big[(f(X) - \hat{Y})^2\big] + \operatorname{Var}[\varepsilon]$$
The term $\mathbb{E}\big[(f(X) - \hat{Y})^2\big]$ is exactly the estimation error between $\hat{f}$ and $f$. We can express it using the bias-variance decomposition:

$$\mathbb{E}\big[(f(X) - \hat{Y})^2\big] = \underbrace{\mathbb{E}\big[f(X) - \hat{Y}\big]^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}\big[f(X) - \hat{Y}\big]}_{\text{variance}}$$
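This step is nothing more than the variance identity $\operatorname{Var}[Z] = \mathbb{E}\big[Z^2\big] - \mathbb{E}[Z]^2$ applied to $Z = f(X) - \hat{Y}$:

$$\mathbb{E}\big[Z^2\big] = \mathbb{E}[Z]^2 + \Big(\mathbb{E}\big[Z^2\big] - \mathbb{E}[Z]^2\Big) = \mathbb{E}[Z]^2 + \operatorname{Var}[Z]$$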
Putting everything together, the bias-variance-noise decomposition is:

$$\mathbb{E}\big[(Y - \hat{Y})^2\big] = \underbrace{\mathbb{E}\big[f(X) - \hat{Y}\big]^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}\big[f(X) - \hat{Y}\big]}_{\text{variance}} + \underbrace{\operatorname{Var}[\varepsilon]}_{\text{noise}}$$
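To check the decomposition numerically, here is a Monte Carlo sketch for the linear-regression example above. The true parameter, noise level, and sample sizes are arbitrary assumptions for the illustration: we repeatedly draw a training set, fit $\hat{f}$ by least squares, and estimate each term on a fixed set of test inputs.

```python
import numpy as np

rng = np.random.default_rng(1)

beta_true = np.array([2.0, -1.0, 0.5])   # illustrative parameter
noise_std = 0.3                          # standard deviation of eps
n_train, n_runs, n_test = 50, 2000, 200
d = beta_true.size

def sample(n):
    """Draw (X, y) pairs from the model y = f(X) + eps."""
    X = rng.normal(size=(n, d))
    y = X @ beta_true + rng.normal(scale=noise_std, size=n)
    return X, y

# Fixed test inputs; f(X) is deterministic given them.
X_test = rng.normal(size=(n_test, d))
f_test = X_test @ beta_true

preds = np.empty((n_runs, n_test))
errs = np.empty(n_runs)
for r in range(n_runs):
    X_tr, y_tr = sample(n_train)
    beta_hat, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    preds[r] = X_test @ beta_hat
    y_test = f_test + rng.normal(scale=noise_std, size=n_test)
    errs[r] = np.mean((y_test - preds[r]) ** 2)

Z = f_test - preds              # f(X) - Y_hat over runs and test points
bias2 = np.mean(Z) ** 2         # E[f(X) - Y_hat]^2
variance = np.var(Z)            # Var[f(X) - Y_hat]
noise = noise_std ** 2          # Var[eps]

print(f"MSE              : {errs.mean():.4f}")
print(f"bias^2+var+noise : {bias2 + variance + noise:.4f}")
```

Since the least-squares estimator is unbiased in this well-specified setting, the bias term comes out near zero and the two printed values agree up to Monte Carlo error, with the total dominated by the variance and noise terms.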