The bias-variance decomposition

Nov 04, 2018

The MSE loss is attractive because the expected error in estimation can be explained by the bias and the variance of the model. This is called the bias-variance decomposition. In this article, we will introduce this decomposition using the tools of probability theory.

In short, the bias-variance decomposition is:

$$\mathbb{E}_S\left[\left(\hat{f}_S(X) - f(X)\right)^2\right] = \mathrm{var}(\hat{f}(X)) + \mathrm{bias}(\hat{f}(X))^2$$

In machine learning, the error in estimation is not the same as the error in prediction. The error in prediction can be explained in terms of the bias-variance-noise decomposition.

Notations

Let $(X, Y)$ be a pair of random variables on $\mathbb{R}^d \times \mathbb{R}$.

Assume there exists a function $f$ such that:

$$\mathbb{E}[Y \mid X] = f(X)$$

The goal of a regression is to use a sample $S_{\text{train}}$ to estimate this function:

$$\hat{f}_{S_{\text{train}}} \approx f$$

For instance, in a linear regression, the function $f$ is a linear function with parameter $w$:

$$\mathbb{E}[Y \mid X] = w^\top X$$

And the regression aims at estimating $w$ from the training set:

$$\hat{w}_{S_{\text{train}}} \approx w$$
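For concreteness, here is a minimal sketch of this estimation using ordinary least squares on a simulated training set. The dimension, the true $w$ and the noise level are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: E[Y|X] = w . X, with an arbitrary w
d, n = 3, 200
w_true = np.array([1.5, -2.0, 0.5])

# One training set S_train = {(x_i, y_i)}
X_train = rng.normal(size=(n, d))
y_train = X_train @ w_true + rng.normal(scale=0.3, size=n)  # noisy observations

# Least-squares estimate of w from S_train
w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
print(w_hat)  # close to w_true, but depends on the sample that was drawn
```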

The expected output

The output $\hat{f}_{S_{\text{train}}}$ of the regression depends on the sample $S = S_{\text{train}}$ that was used during training. In expectation, the estimated function is:

$$\hat{f}_E(X) = \mathbb{E}_S\left[\hat{f}_S(X)\right]$$

Bias

The bias measures how wrong the model is on average. It is the difference between this expected function and the true function:

$$\mathrm{bias}(\hat{f}(X)) = \hat{f}_E(X) - f(X)$$

Variance

The variance measures how unstable the model is. The more the estimated function $\hat{f}_S$ depends on the specific details of the training set $S$, the higher the variance. It is equal to:

$$\mathrm{var}(\hat{f}(X)) = \mathbb{E}_S\left[\hat{f}_S(X)^2\right] - \hat{f}_E(X)^2$$
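The expected output, the bias and the variance can all be estimated numerically by resampling many training sets. The sketch below assumes a toy one-dimensional setup ($f = \sin$, degree-1 polynomial fits, an arbitrary noise level) and evaluates the three quantities at a single point $x_0$.

```python
import numpy as np

rng = np.random.default_rng(0)

f = np.sin            # assumed true function f (illustrative choice)
x0 = 1.0              # point X = x0 at which we study the estimator
n, n_sets = 30, 2000  # size of each training set, number of training sets

preds = []
for _ in range(n_sets):
    # Draw one training set S and fit f_S (here: a degree-1 polynomial)
    x = rng.uniform(-3, 3, size=n)
    y = f(x) + rng.normal(scale=0.3, size=n)
    coeffs = np.polyfit(x, y, deg=1)
    preds.append(np.polyval(coeffs, x0))
preds = np.array(preds)

f_E = preds.mean()                    # expected output f_E(x0) = E_S[f_S(x0)]
bias = f_E - f(x0)                    # bias(f_hat(x0))
var = (preds ** 2).mean() - f_E ** 2  # var(f_hat(x0))
print(f"expected output {f_E:.3f}, bias {bias:.3f}, variance {var:.4f}")
```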

The bias-variance decomposition

What error in estimation can we expect when $S_{\text{train}}$ varies?

$$\begin{aligned}
\mathbb{E}_S\left[\left(\hat{f}_S(X) - f(X)\right)^2\right]
&= \mathbb{E}_S\left[\hat{f}_S(X)^2\right] + f(X)^2 - 2\, f(X)\, \mathbb{E}_S\left[\hat{f}_S(X)\right] \\
&= \mathbb{E}_S\left[\hat{f}_S(X)^2\right] + f(X)^2 - 2\, f(X)\, \hat{f}_E(X) \\
&= \mathbb{E}_S\left[\hat{f}_S(X)^2\right] - \hat{f}_E(X)^2 + \hat{f}_E(X)^2 + f(X)^2 - 2\, f(X)\, \hat{f}_E(X) \\
&= \mathbb{E}_S\left[\hat{f}_S(X)^2\right] - \hat{f}_E(X)^2 + \left(f(X) - \hat{f}_E(X)\right)^2 \\
&= \mathrm{var}(\hat{f}(X)) + \mathrm{bias}(\hat{f}(X))^2
\end{aligned}$$

The bias-variance decomposition is:

$$\mathbb{E}_S\left[\left(\hat{f}_S(X) - f(X)\right)^2\right] = \mathrm{var}(\hat{f}(X)) + \mathrm{bias}(\hat{f}(X))^2$$
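As a sanity check, the identity can be verified numerically on a toy setup (again assuming $f = \sin$ and degree-1 polynomial fits): the Monte Carlo estimate of the left-hand side should match $\mathrm{var} + \mathrm{bias}^2$ up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(1)

f = np.sin            # assumed true function (illustrative choice)
x0, n, n_sets = 1.0, 30, 5000

# Predictions f_S(x0) of many models, each trained on a fresh training set
preds = np.empty(n_sets)
for i in range(n_sets):
    x = rng.uniform(-3, 3, size=n)
    y = f(x) + rng.normal(scale=0.3, size=n)
    preds[i] = np.polyval(np.polyfit(x, y, deg=1), x0)

mse_estimation = np.mean((preds - f(x0)) ** 2)  # E_S[(f_S(x0) - f(x0))^2]
bias = preds.mean() - f(x0)
var = preds.var()
print(mse_estimation, var + bias ** 2)          # the two numbers should match closely
```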

It is important to distinguish the error in estimation (between $f$ and $\hat{f}$) from the error in prediction (between $y$ and $\hat{y}$). Let's tackle the error in prediction now.

The bias-variance-noise decomposition

A frequent use case for regression is when $y$ is a signal of $X$ distorted by some zero-mean random noise $\epsilon$:

$$\begin{cases} y = f(X) + \epsilon \\ \mathbb{E}[\epsilon] = 0 \end{cases}$$

In such cases, the error in prediction between $Y$ and $\hat{Y}$ can be expressed using the bias-variance-noise decomposition:

$$\mathbb{E}_{S,\epsilon}\left[\left(\hat{y}_S - y\right)^2\right] = \mathrm{bias}(\hat{f})^2 + \mathrm{var}(\hat{f}) + \mathrm{var}(\epsilon)$$
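The same kind of numerical check works for the prediction error, by drawing a fresh noisy observation $y = f(x_0) + \epsilon$ for each trained model. The sketch below uses the same toy assumptions as before (arbitrary choices of $f$, noise level and model class).

```python
import numpy as np

rng = np.random.default_rng(2)

f = np.sin  # assumed true function (illustrative choice)
x0, n, n_sets, sigma = 1.0, 30, 5000, 0.3

pred_errors, preds = [], []
for _ in range(n_sets):
    # Train f_S on a fresh training set ...
    x = rng.uniform(-3, 3, size=n)
    y = f(x) + rng.normal(scale=sigma, size=n)
    y_hat = np.polyval(np.polyfit(x, y, deg=1), x0)
    preds.append(y_hat)
    # ... and predict a fresh noisy observation y = f(x0) + eps
    y_new = f(x0) + rng.normal(scale=sigma)
    pred_errors.append((y_hat - y_new) ** 2)

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
noise = sigma ** 2
print(np.mean(pred_errors), bias2 + var + noise)  # should match closely
```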

See our dedicated article for more info.

Illustration

This can be illustrated using polynomial regressions. A polynomial regression of degree 1 is a linear regression, which has high bias but very low variance. As the polynomial degree increases, the bias decreases but the variance increases. A minimal sketch of this experiment is given below.
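The sketch assumes a toy relationship $f_{\text{true}} = \sin$ and uses numpy's polynomial fitting; the training error shrinks as the degree grows, since higher-degree polynomials are more flexible.

```python
import numpy as np

rng = np.random.default_rng(3)

f_true = np.sin                                  # assumed true relationship (illustrative)
x = np.sort(rng.uniform(-3, 3, size=20))
y = f_true(x) + rng.normal(scale=0.3, size=20)   # noisy observations (the red dots)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=degree)        # one regression curve (a blue curve)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree}: training MSE {train_mse:.4f}")  # decreases with the degree
```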

On the plot below:

  • the red curve is the deterministic relationship $f_{\text{true}}(x)$;
  • the red dots are observations $(x, y)$ polluted by the noise $\epsilon$ (which explains why they are not on the curve);
  • the blue curve is the regression curve of one trained model $\hat{f}_{S_{\text{train}}}$.

Polynomial fitting

To visualize the bias-variance tradeoff, we need to train several models. Let's generate several training sets $S_{\text{train}}$ from the same source. Recall that the relationship $f_{\text{true}}$ to be learned is the same across datasets, but the noisy observations are random. A sketch of this procedure is given below.
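The sketch keeps the same toy assumptions as above ($f_{\text{true}} = \sin$, a fixed polynomial degree, an arbitrary noise level): it fits one model per training set and computes the pointwise mean and spread of the curves shown in the next figures.

```python
import numpy as np

rng = np.random.default_rng(4)

f_true = np.sin                       # assumed true relationship (illustrative)
grid = np.linspace(-3, 3, 200)
n, n_sets, degree = 20, 100, 3

# Fit one model per training set; all sets share f_true but have fresh noise
curves = np.empty((n_sets, grid.size))
for i in range(n_sets):
    x = rng.uniform(-3, 3, size=n)
    y = f_true(x) + rng.normal(scale=0.3, size=n)
    curves[i] = np.polyval(np.polyfit(x, y, deg=degree), grid)

mean_curve = curves.mean(axis=0)      # the blue mean curve in the last figure
spread = curves.std(axis=0)           # width of the gray band around it
bias_curve = mean_curve - f_true(grid)
print(f"max |bias| {np.abs(bias_curve).max():.3f}, max spread {spread.max():.3f}")
```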

On the plot below:

  • the red curve is the deterministic relationship $f_{\text{true}}$;
  • we didn't plot the red dots to avoid clutter;
  • the blue curves are the regression curves for each of the models trained this way.

Variance in polynomial fitting

We can see that as the degree increases, the blue curves drift further and further apart from each other. This is a manifestation of high variance.

Taking the average of the blue curves, we can visualize the bias. In the picture below, we graphed the mean of the regression curves in blue. The gray shape around it is the spread of all the regression curves.

Variance in polynomial fitting 2

  • We can see that for low-degree polynomials, the blue curve does not fit the red curve: they have high bias. But the gray shape has a small width: they have low variance.
  • On the other hand, for high-degree polynomials, the blue curve matches the red curve almost perfectly: they have low bias. But the gray shape has a large width: they have high variance.