The bias-variance-noise decomposition

Nov 04, 2018

The MSE loss is attractive because the expected prediction error decomposes into the bias and variance of the model plus the variance of the noise. This is called the bias-variance-noise decomposition. In this article, we derive this decomposition using the tools of probability theory.

In short, when $Y = f(X) + \epsilon$, the bias-variance-noise decomposition is:

$$\mathbb{E}_{S,\epsilon}\big[(\hat{Y}_S - Y)^2\big] = \mathrm{var}(\hat{f}(X)) + \mathrm{bias}(\hat{f}(X))^2 + \mathrm{var}(\epsilon)$$

Notation

Let $(X, Y)$ be a pair of random variables on $\mathbb{R}^d \times \mathbb{R}$.

Assume there exists a zero-mean random noise $\epsilon$ and a function $f$ such that:

$$Y = f(X) + \epsilon$$

The goal of regression is to use a training sample $S_{\text{train}}$ to estimate this function:

$$\hat{f}_{S_{\text{train}}} \approx f$$

For instance, in linear regression the function $f$ is a linear function with parameter $w$:

$$\mathbb{E}[Y \mid X] = w^\top X$$

The regression then aims to estimate $w$ from the training set:

$$\hat{w}_{S_{\text{train}}} \approx w$$
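
As a concrete sketch of this estimation step, here is a minimal NumPy example that fits $\hat{w}$ by ordinary least squares on a synthetic training set (the dimensions, `w_true`, and the noise scale are arbitrary choices for the illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: Y = w^T X + eps, with w_true chosen arbitrarily.
d, n = 3, 200
w_true = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(n, d))
eps = rng.normal(scale=0.3, size=n)  # zero-mean noise
y = X @ w_true + eps

# Ordinary least squares: w_hat estimates w from the training set.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)  # close to w_true
```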

Once the function $\hat{f}_{S_{\text{train}}}$ is estimated, we can measure the error between a prediction $\hat{y}_{S_{\text{train}}} = \hat{f}_{S_{\text{train}}}(x)$ and the true value $y$:

$$L_{\text{MSE}}(\hat{y}_{S_{\text{train}}}, y) = (\hat{y}_{S_{\text{train}}} - y)^2$$
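
In code, this loss averaged over a batch of predictions is a one-liner; a small sketch (the function name `mse_loss` is our own for this example):

```python
import numpy as np

def mse_loss(y_hat, y):
    """Mean of the squared prediction errors (y_hat - y)^2."""
    return np.mean((np.asarray(y_hat) - np.asarray(y)) ** 2)

print(mse_loss([1.0, 2.0], [1.5, 1.0]))  # 0.625
```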

The expected prediction error is:

$$\mathbb{E}_{S,\epsilon}\big[(\hat{y}_S - y)^2\big] = \mathbb{E}_{S,\epsilon}\big[(\hat{f}_S(X) - (f(X) + \epsilon))^2\big] = \mathbb{E}_{S,\epsilon}\big[(A - \epsilon)^2\big]$$

where we define the shorthand $A = \hat{f}_S(X) - f(X)$. Expanding the square:

$$\mathbb{E}_{S,\epsilon}\big[(A - \epsilon)^2\big] = \mathbb{E}_{S,\epsilon}[A^2] - 2\,\mathbb{E}_{S,\epsilon}[A\epsilon] + \mathbb{E}_{S,\epsilon}[\epsilon^2]$$

$A$ does not depend on $\epsilon$ and $\epsilon$ does not depend on $S$, so the cross term factorizes:

$$\mathbb{E}_{S,\epsilon}[A^2] - 2\,\mathbb{E}_{S,\epsilon}[A\epsilon] + \mathbb{E}_{S,\epsilon}[\epsilon^2] = \mathbb{E}_S[A^2] - 2\,\mathbb{E}_\epsilon[\epsilon]\,\mathbb{E}_S[A] + \mathbb{E}_\epsilon[\epsilon^2]$$

Recall that $\mathbb{E}[\epsilon] = 0$:

$$\mathbb{E}_S[A^2] - 2\,\mathbb{E}_\epsilon[\epsilon]\,\mathbb{E}_S[A] + \mathbb{E}_\epsilon[\epsilon^2] = \mathbb{E}_S[A^2] + \mathbb{E}_\epsilon[\epsilon^2]$$

Since $\epsilon$ is a zero-mean noise, we have:

$$\mathrm{var}(\epsilon) = \mathbb{E}[\epsilon^2] - \mathbb{E}[\epsilon]^2 = \mathbb{E}[\epsilon^2]$$

Hence:

$$\mathbb{E}_S[A^2] + \mathbb{E}_\epsilon[\epsilon^2] = \mathbb{E}_S[A^2] + \mathrm{var}(\epsilon)$$

The term $\mathbb{E}_S[A^2]$ is exactly the expected squared estimation error between $\hat{f}$ and $f$. We can express it using the bias-variance decomposition:

$$\mathbb{E}_S[A^2] + \mathrm{var}(\epsilon) = \mathrm{bias}(\hat{f})^2 + \mathrm{var}(\hat{f}) + \mathrm{var}(\epsilon)$$
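
For completeness, this step follows from the standard add-and-subtract argument. Writing $\bar{f}(X) = \mathbb{E}_S[\hat{f}_S(X)]$ for the average prediction over training sets (a shorthand introduced here), we have:

$$\mathbb{E}_S[A^2] = \mathbb{E}_S\big[(\hat{f}_S(X) - \bar{f}(X) + \bar{f}(X) - f(X))^2\big] = \underbrace{\mathbb{E}_S\big[(\hat{f}_S(X) - \bar{f}(X))^2\big]}_{\mathrm{var}(\hat{f})} + \underbrace{\big(\bar{f}(X) - f(X)\big)^2}_{\mathrm{bias}(\hat{f})^2}$$

The cross term $2\,\mathbb{E}_S\big[\hat{f}_S(X) - \bar{f}(X)\big]\big(\bar{f}(X) - f(X)\big)$ vanishes because $\mathbb{E}_S[\hat{f}_S(X) - \bar{f}(X)] = 0$.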

Putting everything together, the bias-variance-noise decomposition is:

$$\mathbb{E}_{S,\epsilon}\big[(\hat{y}_S - y)^2\big] = \mathrm{bias}(\hat{f})^2 + \mathrm{var}(\hat{f}) + \mathrm{var}(\epsilon)$$
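
To check the decomposition numerically, here is a small Monte Carlo sketch for a linear model at a fixed test point $x_0$ (all constants and the helper `fit_predict` are arbitrary choices for this illustration). Averaging the squared prediction error over many resampled training sets and noise draws should match $\mathrm{bias}(\hat{f})^2 + \mathrm{var}(\hat{f}) + \mathrm{var}(\epsilon)$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 3, 50, 20_000
w_true = np.array([1.5, -2.0, 0.5])
sigma = 0.3                      # noise std, so var(eps) = sigma**2
x0 = np.array([1.0, 0.5, -1.0])  # fixed test point
f_x0 = x0 @ w_true               # true value f(x0)

def fit_predict(rng):
    """Draw a fresh training set S, fit OLS, and predict at x0."""
    X = rng.normal(size=(n, d))
    y = X @ w_true + rng.normal(scale=sigma, size=n)
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return x0 @ w_hat

preds = np.array([fit_predict(rng) for _ in range(trials)])
y_test = f_x0 + rng.normal(scale=sigma, size=trials)  # independent noisy targets

lhs = np.mean((preds - y_test) ** 2)   # E_{S,eps}[(y_hat - y)^2]
bias2 = (preds.mean() - f_x0) ** 2     # bias(f_hat)^2 at x0
var = preds.var()                      # var(f_hat) at x0
rhs = bias2 + var + sigma**2

print(f"lhs = {lhs:.4f}, bias^2 + var + var(eps) = {rhs:.4f}")
```

With a well-specified linear model the bias term comes out essentially zero, so both sides should agree and be close to $\mathrm{var}(\hat{f}) + \sigma^2$.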