The bias-variance decomposition

Nov 04, 2018

The MSE loss is attractive because the expected error in estimation can be explained by the bias and the variance of the model. This is called the bias-variance decomposition. In this article, we will introduce this decomposition using the tools of probability theory.

In short, the bias-variance decomposition is:

$$\mathbb{E}_S\left[\left(\hat{f}_S(X) - f(X)\right)^2\right] = \mathrm{var}(\hat{f}(X)) + \mathrm{bias}(\hat{f}(X))^2$$

In machine learning, the error in estimation is not the same as the error in prediction. The error in prediction can be explained in terms of the bias-variance-noise decomposition.

Notations

Let $(X, Y)$ be a pair of random variables on $\mathbb{R}^d \times \mathbb{R}$.

Assume there exists a function $f$ such that:

$$\mathbb{E}[Y \mid X] = f(X)$$

The goal of a regression is to use a sample $S_{\text{train}}$ to estimate this function:

$$\hat{f}_{S_{\text{train}}} \approx f$$

For instance, in a linear regression, the function $f$ is a linear function with parameter $w$:

$$\mathbb{E}[Y \mid X] = w^\top X$$

And the regression aims at estimating $w$ from the training set:

$$\hat{w}_{S_{\text{train}}} \approx w$$
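For concreteness, here is a minimal sketch of this estimation using ordinary least squares on a simulated training set. The dimension, the true $w$ and the noise level are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: E[Y|X] = w . X, with an arbitrary w
d, n = 3, 200
w_true = np.array([1.5, -2.0, 0.5])

# One training set S_train = {(x_i, y_i)}
X_train = rng.normal(size=(n, d))
y_train = X_train @ w_true + rng.normal(scale=0.3, size=n)  # noisy observations

# Least-squares estimate of w from S_train
w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
print(w_hat)  # close to w_true, but depends on the sample that was drawn
```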

The expected output

The output $\hat{f}_{S_{\text{train}}}$ of the regression depends on the sample $S = S_{\text{train}}$ that was used during training. In expectation, the estimated function is:

$$\hat{f}_E(X) = \mathbb{E}_S\left[\hat{f}_S(X)\right]$$

Bias

The bias measures how wrong the model is on average. It is the difference between this expected function and the true function:

$$\mathrm{bias}(\hat{f}(X)) = \hat{f}_E(X) - f(X)$$

Variance

The variance measures how unstable the model is. The more the estimated function $\hat{f}_S$ depends on the specific details of the training set $S$, the higher the variance. It is equal to:

$$\mathrm{var}(\hat{f}(X)) = \mathbb{E}_S\left[\hat{f}_S(X)^2\right] - \hat{f}_E(X)^2$$
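The expected output, the bias and the variance can all be estimated numerically by resampling many training sets. The sketch below assumes a toy one-dimensional setup ($f = \sin$, degree-1 polynomial fits, an arbitrary noise level) and evaluates the three quantities at a single point $x_0$.

```python
import numpy as np

rng = np.random.default_rng(0)

f = np.sin            # assumed true function f (illustrative choice)
x0 = 1.0              # point X = x0 at which we study the estimator
n, n_sets = 30, 2000  # size of each training set, number of training sets

preds = []
for _ in range(n_sets):
    # Draw one training set S and fit f_S (here: a degree-1 polynomial)
    x = rng.uniform(-3, 3, size=n)
    y = f(x) + rng.normal(scale=0.3, size=n)
    coeffs = np.polyfit(x, y, deg=1)
    preds.append(np.polyval(coeffs, x0))
preds = np.array(preds)

f_E = preds.mean()                    # expected output f_E(x0) = E_S[f_S(x0)]
bias = f_E - f(x0)                    # bias(f_hat(x0))
var = (preds ** 2).mean() - f_E ** 2  # var(f_hat(x0))
print(f"expected output {f_E:.3f}, bias {bias:.3f}, variance {var:.4f}")
```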

The bias-variance decomposition

What error in estimation can we expect when $S_{\text{train}}$ varies?

$$\begin{aligned}
\mathbb{E}_S\left[\left(\hat{f}_S(X) - f(X)\right)^2\right]
&= \mathbb{E}_S\left[\hat{f}_S(X)^2\right] + f(X)^2 - 2\, f(X)\, \mathbb{E}_S\left[\hat{f}_S(X)\right] \\
&= \mathbb{E}_S\left[\hat{f}_S(X)^2\right] + f(X)^2 - 2\, f(X)\, \hat{f}_E(X) \\
&= \mathbb{E}_S\left[\hat{f}_S(X)^2\right] - \hat{f}_E(X)^2 + \hat{f}_E(X)^2 + f(X)^2 - 2\, f(X)\, \hat{f}_E(X) \\
&= \mathbb{E}_S\left[\hat{f}_S(X)^2\right] - \hat{f}_E(X)^2 + \left(f(X) - \hat{f}_E(X)\right)^2 \\
&= \mathrm{var}(\hat{f}(X)) + \mathrm{bias}(\hat{f}(X))^2
\end{aligned}$$

The bias-variance decomposition is:

$$\mathbb{E}_S\left[\left(\hat{f}_S(X) - f(X)\right)^2\right] = \mathrm{var}(\hat{f}(X)) + \mathrm{bias}(\hat{f}(X))^2$$
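As a sanity check, the identity can be verified numerically on a toy setup (again assuming $f = \sin$ and degree-1 polynomial fits): the Monte Carlo estimate of the left-hand side should match $\mathrm{var} + \mathrm{bias}^2$ up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(1)

f = np.sin            # assumed true function (illustrative choice)
x0, n, n_sets = 1.0, 30, 5000

# Predictions f_S(x0) of many models, each trained on a fresh training set
preds = np.empty(n_sets)
for i in range(n_sets):
    x = rng.uniform(-3, 3, size=n)
    y = f(x) + rng.normal(scale=0.3, size=n)
    preds[i] = np.polyval(np.polyfit(x, y, deg=1), x0)

mse_estimation = np.mean((preds - f(x0)) ** 2)  # E_S[(f_S(x0) - f(x0))^2]
bias = preds.mean() - f(x0)
var = preds.var()
print(mse_estimation, var + bias ** 2)          # the two numbers should match closely
```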

It is important to distinguish the error in estimation (between $f$ and $\hat{f}$) from the error in prediction (between $y$ and $\hat{y}$). Let's tackle the error in prediction now.

The bias-variance-noise decomposition

A frequent use case for regression is when $y$ is a signal of $X$ distorted by some zero-mean random noise $\epsilon$:

$$\begin{cases} y = f(X) + \epsilon \\ \mathbb{E}[\epsilon] = 0 \end{cases}$$

In such cases, the error in prediction between $Y$ and $\hat{Y}$ can be expressed using the bias-variance-noise decomposition:

$$\mathbb{E}_{S,\epsilon}\left[\left(\hat{y}_S - y\right)^2\right] = \mathrm{bias}(\hat{f})^2 + \mathrm{var}(\hat{f}) + \mathrm{var}(\epsilon)$$
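The same kind of numerical check works for the prediction error, by drawing a fresh noisy observation $y = f(x_0) + \epsilon$ for each trained model. The sketch below uses the same toy assumptions as before (arbitrary choices of $f$, noise level and model class).

```python
import numpy as np

rng = np.random.default_rng(2)

f = np.sin  # assumed true function (illustrative choice)
x0, n, n_sets, sigma = 1.0, 30, 5000, 0.3

pred_errors, preds = [], []
for _ in range(n_sets):
    # Train f_S on a fresh training set ...
    x = rng.uniform(-3, 3, size=n)
    y = f(x) + rng.normal(scale=sigma, size=n)
    y_hat = np.polyval(np.polyfit(x, y, deg=1), x0)
    preds.append(y_hat)
    # ... and predict a fresh noisy observation y = f(x0) + eps
    y_new = f(x0) + rng.normal(scale=sigma)
    pred_errors.append((y_hat - y_new) ** 2)

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
noise = sigma ** 2
print(np.mean(pred_errors), bias2 + var + noise)  # should match closely
```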

See our dedicated article for more info.

Illustration

This can be illustrated using polynomial regressions. A polynomial regression of degree 1 is a linear regression, which has high bias but very low variance. As the polynomial degree increases, the bias decreases but the variance increases. A minimal sketch of this experiment is given below.
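The sketch assumes a toy relationship $f_{\text{true}} = \sin$ and uses numpy's polynomial fitting; the training error shrinks as the degree grows, since higher-degree polynomials are more flexible.

```python
import numpy as np

rng = np.random.default_rng(3)

f_true = np.sin                                  # assumed true relationship (illustrative)
x = np.sort(rng.uniform(-3, 3, size=20))
y = f_true(x) + rng.normal(scale=0.3, size=20)   # noisy observations (the red dots)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=degree)        # one regression curve (a blue curve)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree}: training MSE {train_mse:.4f}")  # decreases with the degree
```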

On the plot below:

  • the red curve is the deterministic relationship $f_{\text{true}}(x)$;
  • the red dots are observations $(x, y)$ polluted by the noise $\epsilon$ (which explains why they are not on the curve);
  • the blue curve is the regression curve of one trained model $\hat{f}_{S_{\text{train}}}$.

Polynomial fitting

To visualize the bias-variance tradeoff, we need to train several models. Let's generate several training sets $S_{\text{train}}$ from the same source. Recall that the relationship $f_{\text{true}}$ to be learned is the same across datasets, but the noisy observations are random. A sketch of this procedure is given below.
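The sketch keeps the same toy assumptions as above ($f_{\text{true}} = \sin$, a fixed polynomial degree, an arbitrary noise level): it fits one model per training set and computes the pointwise mean and spread of the curves shown in the next figures.

```python
import numpy as np

rng = np.random.default_rng(4)

f_true = np.sin                       # assumed true relationship (illustrative)
grid = np.linspace(-3, 3, 200)
n, n_sets, degree = 20, 100, 3

# Fit one model per training set; all sets share f_true but have fresh noise
curves = np.empty((n_sets, grid.size))
for i in range(n_sets):
    x = rng.uniform(-3, 3, size=n)
    y = f_true(x) + rng.normal(scale=0.3, size=n)
    curves[i] = np.polyval(np.polyfit(x, y, deg=degree), grid)

mean_curve = curves.mean(axis=0)      # the blue mean curve in the last figure
spread = curves.std(axis=0)           # width of the gray band around it
bias_curve = mean_curve - f_true(grid)
print(f"max |bias| {np.abs(bias_curve).max():.3f}, max spread {spread.max():.3f}")
```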

On the plot below:

  • the red curve is the deterministic relationship $f_{\text{true}}$;
  • we didn't plot the red dots to avoid clutter;
  • the blue curves are the regression curves for each of the models trained this way.

Variance in polynomial fitting

We can see that as the degree increases, the blue curves drift further and further apart from each other. This is a manifestation of high variance.

Taking the average of the blue curves, we can visualize the bias. In the picture below, we graphed the mean of the regression curves in blue. The gray shape around it is the spread of all the regression curves.

Variance in polynomial fitting 2

  • We can see that for low-degree polynomials, the blue curve does not fit the red curve: they have high bias. But the gray shape has a small width: they have low variance.
  • On the other hand, for high-degree polynomials, the blue curve matches the red curve almost perfectly: they have low bias. But the gray shape has a large width: they have high variance.