Underfitting and overfitting illustrated

Nov 06, 2018

In this article, we define underfitting and overfitting and show some nice ways to visualize them on polynomial regressions.

In short

Underfitting and overfitting describe how well a machine-learning model generalizes, that is, makes good predictions on datasets it wasn’t trained on.

Underfitting happens when:

  • The model is too rigid to learn the true relationship in the data.
  • Both test error and train error are large.
  • The error is dominated by the bias error.

Overfitting happens when:

  • The model is not rigid enough and mistakes noise for signal.
  • The train error is low but the test error is large.
  • The error is dominated by the variance error.

[Figure: Overfitting and underfitting regression lines]

  • The left-hand graph shows the underfitting scenario. We can see that the model is so rigid that it doesn’t fit the general shape of the data. This is called the bias of the model.
  • The graph in the middle displays a good fit. The model fits the general shape of the data but does not wiggle too much between data points. This is the good bias-variance equilibrium.
  • The right-hand graph shows the overfitting scenario. The model wiggles too much, and the general shape of the data is obscured by this high variance.

The signal and the noise

Let $X$ be a random vector. Given a deterministic function $f$, we say that the random variable $f(X)$ is a signal.

Let $\epsilon$ be a Gaussian noise with mean $0$. The random variable $Y$ is our signal polluted by the noise:

$$Y = f(X) + \epsilon$$

The train dataset is made of $n$ independent observations of $(X, Y)$:

$$(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$$

This means that for each observation $(x_i, y_i)$, we have:

$$y_i = f(x_i) + \epsilon_i$$

Where:

  • $(x_i, y_i)$ is a realization of $(X, Y)$, whose value is known, and:
  • $\epsilon_i$ is a realization of $\epsilon$, whose value is unknown.
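To make this data-generating process concrete, here is a minimal sketch in Python. The sine signal, the noise standard deviation of 0.3, and the sample size of 20 are illustrative assumptions, not values taken from the figures in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # the deterministic signal function f
    return np.sin(2 * np.pi * x)

n = 20                                 # number of observations in the train set
x = rng.uniform(0.0, 1.0, size=n)      # realizations of X (known)
eps = rng.normal(0.0, 0.3, size=n)     # realizations of the noise (unknown in practice)
y = f(x) + eps                         # observed outputs y_i = f(x_i) + eps_i
```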

Learning the signal

The goal of machine learning is to learn the signal function $f$ from the inputs generated by $X$ and the outputs generated by $Y$. This task is made complex because of the unknown noise observations $\epsilon_i$.

To do so, we suppose a model $\hat{f}$ for the function $f$ and minimize the error between the observed outputs $y_i$ and the predictions $\hat{f}(x_i)$.

While our end goal is to approximate $f(x_i)$, we can only do so by comparing our predictions with $y_i = f(x_i) + \epsilon_i$. Overfitting happens when our model “learns” the specifics of the noise realizations $\epsilon_i$ in the train dataset.

To illustrate overfitting, we will use polynomial regression. As the degree of the fitted polynomial increases, the model has more freedom to fit complex signals, but also more freedom to fit the unwanted noise.

This is illustrated in the picture below, where:

  • the red curve is the real signal $f$;
  • the red points are observed values for this signal, polluted by the random noise $\epsilon_i$;
  • the blue curve is the regression line learned by a polynomial regression.

[Figure: Polynomial fitting]
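Continuing from the snippet above, a small sketch of these polynomial fits using numpy.polyfit. The degrees (1, 3, 15) are illustrative choices for an underfit, a good fit, and an overfit; note that the train error keeps shrinking as the degree grows, even once the model starts fitting noise.

```python
import numpy as np

for degree in (1, 3, 15):                     # illustrative degrees: underfit, good fit, overfit
    coeffs = np.polyfit(x, y, deg=degree)     # least-squares fit of the polynomial coefficients
    y_hat = np.polyval(coeffs, x)             # predictions on the train points
    train_mse = np.mean((y - y_hat) ** 2)     # train error shrinks as the degree grows
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}")
```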

High variance

To better visualize the implications of overfitting on the regression curve, we can generate multiple train datasets, each with the same signal curve (in red), but with different random values for the noise $\epsilon_i$. For instance:

$$y_i^{(1)} = f(x_i) + \epsilon_i^{(1)}$$

and:

$$y_i^{(2)} = f(x_i) + \epsilon_i^{(2)}$$

where $\epsilon_i^{(1)}$ and $\epsilon_i^{(2)}$ are different values for the noise, drawn from the same distribution.

Let’s generate a lot of train datasets like those.

If we fit a polynomial regression to each train dataset thus generated and graph all the regression lines (in blue) on the same plot, we can visualize the high variance induced by overfitting. See the picture below.

[Figure: Variance in polynomial fitting]
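A sketch of this experiment, reusing f, x, n and rng from the snippets above. The number of datasets (50) and the polynomial degree (15) are arbitrary illustrative choices.

```python
import numpy as np

n_datasets, degree = 50, 15
grid = np.linspace(0.0, 1.0, 200)            # points where each regression curve is evaluated

curves = []
for _ in range(n_datasets):
    eps_j = rng.normal(0.0, 0.3, size=n)     # a fresh noise draw from the same distribution
    y_j = f(x) + eps_j                       # same signal, different noise
    coeffs = np.polyfit(x, y_j, deg=degree)
    curves.append(np.polyval(coeffs, grid))  # one blue curve per train dataset
curves = np.array(curves)                    # shape (n_datasets, len(grid))
```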

The Bias-Variance Tradeoff

Since we used a Gaussian noise with mean $0$, taking the mean of the regression curves should yield the signal. Indeed, as the number $m$ of train sets used increases, the mean of the noise observations converges to $0$:

$$\frac{1}{m} \sum_{j=1}^{m} \epsilon_i^{(j)} \xrightarrow[m \to \infty]{} 0$$

Hence:

$$\frac{1}{m} \sum_{j=1}^{m} y_i^{(j)} \xrightarrow[m \to \infty]{} f(x_i)$$

In the picture below, we graphed the mean of the regression curves in blue. The gray shape around it is the spread of all the regression curves.

  • We can see that for low degree polynomials, the blue curve does not fit the red curve. We say that they have high bias. But the gray shape has small width. We say they have low variance.
  • On the other hand, for high degree polynomials, the blue curve perfectly matches the red curve. We say they have low bias. But the gray shape has large width. We say they have high variance.

[Figure: Variance in polynomial fitting 2]
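The blue mean curve and the gray spread can be computed directly from the curves array of the previous snippet. The following sketch uses matplotlib; the plotting details (one standard deviation as the band width, colors, labels) are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

mean_curve = curves.mean(axis=0)             # blue curve: average of all the fitted curves
spread = curves.std(axis=0)                  # width of the gray shape around it

plt.plot(grid, f(grid), color="red", label="signal f")
plt.plot(grid, mean_curve, color="blue", label="mean of the fits")
plt.fill_between(grid, mean_curve - spread, mean_curve + spread,
                 color="gray", alpha=0.4, label="spread of the fits")
plt.legend()
plt.show()
```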