Polynomial basis expansion

Nov 05, 2018

Polynomial basis expansion, also called polynomial feature augmentation, is a machine-learning preprocessing step. It consists of adding powers of the input's components to the input vector.

Example

Let $S$ be a dataset for a machine-learning task. As usual, $S$ consists of $N$ pairs of an input vector $x_n$ and an output value $y_n$:

$$S = \{(x_n, y_n)\}_{n \le N}$$

For clarity, suppose the dimensionality is two: $x_n \in \mathbb{R}^2$.

The polynomial augmentation of degree 2 for the input vector $x_n$:

$$x_n = (x_{n,1},\; x_{n,2})$$

is the vector $\Phi_2(x_n)$:

$$\Phi_2(x_n) = (x_{n,1},\; x_{n,2},\; (x_{n,1})^2,\; (x_{n,2})^2)$$

Likewise, the degree 3 augmentation is:

$$\Phi_3(x_n) = (x_{n,1},\; x_{n,2},\; (x_{n,1})^2,\; (x_{n,2})^2,\; (x_{n,1})^3,\; (x_{n,2})^3)$$

and so on…
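To make this concrete, here is a small sketch in Python/NumPy of how such a pure-power augmentation could be implemented (the function name `poly_augment` is mine, not a standard API):

```python
import numpy as np

def poly_augment(x, degree):
    """Stack the component-wise powers x, x**2, ..., x**degree.

    For x = (x_1, x_2) and degree 2 this returns (x_1, x_2, x_1^2, x_2^2),
    matching Phi_2 above (no cross terms such as x_1 * x_2 are added).
    """
    x = np.asarray(x, dtype=float)
    return np.concatenate([x ** k for k in range(1, degree + 1)])

# Degree-3 augmentation of a 2-dimensional input:
print(poly_augment([2.0, 3.0], degree=3))  # [ 2.  3.  4.  9.  8. 27.]
```

Note that, unlike this sketch, common library implementations (for example scikit-learn's `PolynomialFeatures`) also generate the cross terms such as $x_{n,1} x_{n,2}$.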

Polynomial regression and linear models

Polynomial input augmentation enriches the expressive power of linear models. Indeed, a polynomial regression of degree $d$ is simply a linear regression on the augmented inputs.

In other words, to fit the polynomial P(x):

$$P(x) = a_1 x + a_2 x^2 + a_3 x^3$$

to a dataset $S = \{(x_n, y_n)\}_{n \le N}$, it is enough to fit a linear function $f_a$ to the augmented dataset:

$$\Phi_3(S) = \{(\Phi_3(x_n), y_n)\}_{n \le N}$$

This is why a polynomial regression is a linear regression.
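As a minimal sketch of this equivalence (the data and the helper name `fit_poly_regression` are made up for illustration), we can solve the augmented least-squares problem directly with NumPy:

```python
import numpy as np

def fit_poly_regression(x, y, degree):
    """Fit P(x) = a_1 x + ... + a_d x^d as a linear regression on Phi_d(x)."""
    x = np.asarray(x, dtype=float)
    # Augmented design matrix: one column per power of x.
    X = np.column_stack([x ** k for k in range(1, degree + 1)])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs  # estimates of a_1, ..., a_d

# Toy check: recover the coefficients of a known cubic from noisy samples.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = 1.0 * x + 2.0 * x**2 + 3.0 * x**3 + 0.01 * rng.standard_normal(x.size)
print(fit_poly_regression(x, y, degree=3))  # close to [1. 2. 3.]
```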

Notation

To keep the notation simple, we often assume that the polynomial input augmentation has been done during preprocessing, and we write $x$ to denote the augmented vector $\Phi_d(x)$.

How much richer does linear regression become?

Very much! By the Stone-Weierstrass theorem (wikipedia link), every continuous function f can be uniformly approximated (as closely as desired) by a polynomial function on a closed interval.

Since our dataset is discrete (and bounded), this means that whatever the relationship $f_{\text{true}}$ between the inputs and the outputs, we can get as close as we want using polynomials (as long as the degree $d$ is big enough).
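As an illustration (the target function below is made up for the demo), fitting least-squares polynomials of increasing degree to a smooth function shows the worst-case error shrinking:

```python
import numpy as np

# Least-squares polynomial approximations of a continuous function on [-1, 1]:
# the maximum error on a dense grid shrinks as the degree grows.
x = np.linspace(-1.0, 1.0, 500)
y = np.sin(3.0 * x)
for d in (1, 3, 5, 9):
    coeffs = np.polyfit(x, y, deg=d)
    max_err = np.max(np.abs(np.polyval(coeffs, x) - y))
    print(d, max_err)
```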

Actually, since the train set is finite, we can always find a polynomial of degree $d = N-1$ that goes through every point in the train set. This means that for $d = N-1$, we can make the training error exactly 0. This is called Lagrange polynomial interpolation (wikipedia link).
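A quick sketch of this fact (the five training points below are made up): with $N$ points, a fit of degree $N-1$ reproduces every training output, up to floating-point error.

```python
import numpy as np

# 5 training points -> a degree-4 polynomial can pass through all of them.
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([1.0, -2.0, 0.5, 3.0, -1.0])

coeffs = np.polyfit(x_train, y_train, deg=x_train.size - 1)  # d = N - 1
y_hat = np.polyval(coeffs, x_train)
print(np.max(np.abs(y_hat - y_train)))  # ~0: the training error vanishes
```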

Wait… we can make the training error 0? Yes, but…

So, is it the ultimate machine-learning technique?

No, because…

Polynomial regressions of high degree tend to overfit. If you’re not sure what that means, check out my dedicated article, which is completely written and illustrated using polynomial regressions: overfitting.

[Figure: overfitting and underfitting regression lines]

The regression (design) matrix $X$ that we have to invert grows linearly with the regression's degree $d$: it keeps its $N$ rows but gains one column per power and per input dimension. This causes computational complications.
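For instance (toy shapes only), the width of the augmented design matrix grows with $d$:

```python
import numpy as np

# For N samples of 2-dimensional inputs, the pure-power augmentation
# produces 2 * d columns, so the design matrix widens linearly with d.
N, p = 100, 2
X_raw = np.random.rand(N, p)
for d in (1, 3, 10, 30):
    X_aug = np.hstack([X_raw ** k for k in range(1, d + 1)])
    print(d, X_aug.shape)  # (100, 2), (100, 6), (100, 20), (100, 60)
```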

Numerical errors accumulate when we take powers of a number: large powers of small values are rounded to 0 in floating-point arithmetic. So even if a mathematical solution exists, we might not be able to compute it in practice.
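Two small illustrations of this in double-precision Python/NumPy (toy values only):

```python
import numpy as np

# Large powers of small values underflow to exactly zero...
print(0.1 ** 400)          # 0.0
# ...and tiny contributions are rounded away next to large ones.
print(1.0 + 1e-17 == 1.0)  # True

# The augmented design matrix also becomes severely ill-conditioned as the
# degree grows, making the least-squares solution numerically fragile.
x = np.linspace(0.0, 1.0, 50)
for d in (3, 10, 20):
    X = np.column_stack([x ** k for k in range(1, d + 1)])
    print(d, np.linalg.cond(X))  # condition number blows up with d
```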

How to improve polynomial regressions?