When fitting a model to some training dataset, we want to avoid overfitting. A common method to do so is to use regularization. In this article, we discuss the impact of L2-regularization on the estimated parameters of a linear model.
What is L2-regularization
L2-regularization adds a regularization term to the loss function. The goal is to prevent overfitting by penalizing large parameters in favor of smaller ones. Let $S$ be some dataset and $\vec{w}$ the vector of parameters:
$$L_{\text{reg}}(S, \vec{w}) = \underbrace{L(S, \vec{w})}_{\text{loss}} + \underbrace{\lambda \lVert \vec{w} \rVert_2^2}_{\text{regularizer}}$$

where $\lambda$ is a hyperparameter that controls how strong the regularization is.
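To make this concrete, here is a minimal sketch in Python (assuming NumPy and a mean-squared-error loss, neither of which is specified above):

```python
import numpy as np

def l2_regularized_loss(X, y, w, lam):
    """Mean-squared-error loss plus an L2 penalty on the parameters.

    X:   design matrix of shape (n_samples, n_features)
    y:   targets of shape (n_samples,)
    w:   parameter vector of shape (n_features,)
    lam: regularization strength (the lambda hyperparameter)
    """
    loss = np.mean((X @ w - y) ** 2)    # unregularized loss L(S, w)
    penalty = lam * np.sum(w ** 2)      # lambda * ||w||_2^2
    return loss + penalty
```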
The effect of the hyperparameter
Increasing the hyperparameter $\lambda$ moves the optimum of $L_{\text{reg}}$ closer to $0$, and away from the optimum of the unregularized loss $L$.
This can be visualized using one feature and a dataset made of one sample, so that the loss depends on a single parameter $x$. Take (for instance) the following loss:

$$L(S, x) = (x - 1)^2 + \tfrac{1}{6}(x - 1)^3$$

The regularization term is:

$$\lambda \lVert x \rVert_2^2 = \lambda x^2$$

On the graph below, we plot this loss function (first graph) and several variants of the corresponding $L_{\text{reg}}$ for $\lambda = 0, 1, 2, 3, 4$. As the value of $\lambda$ increases, the minimum of the loss curve moves towards $0$.
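We can check this behaviour numerically with a simple grid search (a sketch, not the code used for the plots; it assumes the $\tfrac{1}{6}$ coefficient above):

```python
import numpy as np

def loss(x):
    # The 1D example loss: L(S, x) = (x - 1)^2 + (1/6) * (x - 1)^3
    return (x - 1) ** 2 + (x - 1) ** 3 / 6

xs = np.linspace(-1.0, 2.0, 10_001)         # grid of candidate parameter values
for lam in [0, 1, 2, 3, 4]:
    reg_loss = loss(xs) + lam * xs ** 2     # L_reg(x) = L(x) + lambda * x^2
    x_opt = xs[np.argmin(reg_loss)]         # grid minimizer of the regularized loss
    print(f"lambda = {lam}:  argmin of L_reg = {x_opt:.3f}")
```

As $\lambda$ grows, the printed minimizer slides from $1$ (the unregularized optimum) towards $0$.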
The same can be visualized in 2D, where we see that the optimum of the regularized loss gets closer and closer to $\vec{0}$ as $\lambda$ increases. However, there is another way to conceptualize the regularization, which we present next.
Regularization seen as constrained optimization
To understand how L2-regularization impacts the parameters, we will use an example in $\mathbb{R}^2$.
Let us denote by $\beta = (\beta_1, \beta_2)$ the vector of parameters.
Our estimate is:
$$\hat{\beta} = \operatorname*{argmin}_{\beta} \left( L(S, \beta) + \lambda \lVert \beta \rVert_2^2 \right)$$

which is equivalent to a constrained optimization problem:

$$\begin{aligned}
\text{minimize: } \quad & L(S, \beta) \\
\text{subject to: } \quad & \lVert \beta \rVert_2^2 \leq s^2
\end{aligned}$$
This formulation is easier to interpret: the selected vector of parameters $\hat{\beta}$ is the vector that minimizes the loss, among all vectors inside the ball of radius $s$.
This is illustrated in the picture below. The red contour lines are those of the loss function $L$. The unregularized optimum is indicated by a black dot at the location of the minimum of $L$. The ball of radius $s$ is drawn in blue. The solution to the constrained optimization is the point where a contour line of $L$ touches the boundary of the ball (assuming the unconstrained optimum lies outside the ball).
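We can verify this equivalence numerically. Below is a sketch (it assumes a squared-error loss and a small synthetic two-feature dataset, none of which come from the article): we compute the penalized (ridge) solution in closed form, set $s$ to its norm, and brute-force the constrained problem over a grid.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny synthetic two-feature regression problem (illustrative values only).
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=50)

lam = 5.0
# Penalized (ridge) solution in closed form: (X^T X + lam * I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
s = np.linalg.norm(beta_ridge)            # radius of the matching constraint ball

# Brute-force the constrained problem: minimize ||X beta - y||^2 s.t. ||beta||_2 <= s
grid = np.linspace(-3.0, 3.0, 601)
b1, b2 = np.meshgrid(grid, grid)
betas = np.stack([b1.ravel(), b2.ravel()], axis=1)
inside = np.linalg.norm(betas, axis=1) <= s
losses = np.sum((betas[inside] @ X.T - y) ** 2, axis=1)
beta_constrained = betas[inside][np.argmin(losses)]

print("penalized solution:  ", beta_ridge)
print("constrained solution:", beta_constrained)   # matches up to the grid resolution
```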
Effect on the individual parameters
What is the effect of the regularization on the individual parameters $\beta_1$ and $\beta_2$?
Regularized optimization estimates $\beta$ so that the coefficients of less influential features are shrunk down more: reducing them barely increases the loss, so the penalty dominates for those coefficients.
On the plot above, the loss increases more rapidly along $\beta_2$ than along $\beta_1$ (we can see this because the contour lines are less separated along the $\beta_2$ axis than along the $\beta_1$ axis). When both features are standardized, this means that $\beta_2$ is more influential than $\beta_1$.
As a result, $\beta_1$ is shrunk more by the L2 constraint than $\beta_2$.
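Here is a small numerical illustration of this differential shrinkage (a sketch using a hand-made quadratic loss whose curvature differs along the two axes; the numbers are not taken from the plot):

```python
import numpy as np

# Quadratic loss with a flat direction (beta_1, curvature 1) and a steep
# direction (beta_2, curvature 10); its unregularized optimum is at (1, 1).
lam = 2.0
grid = np.linspace(-0.5, 1.5, 401)
b1, b2 = np.meshgrid(grid, grid)
reg_loss = (b1 - 1.0) ** 2 + 10.0 * (b2 - 1.0) ** 2 + lam * (b1 ** 2 + b2 ** 2)

i = np.unravel_index(np.argmin(reg_loss), reg_loss.shape)
print("regularized optimum:", b1[i], b2[i])
# The flat direction is pulled much closer to 0 than the steep one:
# in closed form, beta_1* = 1 / (1 + lam) ~ 0.33 and beta_2* = 10 / (10 + lam) ~ 0.83.
```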
This phenomenon explains why we should normalize the features before using regularization; more details on this below.
Effect when approaching 0
Since the gradient of $\lVert \cdot \rVert_2^2$ vanishes at $0$, the pull of the penalty becomes arbitrarily weak near $0$, and the optimum will never move exactly there if it was not already there. In other words: the parameters $\beta_i$ are pushed towards $0$, but they are never set exactly to $0$.
Another regularization method, L1 regularization, behaves differently: since its gradient does not vanish around $0$ (the penalty keeps pulling with constant strength), the parameters $\beta_i$ are pushed towards $0$, may reach it exactly, and remain there.
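The contrast is easy to see in one dimension. Below is a sketch (a hand-made quadratic data-fit term and a grid search; the numbers are only illustrative):

```python
import numpy as np

# 1D data-fit term with its unregularized optimum at w = 0.5.
w = np.linspace(-1.0, 1.0, 2001)            # grid that contains w = 0 exactly
data_fit = (w - 0.5) ** 2

for lam in [0.5, 1.0, 2.0]:
    w_l2 = w[np.argmin(data_fit + lam * w ** 2)]     # L2: shrinks, never exactly 0
    w_l1 = w[np.argmin(data_fit + lam * np.abs(w))]  # L1: reaches exactly 0 for large lam
    print(f"lambda = {lam}:  L2 optimum = {w_l2:+.3f},  L1 optimum = {w_l1:+.3f}")
```

For $\lambda \geq 1$ the L1 optimum is exactly $0$, while the L2 optimum only keeps getting smaller.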
Important remark about normalization
The regularizer $\lVert \vec{w} \rVert_2^2$ treats each component the same way:
$$\lVert \vec{w} \rVert_2^2 = \sum_{i=1}^{d} w_i^2$$

Therefore it is important to unify the scale of each feature vector $\vec{f}_i$. Indeed, suppose that the optimal solution is $\hat{\vec{w}}$ for some design matrix $X$, where:
$$X = \begin{pmatrix} \uparrow & & \uparrow \\ \vec{f}_1 & \dots & \vec{f}_d \\ \downarrow & & \downarrow \end{pmatrix}$$

The estimate $\hat{\vec{y}}$ is:
$$\hat{\vec{y}} = X \hat{\vec{w}} = \sum_{i=1}^{d} w_i \vec{f}_i$$

If we rescale one column of the design matrix, say $\vec{f}_0^{\,r} = \epsilon \vec{f}_0$, we bring no new information and the estimate $\hat{\vec{y}}$ should not change. The parameter $w_0^r$ is inversely rescaled:
$$\hat{\vec{y}} = X \hat{\vec{w}} = X^r \hat{\vec{w}}^r$$

and:
$$w_0 \vec{f}_0 = w_0^r \vec{f}_0^{\,r} = \frac{w_0}{\epsilon} \left( \epsilon \vec{f}_0 \right)$$

But the L2 regularization penalizes $w_0$ and $w_0^r = \epsilon^{-1} w_0$ in exactly the same way, through their squared magnitude, even though they encode the same contribution to the prediction: the rescaled parameter incurs a penalty $\epsilon^{-2}$ times larger.
Therefore, to avoid mistaking influential parameters for non-influential ones because of scaling factors hidden in the features, we must normalize the features.
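To see this scaling effect concretely, here is a sketch (synthetic data, a squared-error loss, and a closed-form ridge solution, none of which are specified above) comparing the regularized and unregularized fits before and after rescaling one column:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=100)

def ridge(X, y, lam):
    # Closed-form ridge solution for the squared-error loss.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

eps = 0.01
X_r = X.copy()
X_r[:, 0] *= eps                  # rescale the first feature column by epsilon

lam = 10.0
w, w_r = ridge(X, y, lam), ridge(X_r, y, lam)
print("ridge prediction gap:", np.max(np.abs(X @ w - X_r @ w_r)))   # clearly non-zero

# Unregularized least squares, in contrast, is unaffected by the rescaling:
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
w_ols_r = np.linalg.lstsq(X_r, y, rcond=None)[0]
print("OLS prediction gap:  ", np.max(np.abs(X @ w_ols - X_r @ w_ols_r)))  # ~0
```

The rescaled column carries exactly the same information, yet the regularized fit changes, because the coefficient must be $\epsilon^{-1}$ times larger to express the same contribution and is penalized accordingly.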
By normalizing the columns of $X$, we put them on the same scale. Consequently, differences in the magnitudes of the components of $\vec{w}$ are directly related to the wiggliness of the regression function $X\vec{w}$, which is, loosely speaking, what the regularization tries to control.
TL;DR: before using regularization, transform your feature vectors into unit vectors.
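For instance (a sketch, assuming the features are stored as the columns of a NumPy array `X`):

```python
import numpy as np

def normalize_columns(X):
    # Divide each feature column by its Euclidean norm, making every column a unit vector.
    return X / np.linalg.norm(X, axis=0)

# X_unit = normalize_columns(X)   # then fit the regularized model on X_unit
```

Remember to apply the same per-column scaling factors to any new inputs at prediction time.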