The effect of L2-regularization

Nov 07, 2018

When fitting a model to some training dataset, we want to avoid overfitting. A common method to do so is to use regularization. In this article, we discuss the impact of L2-regularization on the estimated parameters of a linear model.

What is L2-regularization?

L2-regularization adds a regularization term to the loss function. The goal is to prevent overfitting by penalizing large parameter values in favor of smaller ones. Let $S$ be some dataset and $w$ the vector of parameters:

$$L_{\text{reg}}(S, w) = \underbrace{L(S, w)}_{\text{loss}} + \underbrace{\lambda \lVert w \rVert_2^2}_{\text{regularizer}}$$

where $\lambda$ is a hyperparameter that controls the strength of the regularization.
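To make the definition concrete, here is a minimal NumPy sketch of the regularized loss, assuming a squared-error loss for $L$ (the formula above does not commit to any particular loss):

```python
import numpy as np

def regularized_loss(w, X, y, lam):
    """L_reg(S, w) = L(S, w) + lambda * ||w||_2^2, with a squared-error loss as example."""
    loss = np.sum((X @ w - y) ** 2)       # unregularized loss L(S, w)
    regularizer = lam * np.sum(w ** 2)    # lambda * ||w||_2^2
    return loss + regularizer
```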

The effect of the hyperparameter

Increasing the hyperparameter $\lambda$ moves the optimum of $L_{\text{reg}}$ closer to 0, and away from the optimum of the unregularized loss $L$.

This can be visualized in one dimension, with a single parameter $x$ and a dataset made of one sample. Take (for instance) the following loss:

$$L(x, \{x\}) = (x - 1)^2 + \frac{1}{6}(x - 1)^3$$

The regularization term is:

$$\lambda \lVert x \rVert_2^2 = \lambda x^2$$

On the graph below, we plotted this loss function (first graph) and several variants of the corresponding $L_{\text{reg}}$ for the values $\lambda = 0, 1, 2, 3, 4$. As $\lambda$ increases, the minimum of the regularized loss moves towards 0.

L2 regularization
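We can also check this numerically. The sketch below assumes the example loss written above and uses a simple grid search restricted to a range around the local minimum at $x = 1$ (the grid and its bounds are just a convenient way to locate that minimum):

```python
import numpy as np

def loss(x):
    # the example loss from above, with a local minimum at x = 1
    return (x - 1) ** 2 + (x - 1) ** 3 / 6

xs = np.linspace(0.0, 1.5, 10_000)       # search only around the local minimum
for lam in [0, 1, 2, 3, 4]:
    l_reg = loss(xs) + lam * xs ** 2     # L_reg = L + lambda * x^2
    print(lam, xs[np.argmin(l_reg)])     # the minimizer shrinks towards 0 as lam grows
```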

This can also be visualized in 2D, where we see that the optimum of the regularized loss gets closer to 0 as $\lambda$ increases. However, there is another way to conceptualize the regularization, which we present next.

Regularization seen as constrained optimization

To understand how L2-regularization impacts the parameters, we will use an example in $\mathbb{R}^2$.

Let us write $\beta = (\beta_1, \beta_2)$ for the vector of parameters.

Our estimate is:

$$\hat{\beta} = \arg\min_{\beta} \; L(S, \beta) + \lambda \lVert \beta \rVert_2^2$$

This is equivalent to a constrained optimization problem:

$$\begin{aligned} \text{minimize: } \; & L(S, \beta) \\ \text{subject to: } \; & \lVert \beta \rVert_2^2 \le s^2 \end{aligned}$$

This formulation is easier to interpret: the selected vector of parameters $\hat{\beta}$ is the vector that minimizes the loss among all vectors inside the ball of radius $s$.

This is illustrated in the picture below. The red contour lines are the contour lines of the loss function $L$. The unregularized optimum is indicated by a black dot at the location of the minimum of $L$. The ball of radius $s$ is drawn in blue. The solution to the constrained problem is the point where the ball touches the lowest contour line it can reach.

L2 regularization geometry
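The shrinkage this picture predicts is easy to verify numerically in the special case of a squared-error loss (ridge regression), which has a closed-form solution. The synthetic data and the two-feature setup below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # two features on the same scale
y = X @ np.array([3.0, 0.5]) + rng.normal(scale=0.1, size=100)

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X^T X + lam * I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for lam in [0.0, 1.0, 10.0, 100.0]:
    beta = ridge(X, y, lam)
    print(lam, beta, np.sum(beta ** 2))       # ||beta||_2^2 shrinks as lam grows
```

As $\lambda$ grows, $\lVert \hat{\beta} \rVert_2^2$ decreases, which matches the constrained picture of a ball of shrinking radius $s$.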

Effect on the individual parameters

What is the effect of the regularization on the individual parameters $\beta_1$ and $\beta_2$?

The regularized optimization estimates $\beta$ such that less influential features undergo more penalization, and their parameters therefore get shrunk down more.

On the plot above, the loss changes more rapidly along the $\beta_2$ axis than along the $\beta_1$ axis (we can see this because the contour lines are less separated along the $\beta_2$ axis than along the $\beta_1$ axis). When the two features are on the same scale, this means that $\beta_2$ is more influential than $\beta_1$.

As a result, $\beta_1$ is more penalized by the L2-constraint than $\beta_2$.

Regularization of influential parameters
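We can reproduce this effect with a toy quadratic loss whose contour lines are widely spaced along $\beta_1$ (small curvature, less influential direction) and tightly spaced along $\beta_2$ (large curvature). The curvature values below are arbitrary, chosen only to make the contrast visible; this is not the loss of the figure above:

```python
import numpy as np

# Quadratic loss sum_i h_i * (beta_i - b_i)^2 with a flat direction (h_1 small)
# and a steep direction (h_2 large); both unregularized optima sit at 1.0.
h = np.array([0.5, 5.0])           # curvatures along beta_1 and beta_2
b = np.array([1.0, 1.0])           # unregularized optimum

for lam in [0.0, 0.5, 2.0]:
    beta = h * b / (h + lam)       # minimizer of h_i*(beta_i - b_i)^2 + lam*beta_i^2
    print(lam, beta)               # beta_1 (flat direction) is shrunk much more
```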

This phenomenon explains why we should normalize the features before using regularization. More details on this in what follows.

Effect when approaching 0

Since the gradient of $\lVert \cdot \rVert_2^2$ vanishes around 0, the optimum will never move all the way to 0 if it was not already there. In other words: the parameters $\beta_i$ are pushed towards 0, but they are never set exactly to 0.

Another regularization method, L1 regularization, behaves differently: since the gradient of $\lVert \cdot \rVert_1$ does not vanish around 0, the parameters $\beta_i$ are pushed towards 0 and may reach it and stay there.
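This difference is easy to observe with scikit-learn's Ridge and Lasso estimators on synthetic data where only the first two features actually matter (the data and the alpha values below are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# only the first two features carry signal
y = X[:, 0] * 2.0 + X[:, 1] * 0.5 + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(ridge.coef_)   # small but non-zero everywhere: L2 never reaches exactly 0
print(lasso.coef_)   # irrelevant coefficients are set exactly to 0 by L1
```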

Important remark about normalization

The regularizer term $\lVert w \rVert_2^2$ treats each component the same way:

$$\lVert w \rVert_2^2 = \sum_{i=1}^{d} w_i^2$$

Therefore it is important to unify the scale of each feature vector $f_i$. Indeed, suppose that the optimal solution is $\hat{w}$ for some design matrix $X$ where:

$$X = (f_1, \dots, f_d)$$

The estimate $\hat{y}$ is:

$$\hat{y} = X \hat{w} = \sum_{i=1}^{d} \hat{w}_i f_i$$

If we rescale one column of the design matrix, say $f_0^r = \epsilon f_0$, then we do not bring any new information and the estimate $\hat{y}$ should not change. The parameter $\hat{w}_0^r$ is inversely rescaled:

$$\hat{y} = X \hat{w} = X^r \hat{w}^r$$

And:

$$\hat{w}_0 f_0 = \hat{w}_0^r f_0^r = \frac{\hat{w}_0}{\epsilon} \, (\epsilon f_0)$$

But the L2 regularization penalizes $\hat{w}_0$ and $\hat{w}_0^r = \epsilon^{-1} \hat{w}_0$ with the same strength, even though their magnitudes differ only because of the arbitrary scaling factor $\epsilon$.
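A small numerical experiment makes this concrete (a sketch with synthetic data, reusing the closed-form ridge solution from earlier; the values $\epsilon = 0.01$ and $\lambda = 10$ are arbitrary). Without regularization the rescaling is harmless, with L2 regularization it changes the fit:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, 1.0]) + rng.normal(scale=0.1, size=100)

def ridge(X, y, lam):
    """Closed-form (regularized) least-squares solution."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

Xr = X.copy()
Xr[:, 0] *= 0.01                   # f_0^r = eps * f_0 with eps = 0.01

# Without regularization, the rescaling is absorbed by the parameter
# and the fitted values do not change:
print(np.allclose(X @ ridge(X, y, 0.0), Xr @ ridge(Xr, y, 0.0)))    # True

# With L2 regularization, the fitted values do change, because the
# penalty hits the inflated parameter w_0^r much harder:
print(np.allclose(X @ ridge(X, y, 10.0), Xr @ ridge(Xr, y, 10.0)))  # False
```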

Therefore, to avoid mistaking influential parameters for non-influential ones because of scaling factors within the feature vectors, we must normalize the features.

By normalizing the columns of $X$, we put them on the same scale. Consequently, differences in the magnitudes of the components of $w$ are directly related to the wiggliness of the regression function $Xw$, which is, loosely speaking, what the regularization tries to control.

TL;DR: before using regularization, transform your feature vectors into unit vectors.
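In practice, this can be as simple as dividing each column of the design matrix by its norm. A minimal sketch (standardizing each column, for instance with scikit-learn's StandardScaler, is a common variant of the same idea):

```python
import numpy as np

def normalize_columns(X):
    """Rescale each column of X to unit L2 norm so the penalty treats them equally."""
    return X / np.linalg.norm(X, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Xr = X * np.array([0.01, 1.0])                  # arbitrary per-feature scaling factors
print(np.allclose(normalize_columns(X), normalize_columns(Xr)))   # True: scaling is gone
```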