When fitting a model to some training dataset, we want to avoid overfitting. A common method to do so is to use regularization. In this article, we discuss the impact of L2-regularization on the estimated parameters of a linear model.
What is L2-regularization
L2-regularization adds a regularization term to the loss function. The goal is to prevent overfitting by penalizing large parameters in favor of smaller ones. Let $S$ be some dataset and $\vec{w}$ the vector of parameters:
$$L_{\text{reg}}(S, \vec{w}) = \underbrace{L(S, \vec{w})}_{\text{loss}} + \underbrace{\lambda \lVert \vec{w} \rVert_2^2}_{\text{regularizer}}$$

where $\lambda$ is a hyperparameter that controls how strong the regularization is.
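To make this concrete, here is a minimal sketch in Python (assuming NumPy and a mean-squared-error loss, neither of which is specified above):

```python
import numpy as np

def l2_regularized_loss(X, y, w, lam):
    """Mean-squared-error loss plus an L2 penalty on the parameters.

    X:   design matrix of shape (n_samples, n_features)
    y:   targets of shape (n_samples,)
    w:   parameter vector of shape (n_features,)
    lam: regularization strength (the lambda hyperparameter)
    """
    loss = np.mean((X @ w - y) ** 2)    # unregularized loss L(S, w)
    penalty = lam * np.sum(w ** 2)      # lambda * ||w||_2^2
    return loss + penalty
```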
The effect of the hyperparameter
Increasing the hyperparameter $\lambda$ moves the optimum of $L_{\text{reg}}$ closer to $0$, and away from the optimum of the unregularized loss $L$.
This can be visualized using one feature and a dataset made of one sample, so that the loss depends on a single parameter $x$. Take (for instance) the following loss:

$$L(S, x) = (x - 1)^2 + \tfrac{1}{6}(x - 1)^3$$

The regularization term is:

$$\lambda \lVert x \rVert_2^2 = \lambda x^2$$

On the graph below, we plot this loss function (first graph) and several variants of the corresponding $L_{\text{reg}}$ for $\lambda = 0, 1, 2, 3, 4$. As the value of $\lambda$ increases, the minimum of the loss curve moves towards $0$.
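We can check this behaviour numerically with a simple grid search (a sketch, not the code used for the plots; it assumes the $\tfrac{1}{6}$ coefficient above):

```python
import numpy as np

def loss(x):
    # The 1D example loss: L(S, x) = (x - 1)^2 + (1/6) * (x - 1)^3
    return (x - 1) ** 2 + (x - 1) ** 3 / 6

xs = np.linspace(-1.0, 2.0, 10_001)         # grid of candidate parameter values
for lam in [0, 1, 2, 3, 4]:
    reg_loss = loss(xs) + lam * xs ** 2     # L_reg(x) = L(x) + lambda * x^2
    x_opt = xs[np.argmin(reg_loss)]         # grid minimizer of the regularized loss
    print(f"lambda = {lam}:  argmin of L_reg = {x_opt:.3f}")
```

As $\lambda$ grows, the printed minimizer slides from $1$ (the unregularized optimum) towards $0$.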
The same can be visualized in 2D, where we see that the optimum of the regularized loss gets closer and closer to $\vec{0}$ as $\lambda$ increases. However, there is another way to conceptualize the regularization, which we present next.
Regularization seen as constrained optimization
To understand how L2-regularization impacts the parameters, we will use an example in $\mathbb{R}^2$.
Let us denote by $\beta = (\beta_1, \beta_2)$ the vector of parameters.
Our estimate is:
$$\hat{\beta} = \operatorname*{argmin}_{\beta} \left( L(S, \beta) + \lambda \lVert \beta \rVert_2^2 \right)$$

which is equivalent to a constrained optimization problem:

$$\begin{aligned}
\text{minimize: } \quad & L(S, \beta) \\
\text{subject to: } \quad & \lVert \beta \rVert_2^2 \leq s^2
\end{aligned}$$
This formulation is easier to interpret: the selected vector of parameters $\hat{\beta}$ is the vector that minimizes the loss, among all vectors inside the ball of radius $s$.
This is illustrated in the picture below. The red contour lines are those of the loss function $L$. The unregularized optimum is indicated by a black dot at the location of the minimum of $L$. The ball of radius $s$ is drawn in blue. The solution to the constrained optimization is the point where a contour line of $L$ touches the boundary of the ball (assuming the unconstrained optimum lies outside the ball).
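We can verify this equivalence numerically. Below is a sketch (it assumes a squared-error loss and a small synthetic two-feature dataset, none of which come from the article): we compute the penalized (ridge) solution in closed form, set $s$ to its norm, and brute-force the constrained problem over a grid.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny synthetic two-feature regression problem (illustrative values only).
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=50)

lam = 5.0
# Penalized (ridge) solution in closed form: (X^T X + lam * I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
s = np.linalg.norm(beta_ridge)            # radius of the matching constraint ball

# Brute-force the constrained problem: minimize ||X beta - y||^2 s.t. ||beta||_2 <= s
grid = np.linspace(-3.0, 3.0, 601)
b1, b2 = np.meshgrid(grid, grid)
betas = np.stack([b1.ravel(), b2.ravel()], axis=1)
inside = np.linalg.norm(betas, axis=1) <= s
losses = np.sum((betas[inside] @ X.T - y) ** 2, axis=1)
beta_constrained = betas[inside][np.argmin(losses)]

print("penalized solution:  ", beta_ridge)
print("constrained solution:", beta_constrained)   # matches up to the grid resolution
```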
Effect on the individual parameters
What is the effect of the regularization on the individual parameters $\beta_1$ and $\beta_2$?
Regularized optimization estimates $\beta$ so that the coefficients of less influential features are shrunk down more: reducing them barely increases the loss, so the penalty dominates for those coefficients.
On the plot above, the loss increases more rapidly along $\beta_2$ than along $\beta_1$ (we can see this because the contour lines are less separated along the $\beta_2$ axis than along the $\beta_1$ axis). When both features are standardized, this means that $\beta_2$ is more influential than $\beta_1$.
As a result, $\beta_1$ is shrunk more by the L2 constraint than $\beta_2$.
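Here is a small numerical illustration of this differential shrinkage (a sketch using a hand-made quadratic loss whose curvature differs along the two axes; the numbers are not taken from the plot):

```python
import numpy as np

# Quadratic loss with a flat direction (beta_1, curvature 1) and a steep
# direction (beta_2, curvature 10); its unregularized optimum is at (1, 1).
lam = 2.0
grid = np.linspace(-0.5, 1.5, 401)
b1, b2 = np.meshgrid(grid, grid)
reg_loss = (b1 - 1.0) ** 2 + 10.0 * (b2 - 1.0) ** 2 + lam * (b1 ** 2 + b2 ** 2)

i = np.unravel_index(np.argmin(reg_loss), reg_loss.shape)
print("regularized optimum:", b1[i], b2[i])
# The flat direction is pulled much closer to 0 than the steep one:
# in closed form, beta_1* = 1 / (1 + lam) ~ 0.33 and beta_2* = 10 / (10 + lam) ~ 0.83.
```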
This phenomenon explains why we should normalize the features before using regularization; more details on this below.
Effect when approaching 0
Since the gradient of $\lVert \cdot \rVert_2^2$ vanishes at $0$, the pull of the penalty becomes arbitrarily weak near $0$, and the optimum will never move exactly there if it was not already there. In other words: the parameters $\beta_i$ are pushed towards $0$, but they are never set exactly to $0$.
Another regularization method, L1 regularization, behaves differently: since its gradient does not vanish around $0$ (the penalty keeps pulling with constant strength), the parameters $\beta_i$ are pushed towards $0$, may reach it exactly, and remain there.
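The contrast is easy to see in one dimension. Below is a sketch (a hand-made quadratic data-fit term and a grid search; the numbers are only illustrative):

```python
import numpy as np

# 1D data-fit term with its unregularized optimum at w = 0.5.
w = np.linspace(-1.0, 1.0, 2001)            # grid that contains w = 0 exactly
data_fit = (w - 0.5) ** 2

for lam in [0.5, 1.0, 2.0]:
    w_l2 = w[np.argmin(data_fit + lam * w ** 2)]     # L2: shrinks, never exactly 0
    w_l1 = w[np.argmin(data_fit + lam * np.abs(w))]  # L1: reaches exactly 0 for large lam
    print(f"lambda = {lam}:  L2 optimum = {w_l2:+.3f},  L1 optimum = {w_l1:+.3f}")
```

For $\lambda \geq 1$ the L1 optimum is exactly $0$, while the L2 optimum only keeps getting smaller.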
Important remark about normalization
The regularizer $\lVert \vec{w} \rVert_2^2$ treats each component the same way:
$$\lVert \vec{w} \rVert_2^2 = \sum_{i=1}^{d} w_i^2$$

Therefore it is important to unify the scale of each feature vector $\vec{f}_i$. Indeed, suppose that the optimal solution is $\hat{\vec{w}}$ for some design matrix $X$, where:
$$X = \begin{pmatrix} \uparrow & & \uparrow \\ \vec{f}_1 & \dots & \vec{f}_d \\ \downarrow & & \downarrow \end{pmatrix}$$

The estimate $\hat{\vec{y}}$ is:
$$\hat{\vec{y}} = X \hat{\vec{w}} = \sum_{i=1}^{d} w_i \vec{f}_i$$

If we rescale one column of the design matrix, say $\vec{f}_0^{\,r} = \epsilon \vec{f}_0$, we bring no new information and the estimate $\hat{\vec{y}}$ should not change. The parameter $w_0^r$ is inversely rescaled:
$$\hat{\vec{y}} = X \hat{\vec{w}} = X^r \hat{\vec{w}}^r$$

and:
$$w_0 \vec{f}_0 = w_0^r \vec{f}_0^{\,r} = \frac{w_0}{\epsilon} \left( \epsilon \vec{f}_0 \right)$$

But the L2 regularization penalizes $w_0$ and $w_0^r = \epsilon^{-1} w_0$ in exactly the same way, through their squared magnitude, even though they encode the same contribution to the prediction: the rescaled parameter incurs a penalty $\epsilon^{-2}$ times larger.
Therefore, to avoid mistaking influential parameters for non-influential ones because of scaling factors hidden in the features, we must normalize the features.
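To see this scaling effect concretely, here is a sketch (synthetic data, a squared-error loss, and a closed-form ridge solution, none of which are specified above) comparing the regularized and unregularized fits before and after rescaling one column:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=100)

def ridge(X, y, lam):
    # Closed-form ridge solution for the squared-error loss.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

eps = 0.01
X_r = X.copy()
X_r[:, 0] *= eps                  # rescale the first feature column by epsilon

lam = 10.0
w, w_r = ridge(X, y, lam), ridge(X_r, y, lam)
print("ridge prediction gap:", np.max(np.abs(X @ w - X_r @ w_r)))   # clearly non-zero

# Unregularized least squares, in contrast, is unaffected by the rescaling:
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
w_ols_r = np.linalg.lstsq(X_r, y, rcond=None)[0]
print("OLS prediction gap:  ", np.max(np.abs(X @ w_ols - X_r @ w_ols_r)))  # ~0
```

The rescaled column carries exactly the same information, yet the regularized fit changes, because the coefficient must be $\epsilon^{-1}$ times larger to express the same contribution and is penalized accordingly.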
By normalizing the columns of $X$, we put them on the same scale. Consequently, differences in the magnitudes of the components of $\vec{w}$ are directly related to the wiggliness of the regression function $X\vec{w}$, which is, loosely speaking, what the regularization tries to control.
TL;DR: before using regularization, transform your feature vectors into unit vectors.
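For instance (a sketch, assuming the features are stored as the columns of a NumPy array `X`):

```python
import numpy as np

def normalize_columns(X):
    # Divide each feature column by its Euclidean norm, making every column a unit vector.
    return X / np.linalg.norm(X, axis=0)

# X_unit = normalize_columns(X)   # then fit the regularized model on X_unit
```

Remember to apply the same per-column scaling factors to any new inputs at prediction time.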