When fitting a model to a training dataset, we want to avoid overfitting. A common method to do so is regularization. In this article, we discuss the impact of L2-regularization on the estimated parameters of a linear model.

## What is L2-regularization?

L2-regularization adds a regularization term to the loss function. The goal is to prevent overfitting by penalizing large parameters in favor of smaller ones. Let $L(\theta)$ be the loss function. The regularized loss is:

$$L_{\text{reg}}(\theta) = L(\theta) + \lambda \lVert \theta \rVert_2^2$$

where $\lambda \ge 0$ is a hyperparameter that controls how important the regularization term is.

## The effect of the hyperparameter

Increasing the hyperparameter $\lambda$ moves the optimum of $L_{\text{reg}}$ closer to $0$. This can be visualized using one feature: for a single parameter $\theta$, the regularization term is $\lambda \theta^2$. On the graph below, we plotted this loss function (first graph) and several variants of the corresponding regularized loss for increasing values of $\lambda$. This can also be visualized in 2D, where we see that the optimum of the regularized loss approaches $0$ as $\lambda$ grows. This behavior is closely related to the constrained-optimization view of regularization, which we will present next.

## Regularization seen as constrained optimization

To understand how L2-regularization impacts the parameters, we will use an example in two dimensions. Let's note $\theta = (\theta_1, \theta_2)$. Our estimate is:

$$\hat{\theta} = \arg\min_{\theta} \; L(\theta) + \lambda \lVert \theta \rVert_2^2$$

which is equivalent to a constrained optimization problem:

minimize: $L(\theta)$
subject to: $\lVert \theta \rVert_2^2 \le t$

for some $t$ that depends on $\lambda$. This formulation is easier to interpret: the selected vector of parameters $\hat{\theta}$ is the one that minimizes the loss while staying inside a ball centered at the origin. This is illustrated on the picture below. The red contour lines are the contour lines of the loss function $L$, and the constraint region is the disk $\lVert \theta \rVert_2^2 \le t$.

## Effect on the individual parameters

What is the effect of the regularization on the individual parameters $\theta_1$ and $\theta_2$? Regularized optimization shrinks all parameters toward $0$, but parameters along directions where the loss varies slowly undergo more penalization and therefore get shrunk down more. On the plot above, the loss gradient grows more rapidly along one axis than the other; as a result, the parameter along the flatter direction gets shrunk more, since reducing it costs little in loss while saving a lot of penalty. This phenomenon explains why we should normalize the features before using regularization. More details on this in what follows.
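To make the shrinkage effect concrete, here is a minimal NumPy sketch (illustrative, not code from the original article) using the closed-form ridge estimator $\hat{\theta} = (X^\top X + \lambda I)^{-1} X^\top y$ on synthetic data; the data-generating parameters are assumptions chosen for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear data: y = X @ theta_true + small noise.
n, p = 100, 2
X = rng.normal(size=(n, p))
theta_true = np.array([3.0, -2.0])
y = X @ theta_true + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form L2-regularized estimate: (X^T X + lam*I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# As lam grows, the estimated parameters shrink toward 0.
for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    theta_hat = ridge(X, y, lam)
    print(f"lam={lam:>7}: theta_hat={theta_hat}, norm={np.linalg.norm(theta_hat):.4f}")
```

With $\lambda = 0$ the estimate is the ordinary least-squares solution, close to `theta_true`; as $\lambda$ increases, the norm of the estimate decreases monotonically toward $0$.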
## Effect when approaching 0

Since the gradient of the L2 penalty $\lambda \theta^2$ is $2\lambda\theta$, the shrinkage force fades as a parameter approaches $0$: L2-regularization pushes parameters toward $0$ but rarely sets them exactly to $0$. Another regularization method, L1-regularization, has a different behavior: since the gradient of $\lambda \lvert \theta \rvert$ around $0$ keeps a constant magnitude $\lambda$, the penalty pushes with the same force all the way down, so L1-regularization often sets parameters exactly to $0$, producing sparse models.

## Important remark about normalization

The regularizer term $\lambda \lVert \theta \rVert_2^2$ penalizes every parameter on the same scale. Therefore it is important to unify the scale of each feature vector, i.e. of each column of the design matrix $X$. The estimate is:

$$\hat{\theta} = \arg\min_{\theta} \; \lVert y - X\theta \rVert_2^2 + \lambda \lVert \theta \rVert_2^2$$

If we rescale one column of the design matrix, say multiply column $j$ of $X$ by a factor $c$, the unregularized least-squares fit is unchanged: the estimate simply becomes $\hat{\theta}_j / c$, and the predictions $X\hat{\theta}$ are identical. But the parameters are no longer comparable across features: a small $\hat{\theta}_j$ may simply reflect a feature measured on a large scale rather than a weak influence, and the penalty will treat it accordingly. Therefore, to avoid mistaking influential parameters for non-influential ones because of scaling factors within the feature vectors, we must normalize the features. By norming the columns of $X$, every parameter is expressed on the common scale that the regularization tries to control.

TL;DR: before using regularization, transform your feature vectors into unit vectors.
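The scaling issue can be checked numerically. Below is a small NumPy sketch (an illustration under assumed synthetic data, not the article's own code): rescaling one column leaves the unregularized fit unchanged, changes the regularized fit, and normalizing the columns to unit norm restores invariance:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2))
y = X @ np.array([1.0, 1.0]) + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form L2-regularized estimate: (X^T X + lam*I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

c = 1000.0
X_scaled = X.copy()
X_scaled[:, 1] *= c  # rescale the second feature

# Without regularization (lam=0), rescaling is harmless: the parameter
# absorbs the factor 1/c and the predictions are unchanged.
t0, t0s = ridge(X, y, 0.0), ridge(X_scaled, y, 0.0)
print(t0, t0s)  # t0s[1] is t0[1] / c

# With regularization, the penalty acts on the raw parameter values,
# so the fitted predictions genuinely change after rescaling.
lam = 10.0
t1, t1s = ridge(X, y, lam), ridge(X_scaled, y, lam)
print(np.linalg.norm(X @ t1 - X_scaled @ t1s))  # clearly nonzero

# Norming the columns removes the dependence on the scaling factor c.
Xn = X / np.linalg.norm(X, axis=0)
Xns = X_scaled / np.linalg.norm(X_scaled, axis=0)
print(np.allclose(ridge(Xn, y, lam), ridge(Xns, y, lam)))  # True
```

After column normalization both design matrices coincide, so the regularized estimates agree regardless of how the raw features were scaled.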