Regularization (L1/L2)

Regularization adds a penalty on model parameter size to the training objective so the model fits the data without growing weights excessively. L1 (Lasso) drives some weights to exactly zero for sparsity; L2 (Ridge) shrinks all weights smoothly toward zero.

What is Regularization?

Regularization is a family of techniques that discourage a model from becoming too complex by adding a penalty term to the loss function that grows with the size of the model's parameters. Instead of minimizing training error alone, the model minimizes training error plus the penalty, so it is pushed toward simpler weight configurations that generalize better to unseen data. This directly counters overfitting, where a model memorizes noise in the training set and then performs poorly on new data.

The two most common forms are L1 regularization, which penalizes the sum of absolute weight values, and L2 regularization, which penalizes the sum of squared weights. In linear regression these give rise to Lasso (L1) and Ridge (L2). A strength hyperparameter, often called lambda or alpha, controls how heavily the penalty counts relative to fitting the data.

Regularization trades a little training accuracy for better generalization.
It adds a penalty on weight magnitude to the loss, favoring simpler models.
The penalty strength (lambda or alpha) is a hyperparameter tuned via validation.
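
The "training error plus penalty" objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming mean squared error as the data loss; the function name penalized_loss and the toy arrays are hypothetical and not taken from any library.

```python
import numpy as np

def penalized_loss(w, X, y, lam, kind="l2"):
    """Mean squared error plus an L1 or L2 penalty on the weights."""
    residual = X @ w - y
    data_loss = np.mean(residual ** 2)
    if kind == "l1":
        penalty = np.sum(np.abs(w))   # sum of absolute weights
    elif kind == "l2":
        penalty = np.sum(w ** 2)      # sum of squared weights
    else:
        raise ValueError("kind must be 'l1' or 'l2'")
    return data_loss + lam * penalty

# Toy data, made up for illustration
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.25])

unpenalized = penalized_loss(w, X, y, lam=0.0)
with_l1 = penalized_loss(w, X, y, lam=0.1, kind="l1")
with_l2 = penalized_loss(w, X, y, lam=0.1, kind="l2")
print(unpenalized, with_l1, with_l2)
```

With lam set to zero the function reduces to the plain data loss; raising lam makes the same weights look more expensive, which is what pushes an optimizer toward smaller weights.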

L2 (Ridge): smooth shrinkage

L2 regularization adds a penalty equal to lambda times the sum of squared weights. Because the squared term grows quickly for large weights, Ridge strongly discourages any single large coefficient and spreads influence across correlated features. It shrinks weights smoothly toward zero but rarely sets them exactly to zero, so all features stay in the model with reduced magnitude.

Ridge is the default choice when most features are believed to be at least somewhat relevant and when features are correlated, since it handles multicollinearity gracefully by distributing weight among correlated predictors rather than picking one arbitrarily.

J(w) = Loss(w) + λ · Σⱼ wⱼ²   (L2 / Ridge)

Ridge adds the squared L2 norm of the weights to the loss. The hyperparameter lambda controls shrinkage strength; larger lambda means smaller weights and a simpler model.

Penalty is lambda times the sum of squared weights.
Shrinks all weights smoothly; almost never produces exact zeros.
Handles correlated features well and is a strong default for dense signals.
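
To make the shrinkage behavior concrete, here is a minimal sketch that assumes a squared-error loss with no intercept, for which the Ridge objective has the closed-form solution w = (XᵀX + λI)⁻¹ Xᵀy. The helper name ridge_closed_form and the synthetic, nearly duplicated features are assumptions made for illustration.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve min_w ||Xw - y||^2 + lam * ||w||^2 (squared-error loss, no intercept)."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data: x2 is almost a copy of x1, so the two features are highly correlated
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 0.1 * rng.normal(size=200)

norms = {}
weights = {}
for lam in [0.0, 1.0, 10.0, 100.0]:
    w = ridge_closed_form(X, y, lam)
    weights[lam] = w
    norms[lam] = float(np.linalg.norm(w))
    print(f"lambda={lam:>6}: w={np.round(w, 3)}, ||w||={norms[lam]:.3f}")
```

Running this should show the weight norm falling as lambda grows while both coefficients stay nonzero, which is the smooth shrinkage described above.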

L1 (Lasso): sparsity and feature selection

L1 regularization adds a penalty equal to lambda times the sum of absolute weight values. Viewed as a constraint, the penalty confines the weights to a diamond-shaped region whose corners sit on the coordinate axes, so the optimum often lands on a corner or edge where some weights are precisely zero. That makes Lasso a built-in feature selector: features whose weights become zero are effectively dropped from the model.

Lasso is preferred when the true model is believed to be sparse, meaning only a subset of features actually matter, or when an interpretable, smaller model is desired. Elastic Net combines L1 and L2 penalties to get both sparsity and the stability of Ridge under correlated features.

J(w) = Loss(w) + λ · Σⱼ |wⱼ|   (L1 / Lasso)

Lasso adds the L1 norm of the weights to the loss. Its corner geometry yields exact zeros for some coefficients, performing feature selection as lambda increases.
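
The exact zeros are easiest to see in one dimension. The sketch below assumes a single weight w fitted to a target value z with loss 0.5·(w − z)², a simplification chosen for illustration. Under that assumption the L1-penalized minimizer is the soft-thresholding rule, while the L2-penalized minimizer only rescales z. The function names are hypothetical.

```python
import numpy as np

def lasso_1d(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * |w|: soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_1d(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * w**2: pure rescaling."""
    return z / (1.0 + 2.0 * lam)

z = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])  # unpenalized solutions
lam = 0.5

w_l1 = lasso_1d(z, lam)   # entries with |z| <= lam become exactly zero
w_l2 = ridge_1d(z, lam)   # every entry shrinks but stays nonzero

print("L1:", w_l1)
print("L2:", w_l2)
```

Any unpenalized value whose magnitude is below lam is mapped to zero by the L1 rule, whereas the L2 rule divides every value by the same factor and so never reaches zero for a nonzero input.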

Penalty is lambda times the sum of absolute weights.
Drives some weights to exactly zero, performing automatic feature selection.
Elastic Net mixes L1 and L2 to balance sparsity with stability.
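
As a rough illustration of the Elastic Net mix, the sketch below uses scikit-learn's ElasticNet, whose l1_ratio argument sets the balance between the L1 and L2 terms (1.0 is pure Lasso, values near 0 approach Ridge). The synthetic dataset and the specific alpha and l1_ratio values are arbitrary assumptions, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 20 features, of which 5 carry signal
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

zero_counts = {}
for l1_ratio in [0.1, 0.5, 0.9]:
    model = make_pipeline(
        StandardScaler(),
        ElasticNet(alpha=1.0, l1_ratio=l1_ratio, max_iter=10_000),
    ).fit(X, y)
    coef = model.named_steps["elasticnet"].coef_
    zero_counts[l1_ratio] = int(np.sum(coef == 0))
    print(f"l1_ratio={l1_ratio}: {zero_counts[l1_ratio]} of {coef.size} coefficients are zero")
```

In practice both alpha and l1_ratio are tuned together, for example with ElasticNetCV.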

Using regularization in practice

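In scikit-learn, Ridge and Lasso are drop-in linear models whose alpha argument is the penalty strength. Larger alpha means stronger regularization, but the two estimators scale their data-fit terms differently (Lasso divides the squared error by twice the number of samples, Ridge does not), so the same alpha does not imply the same amount of regularization in both. The right value is found by cross-validation, and helper classes such as RidgeCV and LassoCV automate the search. Features should typically be standardized before applying L1 or L2: the penalty acts on raw coefficient magnitudes, so a feature measured in large units needs only a small coefficient and would otherwise be penalized less than a comparable feature in small units.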

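Regularization is not limited to linear models. Weight decay in neural networks plays the role of L2 regularization on the network weights: with plain SGD the two are equivalent, while with adaptive optimizers such as Adam they differ, which is why decoupled weight decay (AdamW) is implemented separately. Techniques like dropout and early stopping serve a related purpose of limiting effective model complexity.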

python

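from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import numpy as np

# Synthetic stand-in data (an assumption for illustration): 20 features, 5 informative.
# Replace with your own X (features) and y (target).
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Scale inside the pipeline so the penalty treats every feature comparably
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)

# Count exact zeros in each model instead of asserting the outcome
ridge_coef = ridge.named_steps["ridge"].coef_
lasso_coef = lasso.named_steps["lasso"].coef_
print("Ridge zeroed-out features:", int(np.sum(ridge_coef == 0)))
print("Lasso zeroed-out features:", int(np.sum(lasso_coef == 0)))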

Comparing Ridge (L2) and Lasso (L1) on standardized features in scikit-learn.
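
Choosing alpha by cross-validation might look like the sketch below, which uses scikit-learn's RidgeCV and LassoCV over a shared grid of candidate values. The grid, the fold count, and the synthetic dataset are assumptions chosen for illustration; a real search range depends on the data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 20 features, of which 5 carry signal
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Candidate penalty strengths on a log scale (an arbitrary illustrative grid)
alphas = np.logspace(-3, 3, 13)

ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas)).fit(X, y)
lasso_cv = make_pipeline(
    StandardScaler(), LassoCV(alphas=alphas, cv=5, max_iter=10_000)
).fit(X, y)

best_ridge = float(ridge_cv.named_steps["ridgecv"].alpha_)
best_lasso = float(lasso_cv.named_steps["lassocv"].alpha_)
print("Best Ridge alpha:", best_ridge)
print("Best Lasso alpha:", best_lasso)
```

One caveat with this layout: the scaler is fit once on all rows before the estimators run their internal cross-validation, so the folds share scaling statistics. Wrapping the whole pipeline in GridSearchCV is a stricter alternative when that matters.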

Standardize features first, since the penalty depends on coefficient scale.
Tune alpha with cross-validation; RidgeCV and LassoCV automate it.
Weight decay in deep learning acts as L2 regularization on the network weights (exactly so for plain SGD).
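
The link between weight decay and an L2 penalty can be checked with a single update step. The sketch below assumes plain SGD with no momentum and uses hypothetical function names and made-up numbers. It shows that a gradient step on loss + λ·Σ wⱼ² matches shrinking the weights by a factor (1 − lr·decay) and then stepping on the data gradient, provided decay = 2λ.

```python
import numpy as np

def sgd_step_l2(w, grad_data, lr, lam):
    """One SGD step on loss + lam * sum(w**2); the penalty gradient is 2 * lam * w."""
    return w - lr * (grad_data + 2.0 * lam * w)

def sgd_step_weight_decay(w, grad_data, lr, decay):
    """Shrink the weights by (1 - lr * decay), then step on the data gradient only."""
    return (1.0 - lr * decay) * w - lr * grad_data

w = np.array([1.0, -2.0, 0.5])
grad_data = np.array([0.2, -0.1, 0.0])  # made-up gradient of the data loss
lr, lam = 0.1, 0.01

via_penalty = sgd_step_l2(w, grad_data, lr, lam)
via_decay = sgd_step_weight_decay(w, grad_data, lr, decay=2.0 * lam)
print(via_penalty, via_decay)
```

This equivalence relies on the update being plain SGD. Optimizers that rescale gradients per parameter, such as Adam, treat a penalty gradient differently from a direct shrink of the weights.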

Key takeaways

Regularization adds a penalty on weight size to the loss so models generalize instead of memorizing noise.
L2 (Ridge) shrinks all weights smoothly and handles correlated features well; it rarely produces exact zeros.
L1 (Lasso) drives some weights to exactly zero, giving automatic feature selection and sparse models.
Elastic Net combines L1 and L2 to get sparsity plus stability under correlation.
Standardize features and tune the penalty strength (alpha or lambda) with cross-validation.

Frequently asked questions

What is the difference between L1 and L2 regularization?

L1 (Lasso) penalizes the sum of absolute weights and drives some to exactly zero, performing feature selection. L2 (Ridge) penalizes the sum of squared weights and shrinks all of them smoothly toward zero without eliminating any, handling correlated features well.

Why does L1 regularization produce sparse models?

The absolute-value penalty has corners on the coordinate axes, so the constrained optimum frequently lands exactly on an axis where some coefficients are zero. As the penalty strength grows, more weights are set to zero, yielding a sparse, interpretable model.

When should I use Ridge and when Lasso?

Use Ridge when most features are at least somewhat relevant or are correlated, since it distributes weight among them gracefully. Use Lasso when you expect only a few features to matter or want automatic feature selection and a smaller model.

Does regularization reduce overfitting?

Yes, that is its main purpose. By penalizing large weights, regularization limits model complexity so it captures the underlying signal rather than memorizing noise, which usually improves accuracy on unseen data when the penalty strength is well chosen.

What is weight decay?

Weight decay shrinks a neural network's weights toward zero by a small factor at each update. With plain SGD this is equivalent to adding an L2 penalty, proportional to the squared weight magnitude, to the loss. With adaptive optimizers such as Adam the two are no longer equivalent, which is why decoupled weight decay (AdamW) exists. In both forms it reduces overfitting in deep models.

Should features be standardized before applying regularization?

Yes. L1 and L2 penalties act on raw coefficient magnitudes, so a feature measured in large units needs only a small coefficient and is penalized less than an equally important feature measured in small units. Standardizing features to comparable scales before fitting ensures the penalty treats all features fairly.
