
Logistic Regression

Logistic regression is a linear classification algorithm that models the probability of a class by passing a weighted sum of input features through the sigmoid function, then trains the weights by minimizing log loss.

What is Logistic Regression?

Logistic regression is a supervised learning algorithm for classification that estimates the probability that an input belongs to a particular class. Despite the word regression in its name, it predicts a probability between 0 and 1 rather than an unbounded continuous quantity, and a decision threshold (commonly 0.5) converts that probability into a class label.

The model computes a linear combination of the input features, z = w·x + b, and then squashes z into a probability using the logistic (sigmoid) function. Training adjusts the weight vector w and bias b so that predicted probabilities match the observed labels as closely as possible, measured by the log loss (cross-entropy) objective.

The output is a probability estimate between 0 and 1, not just a hard label.
The decision boundary is linear in the feature space.
Binary logistic regression extends to multiple classes via softmax (multinomial) or one-vs-rest schemes.
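
The prediction step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the weights, bias, and input values are made-up numbers, and `predict_proba_one` and `predict_label` are hypothetical helper names.

```python
import math

def predict_proba_one(x, w, b):
    """Return P(y=1 | x) for one example: sigmoid of the linear score w·x + b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(x, w, b, threshold=0.5):
    """Convert the probability into a hard 0/1 label using a decision threshold."""
    return 1 if predict_proba_one(x, w, b) >= threshold else 0

# Illustrative (made-up) parameters and input
w = [0.8, -0.4]
b = -0.2
x = [2.0, 1.0]

p = predict_proba_one(x, w, b)        # z = 0.8*2.0 - 0.4*1.0 - 0.2 = 1.0
print(round(p, 3))                    # sigmoid(1.0) ≈ 0.731
print(predict_label(x, w, b))         # 1 at the default threshold of 0.5
print(predict_label(x, w, b, 0.9))    # 0 with a stricter threshold
```

The same probability yields different labels under different thresholds, which is why the threshold is best treated as a separate decision from the model itself.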

The sigmoid function and the model

The core of binary logistic regression is the sigmoid function, which maps any real number to the open interval (0, 1). A large positive z yields a probability near 1, a large negative z yields a probability near 0, and z = 0 maps to exactly 0.5.

Because the log-odds (logit) of the predicted probability equals the linear term z, the coefficients have a direct interpretation: each weight is the change in log-odds per unit change in its feature, holding others fixed.

σ(z) = 1 / (1 + e^(−z)),   where z = w·x + b

The sigmoid maps the linear score z into a probability between 0 and 1.
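
A direct translation of the formula can overflow for very negative z, because e^(−z) becomes too large to represent. The sketch below shows one common workaround, assuming plain Python floats: it switches to the algebraically equivalent form e^z / (1 + e^z) when z is negative. The function name `sigmoid` is just a label chosen for this example.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function.

    For z >= 0 use 1 / (1 + e^-z); for z < 0 use the equivalent
    e^z / (1 + e^z), so exp() never receives a large positive argument.
    """
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Behavior at the extremes and at the midpoint
for z in (-1000.0, -5.0, 0.0, 5.0, 1000.0):
    print(z, sigmoid(z))
```

Mathematically the output stays strictly inside (0, 1), but in floating point it saturates to exactly 0.0 or 1.0 for extreme scores, which matters when the result is later passed to a logarithm.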

P(y=1 | x) = σ(w·x + b),   logit(p) = ln(p / (1−p)) = w·x + b

The predicted probability and its log-odds, which are linear in the features.

The logit transform makes the relationship between features and log-odds linear.
Exponentiating a coefficient gives an odds ratio, useful for interpretation.
The model is a single-layer neural network with one sigmoid output unit.
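
The log-odds interpretation can be checked numerically. The sketch below uses a made-up one-feature model (w = 0.7, b = −1.0 are illustrative values, and `odds` is a hypothetical helper) to show that the logit undoes the sigmoid and that a one-unit increase in the feature multiplies the odds by e^w, regardless of the starting point.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

# Illustrative (made-up) one-feature model: z = w*x + b
w, b = 0.7, -1.0

def odds(x):
    """Odds p / (1 - p) of the positive class at feature value x."""
    p = sigmoid(w * x + b)
    return p / (1.0 - p)

# logit undoes the sigmoid, so the log-odds equal the linear score
z = w * 2.0 + b
print(abs(logit(sigmoid(z)) - z) < 1e-9)      # True

# Raising x by one unit multiplies the odds by exp(w), whatever x was
print(odds(3.0) / odds(2.0), math.exp(w))     # both ≈ 2.0138
print(odds(-1.0) / odds(-2.0), math.exp(w))   # same ratio at a different x
```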

Training with log loss

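Logistic regression has no closed-form solution like ordinary least squares, so the weights are fit by numerically minimizing the binary cross-entropy, also called log loss. This loss is convex in the parameters, so any local minimum is also a global minimum and gradient-based optimizers cannot get trapped in a poor local solution. On perfectly separable data without regularization, the loss keeps decreasing as the weights grow without bound, which is one practical reason to regularize.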

Solvers such as L-BFGS, Newton-CG, SAG, SAGA, and coordinate descent (liblinear) are used in practice. Regularization (L2 by default, or L1 for sparsity) is added to the objective to control overfitting, especially when features are numerous or correlated.

L = −(1/N) Σᵢ [ yᵢ ln(pᵢ) + (1−yᵢ) ln(1−pᵢ) ]

Binary cross-entropy (log loss) averaged over N training examples, where pᵢ is the predicted probability for example i.

Log loss is convex, so every local minimum is a global minimum.
L2 regularization shrinks coefficients; L1 can drive some to exactly zero.
The gradient of log loss with respect to z is simply the prediction error (p − y).
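
As a quick sanity check of the loss and its gradient, the sketch below computes the averaged log loss for a handful of made-up scores and labels, then compares a finite-difference derivative of the per-example loss with the analytic value p − y. The helper names `example_loss` and `log_loss` are chosen for this example only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def example_loss(z, y):
    """Log loss of a single example as a function of its linear score z."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def log_loss(scores, labels):
    """Binary cross-entropy averaged over N examples."""
    return sum(example_loss(z, y) for z, y in zip(scores, labels)) / len(labels)

# Illustrative (made-up) scores and labels
scores = [2.0, -1.0, 0.5, -3.0]
labels = [1, 0, 0, 1]
print(round(log_loss(scores, labels), 4))

# Central finite difference of the per-example loss versus the analytic p - y
eps = 1e-6
for z, y in zip(scores, labels):
    numeric = (example_loss(z + eps, y) - example_loss(z - eps, y)) / (2 * eps)
    analytic = sigmoid(z) - y
    print(f"z={z:+.1f} y={y} numeric={numeric:+.6f} analytic={analytic:+.6f}")
```

The last example (z = −3 with label 1) dominates the average, which illustrates how log loss penalizes confident wrong predictions heavily.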

Using logistic regression in scikit-learn

The scikit-learn library provides LogisticRegression, which fits binary or multinomial models and exposes both class predictions and probability estimates. The example below trains a classifier and reads off probabilities and learned coefficients.

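Recent scikit-learn versions deprecate the penalty argument in favor of l1_ratio, the elastic-net mixing parameter: l1_ratio=0.0 is pure L2 (the default behavior), l1_ratio=1.0 is pure L1, and intermediate values blend the two. The example uses l1_ratio=0.0 to request L2 regularization without triggering a deprecation warning. In older releases, l1_ratio is only consulted when penalty="elasticnet", so the same call still fits an L2 model there but may warn that the argument is unused; check the documentation for your installed version.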

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# l1_ratio=0.0 requests pure L2 regularization (penalty= is deprecated)
clf = LogisticRegression(C=1.0, l1_ratio=0.0, solver="lbfgs", max_iter=1000)
clf.fit(X_tr, y_tr)

probs = clf.predict_proba(X_te)[:, 1]   # P(y=1) per sample
print("accuracy:", clf.score(X_te, y_te))
print("intercept:", clf.intercept_[0])
print("first 3 coefs:", clf.coef_[0][:3])
```

Fit a logistic regression and inspect probabilities and coefficients.

predict_proba returns per-class probability estimates.
The C parameter is the inverse of regularization strength: smaller C means stronger regularization.
l1_ratio=0.0 selects pure L2; l1_ratio=1.0 selects pure L1; values in between give elastic net.
Scaling features (for example with StandardScaler) helps the solver converge and keeps regularization fair across features.
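
Two of the points above can be checked directly, as a rough sketch rather than a benchmark. The snippet assumes NumPy and scikit-learn are installed, relies on the estimator's default (L2) regularization, and standardizes the whole dataset without a train/test split purely for brevity. It confirms that the positive-class probability is the sigmoid of `decision_function`, and that smaller values of C shrink the coefficient vector.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)   # no split here: illustration only

# predict_proba for the positive class is the sigmoid of decision_function
clf = LogisticRegression(max_iter=5000).fit(X, y)
z = clf.decision_function(X)            # linear scores w·x + b
p_manual = 1.0 / (1.0 + np.exp(-z))
print(np.allclose(p_manual, clf.predict_proba(X)[:, 1]))

# Smaller C means stronger regularization, hence a smaller coefficient norm
norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(C, round(norms[C], 3))
```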

When to use it

Logistic regression is a strong baseline for any binary or multiclass classification problem with tabular features. It is fast to train, interpretable, and produces probability estimates that are often reasonably calibrated, which matters when downstream decisions depend on confidence rather than only the predicted class. When tightly calibrated probabilities are essential, wrap the model in a calibration step such as CalibratedClassifierCV.

Its main limitation is that the decision boundary is linear in the input features. Nonlinear relationships require manual feature engineering, polynomial or interaction terms, or a different model family such as gradient boosting or neural networks.

Excellent interpretable baseline before reaching for complex models.
Handles high-dimensional sparse data well, especially with L1 regularization.
Struggles with nonlinear boundaries unless features are transformed.

Key takeaways

Logistic regression predicts class probabilities by applying the sigmoid to a linear score w·x + b.
It is trained by minimizing log loss, a convex objective with no closed-form solution.
Coefficients are interpretable as changes in log-odds, and exponentiating them gives odds ratios.
The decision boundary is linear, so nonlinear problems need feature engineering or other models.
scikit-learn's LogisticRegression exposes predict_proba and supports L1/L2 regularization via C and l1_ratio.

Frequently asked questions

Is logistic regression a regression or a classification algorithm?

It is a classification algorithm. The name comes from the logistic (sigmoid) function and its roots in regression theory, but it predicts class probabilities and assigns labels rather than predicting continuous values.

Why does logistic regression use the sigmoid function?

The sigmoid maps any real-valued linear score into the range (0, 1), so the output can be read as a probability. It also makes the log-odds of the prediction a linear function of the input features.

What loss function does logistic regression minimize?

It minimizes log loss, also called binary cross-entropy. This loss is convex in the model parameters, so gradient-based solvers reliably reach the global optimum.

How does logistic regression handle more than two classes?

Two ways: multinomial (softmax) logistic regression, which models all classes jointly, or one-vs-rest, which trains a separate binary classifier per class. scikit-learn supports both.

Do features need to be scaled for logistic regression?

Scaling is not strictly required for correctness, but it helps iterative solvers converge faster and makes regularization penalize each feature fairly. StandardScaler is a common choice.
