Data Drift (and Concept Drift)

Data drift is a change in the statistical distribution of a model's input features over time, while concept drift is a change in the relationship between inputs and the target. Both degrade a deployed model's accuracy and must be monitored in production.

What is Data Drift?

Data drift is a change over time in the statistical distribution of the input data a model sees in production compared with the data it was trained on. The features still mean the same thing, but their distribution shifts: a new user segment appears, a sensor recalibrates, prices inflate, or seasonal behavior changes. Because the model learned patterns on the old distribution, its predictions can become less reliable even though the code and weights are unchanged.

Data drift is distinct from concept drift. With data drift the input distribution P(X) changes; with concept drift the conditional relationship P(Y given X) changes, meaning the same inputs now map to different correct answers. A fraud model can suffer concept drift when fraudsters change tactics, so transaction patterns that once indicated legitimate activity now indicate fraud.

Data drift: the distribution of inputs P(X) shifts over time.
Concept drift: the input-to-target relationship P(Y given X) shifts over time.
Both can occur together and both degrade accuracy without any change to the model.
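The distinction can be illustrated with a small synthetic sketch. Everything here is hypothetical and chosen only for illustration: a frozen "model" that predicts positive when x > 1, a ground-truth rule based on |x|, and uniform input ranges. In the first drifted case only P(X) moves; in the second only the labeling rule moves.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def model(x):
    """Frozen 'deployed' model: it learned x > 1 from inputs in [0, 2]."""
    return int(x > 1.0)

def true_label(x, cutoff=1.0):
    """Hypothetical ground-truth rule: positive when |x| exceeds the cutoff."""
    return int(abs(x) > cutoff)

def accuracy(xs, cutoff=1.0):
    """Share of inputs where the frozen model agrees with the ground truth."""
    return sum(model(x) == true_label(x, cutoff) for x in xs) / len(xs)

n = 10_000
reference_x = [random.uniform(0.0, 2.0) for _ in range(n)]   # training-like P(X)
shifted_x = [random.uniform(-2.0, 2.0) for _ in range(n)]    # P(X) has moved

acc_reference = accuracy(reference_x)                  # same P(X), same rule
acc_data_drift = accuracy(shifted_x)                   # new P(X), same rule
acc_concept_drift = accuracy(reference_x, cutoff=1.5)  # same P(X), new rule

print(f"reference:     {acc_reference:.3f}")
print(f"data drift:    {acc_data_drift:.3f}")
print(f"concept drift: {acc_concept_drift:.3f}")
```

By construction, about a quarter of inputs are misclassified in each drifted case even though the model itself never changes: under data drift the errors come from a region the model never saw (x < -1), and under concept drift they come from a region where the correct answer changed (1 < x <= 1.5).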

Types of drift and why they happen

Covariate shift is the most common form of data drift, where feature distributions move but the underlying labeling rule stays fixed. Label shift (prior probability shift) is a change in the class frequencies of the target, P(Y). Concept drift can be sudden, gradual, incremental, or recurring (for example seasonal patterns that return each year).

A related but separate problem is training-serving skew, a discrepancy between how features are computed in the training and serving pipelines. Skew is not time-based drift; it is an engineering mismatch, such as a feature transformed one way in the training pipeline and another way in the live service. Google's Rules of Machine Learning notes that scoring the same example in both pipelines should give exactly the same result, so a difference probably indicates an engineering error. Skew produces drift-like symptoms but is fixed in code, not by retraining.

Covariate shift: P(X) changes, labeling rule unchanged.
Label shift: the class prior P(Y) changes.
Training-serving skew: features computed differently in the training vs serving pipelines, an engineering bug, not temporal drift.
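One way to check for training-serving skew is to run the same raw examples through both feature pipelines and compare the results. The sketch below assumes two hypothetical transform functions, `training_transform` and `serving_transform`, with a deliberately planted bug (natural log versus base-10 log); the function names, the `amount` field, and the tolerance are illustrative assumptions, not part of any real pipeline.

```python
import math

def training_transform(raw):
    """Hypothetical training-pipeline feature: natural log of (amount + 1)."""
    return math.log(raw["amount"] + 1.0)

def serving_transform(raw):
    """Hypothetical serving-pipeline feature with a planted bug: base-10 log."""
    return math.log10(raw["amount"] + 1.0)

def find_skew(examples, train_fn, serve_fn, tol=1e-9):
    """Return the examples whose feature value differs between the two pipelines."""
    return [
        ex for ex in examples
        if not math.isclose(train_fn(ex), serve_fn(ex), rel_tol=0.0, abs_tol=tol)
    ]

examples = [{"amount": 0.0}, {"amount": 9.0}, {"amount": 99.0}]
skewed = find_skew(examples, training_transform, serving_transform)

for ex in skewed:
    print("skew on", ex, training_transform(ex), serving_transform(ex))
```

Note that the first example (amount 0) passes because both logs are zero there, which is why such a check is more useful over a varied sample than over a single record.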

How to detect drift

Detection compares a reference window (often the training or a known-good period) against a current production window. For numeric features, a two-sample test such as the Kolmogorov-Smirnov test or a distance like Wasserstein flags distribution change. For categorical features, the chi-squared test or population stability index (PSI) is common. Concept drift is detected by monitoring model quality metrics over time when ground-truth labels eventually arrive, or by proxy signals like prediction confidence when labels are delayed.

A widely used summary metric is the population stability index, which compares the binned distribution of a variable between reference and current sets. A common rule of thumb treats PSI below 0.1 as no significant shift, 0.1 to 0.25 as moderate, and above 0.25 as major.

PSI = Σᵢ (actualᵢ − expectedᵢ) · ln(actualᵢ / expectedᵢ)

Population stability index sums over bins i, where actual and expected are the proportion of records in each bin for the current and reference sets. Higher PSI means more drift; above about 0.25 signals a major shift.
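As a minimal sketch, the formula above could be computed with NumPy as follows. The text does not prescribe a binning scheme, so two choices here are assumptions: bin edges are taken from deciles of the reference sample, and bin proportions are floored at a small epsilon to avoid taking the log of zero. The function name and the synthetic data are also illustrative.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample (expected) and a current sample (actual)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Assumption: bin edges are quantiles of the reference sample.
    edges = np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1))
    inner = edges[1:-1]  # interior edges only, so the outer bins are open-ended

    # Assign each value to a bin index in 0..n_bins-1 and count per bin.
    exp_counts = np.bincount(np.searchsorted(inner, expected, side="right"),
                             minlength=n_bins)
    act_counts = np.bincount(np.searchsorted(inner, actual, side="right"),
                             minlength=n_bins)

    # Convert counts to proportions; floor at eps so the log is always defined.
    exp_pct = np.clip(exp_counts / exp_counts.sum(), eps, None)
    act_pct = np.clip(act_counts / act_counts.sum(), eps, None)

    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5000)
same_dist = rng.normal(0.0, 1.0, size=5000)   # no drift
shifted = rng.normal(1.0, 1.0, size=5000)     # mean moved by one standard deviation

psi_same = population_stability_index(reference, same_dist)
psi_shifted = population_stability_index(reference, shifted)
print(f"PSI (same distribution): {psi_same:.4f}")
print(f"PSI (shifted mean):      {psi_shifted:.4f}")
```

PSI values depend on the number of bins and on how edges are chosen, so the rule-of-thumb thresholds are best applied with a binning scheme that stays fixed across comparisons.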

python

from scipy.stats import ks_2samp
import numpy as np

# reference = training-time feature values, current = recent production values
reference = np.load("reference_feature.npy")
current = np.load("current_feature.npy")

stat, p_value = ks_2samp(reference, current)

if p_value < 0.05:
    print(f"Drift detected (KS={stat:.3f}, p={p_value:.4f}) -> investigate / retrain")
else:
    print(f"No significant drift (KS={stat:.3f}, p={p_value:.4f})")

Detecting numeric feature drift with a Kolmogorov-Smirnov test in SciPy.

Kolmogorov-Smirnov and Wasserstein distance for numeric feature drift.
Chi-squared test and population stability index for categorical drift.
Track accuracy, AUC, or error directly once labels arrive to catch concept drift.
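For categorical features, a chi-squared test on category counts can play the role that the KS test plays for numeric features. The following is a sketch under assumptions: the helper name `categorical_drift`, the payment-method categories, and the 0.05 significance level are all illustrative. It uses `scipy.stats.chi2_contingency` on a 2 x K table of reference and current counts.

```python
from collections import Counter

import numpy as np
from scipy.stats import chi2_contingency

def categorical_drift(reference, current, alpha=0.05):
    """Chi-squared test of whether category frequencies differ between windows."""
    # Use the union of categories so both rows share the same columns.
    categories = sorted(set(reference) | set(current))
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    table = np.array([
        [ref_counts[c] for c in categories],
        [cur_counts[c] for c in categories],
    ])
    stat, p_value, _dof, _expected = chi2_contingency(table)
    return float(stat), float(p_value), bool(p_value < alpha)

# Hypothetical payment-method feature: the 'wallet' share grows in production.
reference = ["card"] * 700 + ["bank"] * 250 + ["wallet"] * 50
current = ["card"] * 500 + ["bank"] * 250 + ["wallet"] * 250

stat, p_value, drifted = categorical_drift(reference, current)
print(f"chi2={stat:.1f}, p={p_value:.3g}, drift={drifted}")
```

As with any significance test, very large windows can make small, harmless differences look significant, so it may help to pair the p-value with an effect-size measure such as PSI.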

Responding to drift

Not every detected drift requires action. The decision should hinge on whether model performance has actually degraded for the business metric that matters. Cheap responses include alerting and investigating root cause; heavier responses include retraining on recent data, adding the new segment to the training set, or rolling back to a safer model. Scheduled retraining and continuous evaluation pipelines turn drift handling into routine operations rather than emergencies.

Monitoring should cover inputs, predictions, and outcomes. Google's production ML guidance recommends monitoring the freshness and distribution of input features, the distribution of predictions, and downstream metrics, because drift can appear in any of these layers before it shows up as a visible accuracy drop.

Alert on drift, but gate retraining on real performance loss to avoid churn.
Retrain on recent labeled data or expand the training set to cover new segments.
Monitor inputs, predictions, and outcomes together, since drift can surface in any layer.
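The "alert on drift, gate retraining on performance loss" policy above can be sketched as a small decision function. The function name, the returned labels, and the 5% relative-drop threshold are hypothetical choices for illustration; the PSI alert level of 0.25 follows the rule of thumb quoted earlier.

```python
def drift_action(psi, baseline_metric, current_metric,
                 psi_alert=0.25, max_relative_drop=0.05):
    """Return 'ok', 'alert', or 'retrain' for one monitoring window.

    psi:             input drift score for the window
    baseline_metric: quality metric (e.g. accuracy) on the reference period
    current_metric:  same metric on the current window, once labels have arrived
    """
    drifted = psi > psi_alert
    # Assumption: a relative drop larger than max_relative_drop counts as real loss.
    degraded = current_metric < baseline_metric * (1.0 - max_relative_drop)

    if degraded:
        return "retrain"   # real performance loss: heavier response is justified
    if drifted:
        return "alert"     # inputs moved but the metric held: investigate only
    return "ok"

print(drift_action(psi=0.30, baseline_metric=0.90, current_metric=0.89))
print(drift_action(psi=0.30, baseline_metric=0.90, current_metric=0.80))
print(drift_action(psi=0.05, baseline_metric=0.90, current_metric=0.90))
```

In this sketch a metric drop triggers the heavier response even when input drift is low, since that combination is what concept drift without covariate shift would look like.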

Key takeaways

Data drift changes input distributions P(X); concept drift changes the input-to-target relationship P(Y given X).
Both degrade a deployed model over time even though its code and weights never change.
Detect drift by comparing a reference window to a current window using KS tests, chi-squared, PSI, or Wasserstein distance.
Training-serving skew looks like drift but is an engineering mismatch fixed in code, not by retraining.
Act on drift based on real performance loss, and monitor inputs, predictions, and outcomes together.

Frequently asked questions

What is the difference between data drift and concept drift?

Data drift is a change in the distribution of input features P(X) over time. Concept drift is a change in the relationship between inputs and the target P(Y given X), so the same inputs map to different correct answers. Both reduce a deployed model's accuracy.

How do you detect drift?

Compare a reference window to a recent production window. Use Kolmogorov-Smirnov or Wasserstein distance for numeric features, chi-squared or population stability index for categorical features, and track model accuracy directly once ground-truth labels arrive.

What PSI value indicates drift?

A common rule of thumb treats a population stability index below 0.1 as no significant shift, 0.1 to 0.25 as moderate change worth watching, and above 0.25 as a major shift that usually warrants investigation or retraining.

What is training-serving skew?

Training-serving skew is a discrepancy between how features are computed in the training and serving pipelines, such as a different transformation in each. It mimics drift symptoms but is an engineering bug fixed in code, not a time-based distribution change.

Should you retrain every time drift is detected?

No. Many drifts do not hurt the metric that matters. Best practice is to alert and investigate, then gate retraining on actual performance degradation, since unnecessary retraining adds cost and can introduce regressions.

What causes concept drift?

Concept drift happens when the real-world relationship a model captures changes: user preferences shift, fraud tactics evolve, or external conditions move. It can be sudden, gradual, incremental, or recurring, such as seasonal patterns that return each year.
