Precision, Recall, and F1 Score

By Arpit Tripathi, Founder

Precision is the fraction of predicted positives that are correct, recall is the fraction of actual positives that are found, and F1 is their harmonic mean. They evaluate classifiers, especially on imbalanced data where accuracy misleads.

What are Precision, Recall, and F1 Score?

Precision, recall, and F1 score are classification metrics computed from the counts of true positives, false positives, and false negatives. Precision measures how many of the items the model labeled positive are actually positive. Recall measures how many of the truly positive items the model managed to find. F1 score is the harmonic mean of precision and recall, giving a single number that balances both.

These metrics matter most on imbalanced datasets, where one class is rare. Plain accuracy can look excellent simply by predicting the majority class for everything, while precision and recall expose whether the model actually identifies the minority class of interest.

  • Precision answers: of the positive predictions, how many were right?
  • Recall answers: of the actual positives, how many did we catch?
  • F1 combines both into one score, penalizing large gaps between them.
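As a minimal sketch of these definitions, the counts can be tallied directly from two label lists. The labels below are made up for illustration, and 1 is assumed to be the positive class.

```python
# Toy labels, invented for illustration; 1 is assumed to be the positive class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted positive, truly positive
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted positive, actually negative
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # actual positives the model missed

precision = tp / (tp + fp)  # of the positive predictions, how many were right?
recall = tp / (tp + fn)     # of the actual positives, how many did we catch?

print(f"TP={tp} FP={fp} FN={fn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

True negatives never enter either ratio, which is why adding more easy negatives to the data leaves both numbers unchanged.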

The formulas

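All three metrics derive from three cells of the confusion matrix: true positives (TP), false positives (FP), and false negatives (FN). True negatives (TN) appear in none of them, which is why a large pool of easy negatives cannot inflate these scores the way it inflates accuracy. Precision divides correct positive predictions by all positive predictions. Recall (also called sensitivity or true positive rate) divides correct positive predictions by all actual positives.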

F1 is the harmonic mean rather than the arithmetic mean. The harmonic mean is dominated by the smaller of the two values, so F1 is only high when precision and recall are both high.

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Precision is correctness of positive predictions; recall is coverage of actual positives.

F1 = 2 · (Precision · Recall) / (Precision + Recall)
F1 is the harmonic mean of precision and recall.

Fβ = (1 + β²) · (Precision · Recall) / (β² · Precision + Recall)
The F-beta score; β > 1 weights recall more, β < 1 weights precision more.

  • Precision and recall each range from 0 to 1, higher being better.
  • The harmonic mean makes F1 conservative: one weak component drags it down.
  • The generalized F-beta score weights recall more heavily (beta > 1) or precision more heavily (beta < 1).
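The formulas above can be sketched in plain Python to show how the harmonic mean and the beta weighting behave. The counts are hypothetical, and the helper names precision_recall and f_beta are introduced here only for illustration.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from raw counts; assumes both denominators are nonzero."""
    return tp / (tp + fp), tp / (tp + fn)


def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives F1, the harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: a precise model (0.75) that misses most positives (recall 1/3).
p, r = precision_recall(tp=30, fp=10, fn=60)

arithmetic_mean = (p + r) / 2
f1 = f_beta(p, r)                # pulled toward the weaker component
f2 = f_beta(p, r, beta=2.0)      # weights recall more, so it drops further here
f_half = f_beta(p, r, beta=0.5)  # weights precision more, so it rises

print(f"precision={p:.3f} recall={r:.3f}")
print(f"arithmetic mean={arithmetic_mean:.3f} F1={f1:.3f}")
print(f"F2={f2:.3f} F0.5={f_half:.3f}")
```

With these counts F1 falls below the arithmetic mean, and the ordering F2 < F1 < F0.5 follows from recall being the weaker of the two components.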

When to optimize each

The right metric depends on the cost of each error type. When false positives are expensive, optimize precision. A spam filter that marks a real email as spam (a false positive) is worse than letting one spam through, so high precision is preferred. When false negatives are expensive, optimize recall. A cancer screening test that misses a real case (a false negative) is far worse than a false alarm, so high recall is preferred.

Precision and recall trade off against each other through the decision threshold. Lowering the threshold flags more items as positive, raising recall but usually lowering precision. F1 is the default choice when both error types carry comparable cost and a single balanced score is needed.

  • Optimize precision when false alarms are costly (spam, fraud flags shown to users).
  • Optimize recall when missed positives are costly (disease screening, safety alerts).
  • Use F1 when you need one balanced number and classes are imbalanced.
  • Adjust the classification threshold to move along the precision-recall curve.
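The threshold trade-off can be illustrated with a small sketch. The scores and labels are hypothetical, standing in for a model's predicted probability of the positive class, and precision_recall_at is a helper defined here for illustration only.

```python
# Hypothetical model scores (probability of the positive class) and true labels.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.35, 0.40, 0.55, 0.60, 0.80, 0.30, 0.95]


def precision_recall_at(threshold):
    """Precision and recall when every score >= threshold is flagged positive."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Convention assumed here: precision is 0.0 when nothing is flagged positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn)
    return precision, recall


# Lowering the threshold flags more items as positive.
for threshold in (0.7, 0.5, 0.3):
    precision, recall = precision_recall_at(threshold)
    print(f"threshold={threshold:.1f} precision={precision:.3f} recall={recall:.3f}")
```

On this toy data recall rises from 0.5 to 1.0 as the threshold drops, while precision falls from 1.0 to about 0.57. For real models, scikit-learn's precision_recall_curve returns these pairs for every distinct score rather than a hand-picked few.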

Computing these metrics in scikit-learn

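scikit-learn provides precision_score, recall_score, and f1_score, plus classification_report for a per-class summary. For multiclass problems, the average argument controls aggregation: macro takes the unweighted mean of the per-class scores, weighted averages them by each class's support (its number of true instances), and micro pools true positives, false positives, and false negatives across all classes before computing a single score.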

```python
from sklearn.metrics import (precision_score, recall_score,
                             f1_score, classification_report)

# Toy binary labels; 1 is the positive class (the default pos_label).
y_true = [0, 1, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# Here TP=3, FP=1, FN=1, so all three scores come out to 0.75.
print("precision:", precision_score(y_true, y_pred))  # TP/(TP+FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP/(TP+FN)
print("f1:       ", f1_score(y_true, y_pred))

# Per-class breakdown for multiclass or imbalanced data
print(classification_report(y_true, y_pred, digits=3))
```
Compute precision, recall, and F1 with scikit-learn.
  • classification_report gives precision, recall, and F1 for each class at once.
  • average='macro' is appropriate when every class matters equally regardless of size.
  • average='weighted' reflects class support, useful for imbalanced data summaries.
  • zero_division controls behavior when a class has no predicted or actual positives.
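To make the averaging modes concrete, here is a small sketch on toy three-class labels, invented for illustration, in which the rare class 2 is never predicted. It assumes a scikit-learn version that supports the zero_division argument.

```python
from sklearn.metrics import f1_score

# Toy 3-class labels, invented for illustration; class 2 is rare and never predicted.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

results = {}
for avg in ("macro", "weighted", "micro"):
    # zero_division=0 scores the undefined precision of class 2 as 0 without a warning.
    results[avg] = f1_score(y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg:>8}: {results[avg]:.3f}")
```

The macro average is the lowest of the three here because the missed rare class counts as much as the common ones. The weighted and micro averages are dominated by class 0 and hide the failure on class 2.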

Accuracy can be misleading on imbalanced data: if only 5 percent of samples are positive, a model that predicts the majority class for every input scores 95 percent accuracy while finding none of the rare positives. Precision and recall avoid this trap by focusing on the positive class.
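The accuracy trap described above can be reproduced in a few lines of plain Python. The data is synthetic: 1,000 samples with 5 percent positives, an assumption chosen to match the 95 percent figure.

```python
# Synthetic data, invented for illustration: 1,000 samples, 5 percent positive.
y_true = [1] * 50 + [0] * 950
y_pred = [0] * 1000  # a "model" that always predicts the majority class

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
# Precision is undefined here (0 / 0) because the model makes no positive predictions.
```

Accuracy reports 0.95 while recall is 0.0, so the headline number says nothing about the class of interest.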

Two summary tools help when the threshold is not fixed. The precision-recall curve and average precision (AP), a step-wise area-under-PR summary, capture performance across all thresholds and are preferred over ROC-AUC when positives are rare. Note that scikit-learn's average_precision_score uses a step-wise sum rather than trapezoidal interpolation, so it is not identical to auc(recall, precision). Reporting precision and recall together, rather than F1 alone, is good practice because F1 hides which side a model is weak on.

  • Average precision (AP) is a step-wise area-under-PR summary across thresholds.
  • Prefer the PR curve over ROC when the positive class is rare.
  • Always report precision and recall alongside F1 so it is visible which of the two is weak.
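The difference between the step-wise and trapezoidal summaries can be checked on a small example. The scores and labels below are hypothetical; the three functions are from sklearn.metrics.

```python
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Hypothetical model scores (probability of the positive class) and true labels.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.35, 0.40, 0.55, 0.60, 0.80, 0.30, 0.95]

# Precision and recall at every distinct score threshold.
precision, recall, thresholds = precision_recall_curve(y_true, scores)

ap = average_precision_score(y_true, scores)  # step-wise sum over recall increments
trapezoid = auc(recall, precision)            # trapezoidal rule on the same points

print(f"average precision: {ap:.4f}")
print(f"trapezoidal PR area: {trapezoid:.4f}")
```

The two numbers are close but not equal on this data, because the trapezoidal rule interpolates linearly between PR points while average precision does not. When comparing models, it helps to state which summary is being reported.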

Key takeaways

  • Precision is correctness of positive predictions; recall is coverage of actual positives.
  • F1 is the harmonic mean of precision and recall, high only when both are high.
  • Optimize precision when false positives are costly, recall when false negatives are costly.
  • On imbalanced data these metrics are far more informative than raw accuracy.
  • scikit-learn's classification_report gives per-class precision, recall, and F1 in one call.

Frequently asked questions

What is the difference between precision and recall?
Precision measures how many predicted positives are actually correct, TP/(TP+FP). Recall measures how many actual positives were found, TP/(TP+FN). Precision is about avoiding false alarms; recall is about not missing real positives.

Why use F1 instead of accuracy?
Accuracy can be high on imbalanced data even when the rare positive class is never detected. F1, the harmonic mean of precision and recall, focuses on the positive class and stays low if either precision or recall is poor.

When should you optimize recall?
Optimize recall when missing a positive is costly, such as disease screening or safety alerts, where a false negative is worse than a false alarm. Lowering the decision threshold typically raises recall at the expense of precision.

What is a good F1 score?
There is no universal threshold; it depends on the task and baseline. Compare F1 against a trivial baseline and domain requirements. A value near 1 means strong balanced performance, but always inspect precision and recall separately.

How do you compute F1 for multiclass problems in scikit-learn?
Use f1_score with an average argument: 'macro' averages per-class F1 equally, 'weighted' weights by class frequency, and 'micro' pools the counts across all classes. classification_report shows per-class scores plus these averages together.
