BM25 is a sparse keyword ranking function that scores documents for a query using term frequency, inverse document frequency, and document length normalization. It remains a strong baseline for full-text search.
What is BM25 (Okapi BM25)?
BM25, also called Okapi BM25, is a ranking function used in information retrieval to estimate how relevant a document is to a search query based on the query's keywords. It is a sparse, lexical method: it matches exact terms rather than semantic meaning, and it scores documents using term frequency, inverse document frequency, and a correction for document length. The name comes from Okapi, the information retrieval system at City University, London, in which the function was first implemented.
BM25 is the default ranking function in many full-text search engines, including Apache Lucene and Elasticsearch, and is widely used as the keyword half of hybrid search systems. Despite being decades old and using no neural network, it remains a competitive baseline that newer dense retrievers are routinely measured against.
- A sparse keyword ranking function, not a semantic model.
- Built from term frequency, inverse document frequency, and length normalization.
- Default scorer in many full-text engines and the lexical side of hybrid search.
The BM25 formula
For a query q containing terms t and a document D, BM25 sums a per-term score over the query terms. Each term's contribution multiplies an inverse document frequency weight by a saturating function of the term frequency in the document, adjusted for the document's length relative to the average.
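The description above corresponds to one commonly used formulation, written out here as a sketch. The notation is an assumption introduced for illustration: f(t, D) is the number of times t occurs in D, |D| is the length of D in tokens, avgdl is the average document length in the collection, N is the number of documents, and n(t) is the number of documents containing t. The IDF shown is the non-negative variant used by Lucene; other variants exist.

```latex
\mathrm{score}(D, q) = \sum_{t \in q} \mathrm{IDF}(t) \cdot
  \frac{f(t, D)\,(k_1 + 1)}
       {f(t, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

\mathrm{IDF}(t) = \ln\!\left(1 + \frac{N - n(t) + 0.5}{n(t) + 0.5}\right)
```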
What the parameters do
The parameter k₁ controls term-frequency saturation. Because tf appears in both the numerator and denominator, the contribution of repeated occurrences of a term levels off rather than growing without bound, so the tenth occurrence adds far less than the second. Larger k₁ makes saturation slower; common default values are around 1.2 to 2.0, with 1.2 widely used.
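The saturation effect can be seen with a few numbers. The following is a minimal sketch, assuming the common form tf·(k₁ + 1) / (tf + k₁·norm) for the term-frequency component; the function name `tf_component` and the `length_norm` argument are hypothetical and chosen only for this illustration.

```python
def tf_component(tf, k1=1.2, length_norm=1.0):
    """Saturating term-frequency factor (assumed common BM25 form).

    length_norm = 1.0 corresponds to a document of exactly average length.
    """
    return tf * (k1 + 1) / (tf + k1 * length_norm)


# Gain from each additional occurrence shrinks as tf grows.
gain_second = tf_component(2) - tf_component(1)
gain_tenth = tf_component(10) - tf_component(9)

for tf in (1, 2, 5, 10, 100):
    print(tf, round(tf_component(tf), 3))
```

However large tf becomes, the factor stays below k₁ + 1, which is why a document cannot win simply by repeating a query term.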
The parameter b controls document length normalization and ranges from 0 to 1. At b equal to 1, term frequency is fully normalized by document length relative to the average, so documents longer than average are penalized and shorter ones are favored; at b equal to 0, length is ignored entirely. A common default is 0.75.
The IDF term gives rare query words more weight than frequent ones, which prevents stopword-like terms from dominating the score.
- k₁ controls how quickly term frequency saturates; a typical default is 1.2.
- b controls length normalization from 0 (off) to 1 (full); a typical default is 0.75.
- IDF down-weights common terms and boosts rare, discriminating ones.
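To show how k₁, b, and IDF fit together, here is a minimal sketch of a BM25 scorer over pre-tokenized documents. It assumes whitespace tokenization and the non-negative IDF variant ln(1 + (N − n + 0.5) / (n + 0.5)); the function name `bm25_scores` and the sample documents are made up for illustration, and production engines differ in details.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document (each document is a list of tokens)."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: 1.0 for an average-length document when b > 0.
        length_norm = 1 - b + b * len(d) / avgdl
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * length_norm)
        scores.append(score)
    return scores


# Illustrative toy collection.
docs = [
    "the okapi system ranks documents".split(),
    "the cat sat on the mat".split(),
    "bm25 ranks documents by keyword overlap".split(),
]
scores = bm25_scores(["bm25", "documents"], docs)
```

In this toy collection the third document should rank first because it contains both query terms, including the rarer one, while the second document shares no terms with the query and scores zero.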
Strengths, limits, and hybrid search
BM25 is fast, interpretable, requires no training, and works well for queries with specific keywords, exact identifiers, names, codes, and rare terms that dense embeddings can blur. Its main limitation is vocabulary mismatch: it cannot match a query to a document that expresses the same idea with different words, because it relies on lexical overlap.
Modern retrieval systems often combine BM25 with a dense semantic retriever in a hybrid setup, then merge the two ranked lists, commonly with reciprocal rank fusion. The keyword side catches exact matches and rare terms while the semantic side catches paraphrases, and together they typically outperform either method alone.
- Fast, training-free, interpretable, and strong on exact and rare terms.
- Weak on vocabulary mismatch and paraphrased queries.
- Frequently paired with dense retrieval in hybrid search via rank fusion.
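Reciprocal rank fusion, mentioned above, can be sketched in a few lines. This is an illustrative sketch rather than any particular engine's implementation: each document receives 1 / (k + rank) from every list it appears in, and the sums are sorted. The constant k = 60 is a conventional choice assumed here, and the document IDs and function name are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant (60 is a conventional value, assumed here).
    """
    fused_scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused_scores[doc_id] = fused_scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused_scores, key=fused_scores.get, reverse=True)


# Hypothetical outputs of a BM25 retriever and a dense retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because the fusion uses only ranks, the BM25 scores and the dense similarity scores never need to be put on a common scale.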
Key takeaways
- BM25 is a sparse keyword ranking function combining term frequency, inverse document frequency, and length normalization.
- Term frequency saturates, so repeated occurrences of a word give diminishing returns, controlled by k₁.
- The b parameter sets how strongly document length is normalized, typically 0.75.
- IDF gives rare, discriminating terms more weight than common ones.
- BM25 remains a strong baseline and serves as the lexical half of many hybrid search systems.