BM25 scores how well a document matches a query.

IDF (Inverse Document Frequency)

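Measures how rare, and therefore how informative, a word is across the corpus.

$$\text{IDF}(t) = \ln\left(\frac{N - n + 0.5}{n + 0.5}\right)$$

Where:

  • $N$ = total number of documents in the corpus
  • $n$ = number of documents containing the word

Example:
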
  • Word “the” appears in 1,000,000 out of 1,000,000 docs → IDF ≈ -14.5 (useless, penalized)
  • Word “transformer” appears in 50 out of 1,000,000 docs → IDF ≈ 9.9 (very useful, boosted)

Common words (the, and, is) get low or negative scores. Rare, meaningful words get high scores.
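
The example values can be checked with a short Python sketch. It assumes the classic BM25 form of IDF with a natural logarithm, ln((N - n + 0.5) / (n + 0.5)); the function name `idf` is only illustrative, and some implementations use a variant that adds 1 inside the logarithm so the result never goes negative.

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    """Classic BM25 inverse document frequency (natural log)."""
    return math.log((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


# A word found in every document is penalized; a rare word is boosted.
common = idf(1_000_000, 1_000_000)
rare = idf(1_000_000, 50)
print(f"common: {common:.1f}, rare: {rare:.1f}")
```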

TF Component (Term Frequency with Saturation)

Counts how many times the word appears in the document, but with diminishing returns.

$$\text{TF}(t, d) = \frac{f \cdot (k_1 + 1)}{f + k_1}$$

Where:

  • $f$ = count of the word in the document
  • $k_1$ = saturation parameter (default 1.2)

If one document mentions “AI” once and another mentions it 100 times, the second is not 100 times more relevant: each additional occurrence adds less to the score.
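
A minimal Python sketch of the saturation effect, assuming the standard BM25 term-frequency form count × (k1 + 1) / (count + k1) and ignoring document length for now. The function name `tf_saturation` is illustrative.

```python
def tf_saturation(count: int, k1: float = 1.2) -> float:
    """Saturating term-frequency component: count * (k1 + 1) / (count + k1)."""
    return count * (k1 + 1) / (count + k1)


# Each extra occurrence adds less; the value can never exceed k1 + 1.
for count in (1, 2, 10, 100):
    print(count, round(tf_saturation(count), 2))
```

With the default k1 = 1.2, one occurrence scores 1.0 and the value approaches a ceiling of k1 + 1 = 2.2, so 100 occurrences score only a little over twice as much as one.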

Length Normalization

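Adjusts the score so that long documents don’t win simply because they contain more words.

$$\text{norm}(d) = 1 - b + b \cdot \frac{|d|}{\text{avgdl}}$$

Where:

  • $b$ = normalization strength (default 0.75)
  • $|d|$ = length of the document (word count)
  • $\text{avgdl}$ = average document length in the corpus

This factor scales the saturation parameter in the denominator of the TF component, so a longer-than-average document needs more occurrences of a word to reach the same score.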

A long document naturally contains more word matches, but a short article specifically about a topic is usually more relevant than a textbook that mentions it in passing.
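
To see the effect of length normalization, here is a hedged Python sketch that scores one query term against one document using the standard BM25 formula: IDF times a saturating term frequency, with the saturation parameter scaled by 1 - b + b × (document length / average length). The function name `bm25_term_score` and the document sizes are hypothetical, chosen only for illustration.

```python
import math


def bm25_term_score(
    count: int,
    doc_len: int,
    avg_doc_len: float,
    total_docs: int,
    docs_with_term: int,
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """Score one query term against one document with standard BM25."""
    idf = math.log((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Length normalization: above 1 for longer-than-average documents,
    # below 1 for shorter ones.
    norm = 1 - b + b * doc_len / avg_doc_len
    return idf * count * (k1 + 1) / (count + k1 * norm)


# Same term count (3) in a short article and in a long textbook.
short_doc = bm25_term_score(3, doc_len=200, avg_doc_len=1_000,
                            total_docs=1_000_000, docs_with_term=50)
long_doc = bm25_term_score(3, doc_len=50_000, avg_doc_len=1_000,
                           total_docs=1_000_000, docs_with_term=50)
print(f"short: {short_doc:.2f}, long: {long_doc:.2f}")
```

The short article scores far higher than the textbook even though both contain the word three times. A full BM25 score for a multi-word query is the sum of this per-term score over every query term.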