BM25

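BM25 is a ranking function that scores how well a document $D$ matches a query $Q$. The score is a sum over the query terms $q$, so each term contributes independently.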

$\text{BM25}(D, Q) = \sum_{q \in Q} \text{IDF}(q) \cdot \frac{\text{TF}(q, D) \cdot (k_1 + 1)}{\text{TF}(q, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$
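As a rough illustration, the formula can be translated almost term by term into Python. This is a minimal sketch, not a production implementation: the function name `bm25_score`, the whitespace tokenization, and the four-document toy corpus are assumptions made only for this example.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query (hypothetical helper)."""
    n_docs = len(corpus)                              # N
    avgdl = sum(len(d) for d in corpus) / n_docs      # average document length
    tf = Counter(doc_tokens)                          # TF(q, D) for every term in D
    norm = 1 - b + b * len(doc_tokens) / avgdl        # length normalization factor
    score = 0.0
    for q in query_terms:
        n_q = sum(1 for d in corpus if q in d)        # documents containing q
        idf = math.log((n_docs - n_q + 0.5) / (n_q + 0.5))
        score += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * norm)
    return score

# Toy corpus, invented for illustration only.
corpus = [
    "the transformer model uses attention".split(),
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the weather is nice today".split(),
]
scores = [bm25_score(["transformer"], doc, corpus) for doc in corpus]
print(scores)
```

Only the first document contains the query term, so it is the only one with a non-zero score. Recomputing `n_q` and `avgdl` on every call keeps the sketch short; a real index would precompute them.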

IDF (Inverse Document Frequency)

Measures how rare/important a word is.

$\text{IDF}(q) = \log\left(\frac{N - n_q + 0.5}{n_q + 0.5}\right)$

Where:

$N$ = total documents in corpus
$n_q$ = documents containing word $q$

Example (using the natural log):

Word “the” appears in 1,000,000 out of 1,000,000 docs → IDF ≈ -14.5 (useless, penalized)
Word “transformer” appears in 50 out of 1,000,000 docs → IDF ≈ 9.9 (very useful, boosted)

Common words (the, and, is) get low or negative scores. Rare, meaningful words get high scores.
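The two example words can be checked numerically. The sketch below assumes the natural logarithm and uses a hypothetical helper named `idf`; the document counts are the ones from the example.

```python
import math

def idf(n_docs, n_q):
    """IDF as defined above: log((N - n_q + 0.5) / (n_q + 0.5)), natural log."""
    return math.log((n_docs - n_q + 0.5) / (n_q + 0.5))

idf_the = idf(1_000_000, 1_000_000)      # word appears in every document
idf_transformer = idf(1_000_000, 50)     # word appears in 50 documents

print(round(idf_the, 2))          # -14.51
print(round(idf_transformer, 2))  # 9.89
```

The value turns negative as soon as a word appears in more than half of the documents, because the numerator $N - n_q + 0.5$ then falls below the denominator.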

TF Component (Term Frequency with Saturation)

Counts how many times the word appears in the document, but with diminishing returns.

$\frac{\text{TF}(q, D) \cdot (k_1 + 1)}{\text{TF}(q, D) + k_1}$

Where:

$\text{TF}(q, D)$ = count of word $q$ in document $D$
$k_1$ = saturation parameter (default 1.2)

If a doc mentions “AI” 1 time vs 100 times, the 100-time doc isn’t 100x more relevant. Repetition has diminishing returns: as the count grows, this component approaches $k_1 + 1$ but never exceeds it.
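A quick calculation makes the saturation concrete. This sketch uses a hypothetical helper `tf_saturation` and the default $k_1 = 1.2$, and it leaves out length normalization, as the simplified fraction above does.

```python
def tf_saturation(tf, k1=1.2):
    """TF component without length normalization: tf * (k1 + 1) / (tf + k1)."""
    return tf * (k1 + 1) / (tf + k1)

once = tf_saturation(1)       # 2.2 / 2.2
hundred = tf_saturation(100)  # 220 / 101.2

print(round(once, 2))     # 1.0
print(round(hundred, 2))  # 2.17
```

A hundredfold increase in the count raises the component by a factor of roughly 2.2, and no count can push it past the ceiling of $k_1 + 1 = 2.2$.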

Length Normalization

Adjusts for document length so that long documents don’t win simply because they contain more words.

$1 - b + b \cdot \frac{|D|}{\text{avgdl}}$

Where:

$b$ = normalization strength (default 0.75)
$|D|$ = length of document (word count)
$\text{avgdl}$ = average document length in corpus

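A long document naturally has more word matches. But a short article specifically about a topic is usually more relevant than a passing mention in a textbook. The factor is greater than 1 for documents longer than average, which enlarges the denominator and lowers their score, and less than 1 for shorter ones.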
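The effect can be seen by holding the term count fixed and changing only the document length. The sketch below is illustrative: the helper names `length_norm` and `tf_component` and all of the lengths (an average of 1,000 words, a 200-word article, a 50,000-word textbook) are assumptions chosen for the example.

```python
def length_norm(doc_len, avgdl, b=0.75):
    """Length normalization factor: 1 - b + b * |D| / avgdl."""
    return 1 - b + b * doc_len / avgdl

def tf_component(tf, doc_len, avgdl, k1=1.2, b=0.75):
    """TF component of BM25 including length normalization."""
    return tf * (k1 + 1) / (tf + k1 * length_norm(doc_len, avgdl, b))

avgdl = 1000  # hypothetical average document length in words

# Same term count (5) in a short article and in a long textbook.
short_article = tf_component(tf=5, doc_len=200, avgdl=avgdl)
long_textbook = tf_component(tf=5, doc_len=50_000, avgdl=avgdl)

print(round(short_article, 2))  # 2.01
print(round(long_textbook, 2))  # 0.22
```

With the same five mentions, the short article scores about nine times higher on this component. Setting `b=0` makes the factor 1 for every document, which turns length normalization off.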

Aayush's ML & AI Notes
