When BM25 Scores Disagree: A Corpus-Independent Alternative

Session Abstract

In distributed search, BM25 returns different results across nodes because IDF and average document length vary with each node’s corpus state. StableTfl replaces these with a term-length rarity heuristic, eliminating all corpus dependency. On 22 BEIR datasets, it retains ~90% of BM25’s NDCG@10 while guaranteeing identical rankings across nodes.

Session Description

BM25 relies on two corpus-level statistics: inverse document frequency and average document length. In a distributed search system where nodes index independently and converge only eventually, these statistics differ across nodes, so the same query produces different rankings depending on which node serves it. For retrieval pipelines that expect deterministic results, particularly RAG systems and hybrid search architectures that fuse lexical and vector scores, this inconsistency is a real production problem.
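
For reference, the standard Okapi BM25 formulation makes the two dependencies explicit; N, n(q), and avgdl are all properties of whichever node's index happens to serve the query:

$$
\text{BM25}(D,Q) \;=\; \sum_{q \in Q} \ln\!\frac{N - n(q) + 0.5}{n(q) + 0.5} \cdot \frac{f(q,D)\,(k_1 + 1)}{f(q,D) + k_1\left(1 - b + b\,\frac{|D|}{\text{avgdl}}\right)}
$$

where N is the number of documents in the local index, n(q) the number of local documents containing q, f(q,D) the term's frequency in D, and avgdl the local average document length.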

StableTfl is a drop-in BM25 replacement built as a Lucene Similarity that eliminates all corpus-level dependencies. It replaces IDF with a synthetic term-rarity function based on term character length: longer terms tend to be rarer in natural language, so character count serves as a proxy for inverse document frequency. Document length normalization is folded into the same function rather than relying on the corpus average. As a result, scoring depends only on the query term, its frequency in the document, and the document's length, so two nodes with completely different corpora always produce identical rankings.
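
A minimal standalone sketch of the scoring idea follows. It is not the project's actual Lucene Similarity; the rarity function, the tuning constants, and the fixed pivot length that stands in for the corpus average are illustrative assumptions.

```java
// Illustrative sketch of corpus-independent scoring. The real StableTfl is a
// Lucene Similarity and may use a different rarity function and constants.
public final class CorpusIndependentScorer {

    // BM25-style tuning parameters (assumed values, not from the talk).
    private static final double K1 = 1.2;
    private static final double B = 0.75;
    // Fixed pivot replacing the corpus-wide average document length (assumption).
    private static final double PIVOT_LENGTH = 256.0;

    /** Synthetic rarity: longer terms tend to be rarer, so character
     *  length stands in for inverse document frequency. */
    static double syntheticRarity(int termLength) {
        return Math.log(1.0 + termLength);
    }

    /** Score depends only on the term, its frequency in the document,
     *  and the document length; no corpus statistics are involved. */
    static double score(String term, double termFreq, double docLength) {
        double rarity = syntheticRarity(term.length());
        double lengthNorm = 1.0 - B + B * (docLength / PIVOT_LENGTH);
        double tf = (termFreq * (K1 + 1.0)) / (termFreq + K1 * lengthNorm);
        return rarity * tf;
    }

    public static void main(String[] args) {
        // Identical inputs always yield identical scores, on any node.
        System.out.println(score("retrieval", 3, 120));
        System.out.println(score("the", 3, 120));
    }
}
```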

Benchmarked against BM25Okapi on 22 BEIR datasets with identical tokenization, StableTfl retains roughly 90% of BM25’s average NDCG@10 (0.299 vs 0.331). BM25 wins on 19 datasets, but StableTfl matches or beats BM25 on argument retrieval, COVID-19 literature search, and open-domain QA — domains where term-level rarity appears to matter more than collection-specific frequency patterns. There is no additional runtime overhead compared to BM25, since term rarity values can be precomputed into a 256-entry lookup table.
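
The precomputation mentioned above is easy to sketch. The table size comes from the talk description; the rarity function and the clamping of very long terms are assumptions carried over from the example above.

```java
// Sketch of the 256-entry lookup table: rarity values are precomputed once per
// term length, so scoring adds no per-query overhead relative to BM25.
public final class RarityTable {
    private static final double[] RARITY = new double[256];
    static {
        for (int len = 0; len < RARITY.length; len++) {
            RARITY[len] = Math.log(1.0 + len);
        }
    }

    /** Term lengths beyond the table clamp to the last entry (assumption). */
    static double rarityOf(String term) {
        int len = Math.min(term.length(), RARITY.length - 1);
        return RARITY[len];
    }

    public static void main(String[] args) {
        System.out.println(rarityOf("a"));                        // short, common-looking term
        System.out.println(rarityOf("electroencephalography"));   // long, rarer term
    }
}
```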

The talk will cover: (1) why corpus statistics break consistency in distributed search, (2) how StableTfl works as a Lucene Similarity, (3) where the quality trade-off hurts most and where it’s minimal, and (4) how to evaluate whether corpus-independent scoring fits your retrieval stack — especially if you’re building hybrid or RAG pipelines where result consistency is as important as raw relevance.

Main Stage
07 May 2026
16:00 - 16:45
Talk