LLMs as Rerankers: A Case Study on Hybrid Email Search
Session Abstract
Purpose-built rerankers are faster and cheaper, but are they better? We argue LLM rerankers win on what matters most in production: instruction-following and iteration speed, with more-than-acceptable tradeoffs on cost and latency. Our discussion is backed by a case study from Superhuman’s production hybrid email search system.
Session Description
The conventional wisdom, backed by vendor benchmarks, is that purpose-built rerankers are more accurate than LLMs at ranking. We challenge this. In our experience building and maintaining production search systems, LLM rerankers deliver better search results and faster improvements to user experience, primarily because they are flexible tools that excel at following complex instructions.
This talk makes the case for LLMs as rerankers through three lenses: iterability, capability, and cost. Each lens is supported by results from the Superhuman case study.
Iterability
Day-to-day work on a production search system means triaging a list of failure cases. With a traditional reranker, fixing these means preparing data, fine-tuning, and deploying a custom model. With an LLM reranker, you edit a prompt.
This difference sounds incremental but compounds quickly. When improving results is as easy as refining an instruction, teams naturally spend more time examining their data and shipping fixes. Results improve week over week instead of quarter over quarter. In a landscape where user expectations of AI products shift constantly, this iteration speed is a decisive advantage that reranking benchmarks do not capture.
Superhuman improved search results by running fast ablation cycles. The hypothesis → config change → rerun → measure loop is much more practical when relevance logic lives in prompts rather than a model training pipeline.
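That loop is simple enough to sketch in a few lines. The harness below is illustrative only — the function names, the `recall@10` metric, and the config shape are our own assumptions, not Superhuman's actual code:

```python
# Minimal ablation harness: vary one retrieval parameter at a time,
# rerun the eval set, and compare recall. All names are illustrative.

def recall_at_k(results, relevant, k=10):
    """Fraction of relevant doc ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(results[:k]) & set(relevant)) / len(relevant)

def run_ablation(eval_set, search_fn, configs):
    """For each named config, run every eval query and average recall@10."""
    scores = {}
    for name, cfg in configs.items():
        per_query = [
            recall_at_k(search_fn(q.query, **cfg), q.relevant_ids)
            for q in eval_set
        ]
        scores[name] = sum(per_query) / len(per_query)
    return scores

# Usage: ablate retrieval depth while holding everything else fixed, e.g.
# run_ablation(eval_set, hybrid_search,
#              {"depth_50": {"top_k": 50}, "depth_200": {"top_k": 200}})
```

The point of the sketch is the shape of the loop: each hypothesis is a config dict, and a full rerun is one function call.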
Capability
A traditional reranker takes a query and a document and returns a scalar score. An LLM reranker can do that and much more. The same model pass that ranks your documents can also consolidate facts across them, flag contradictions, discard distractor segments, or annotate specific passages. Your “reranker” becomes a reasoning layer, not just a sorting function.
This flexibility extends to instruction-following. Negation instructions are a useful illustration: “ignore documents where the only relevant segment is a table of contents” is straightforward to express in a prompt but notoriously difficult for smaller instruction-following rerankers to handle reliably. The gap between LLMs and specialized rerankers on complex, nuanced instructions reflects fundamental differences in model scale, training data breadth, and the ability to leverage test-time compute.
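In prompt form, a negation rule is just another line of instructions. The sketch below shows what that looks like; the prompt wording and the JSON schema are hypothetical, not Superhuman's production prompt:

```python
# Sketch of an instruction-following LLM reranker prompt. The negation
# rule from the text is stated directly in natural language, and the
# model is asked for structured JSON so the ranking survives parsing.

import json

RERANK_PROMPT = """\
You are reranking email search results for the query: {query}

Rules:
- Rank documents by relevance to the query.
- Ignore documents where the only relevant segment is a table of contents.
- Return JSON: {{"ranking": [doc ids, best first], "dropped": [doc ids]}}

Documents:
{documents}
"""

def build_prompt(query, docs):
    """docs: list of (doc_id, snippet) pairs."""
    body = "\n".join(f"[{doc_id}] {snippet}" for doc_id, snippet in docs)
    return RERANK_PROMPT.format(query=query, documents=body)

def parse_response(raw, candidate_ids):
    """Keep only ids the model was actually shown, preserving its order."""
    allowed = set(candidate_ids)
    return [d for d in json.loads(raw)["ranking"] if d in allowed]
```

The parser's id filter is a small but important production detail: it guards against the model hallucinating document ids that were never in the candidate set.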
Superhuman’s tests show that LLM rerankers make over-retrieval safe: the retrieval stage can cast a wide net because instruction-aware filtering downstream removes the extra noise.
Cost
Given the scale difference (100B+ parameter LLMs vs. sub-4B parameter rerankers), you might expect dramatically higher costs. In practice, batch inference, sparse mixture-of-experts architectures, prompt caching, and competitive pricing dynamics have narrowed the gap considerably. The primary remaining tradeoff is latency; we’ll discuss when that tradeoff matters and when it doesn’t.
Superhuman’s results make the economics feel less abstract: the biggest quality gains came from increasing retrieval depth and rebalancing hybrid weighting. The “expensive” part was simply letting the system consider more candidates and then using an LLM to make the final call. This is often a good trade in production because compute spent on reranking scales with retrieval depth, and you can tune that knob directly based on latency budgets and observed recall/precision needs.
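The "reranking cost scales with retrieval depth" claim is easy to check with a back-of-envelope model. Every number below is a placeholder for illustration — not a real vendor price or a Superhuman measurement:

```python
# Back-of-envelope input-token cost of one LLM rerank call over `depth`
# candidates. All default values are placeholder assumptions.

def rerank_cost_per_query(depth, tokens_per_doc=200, prompt_overhead=500,
                          price_per_1m_input=0.50, cached_fraction=0.0):
    """Cost in dollars: (overhead + depth * doc size) tokens, minus
    whatever fraction prompt caching discounts to zero."""
    tokens = prompt_overhead + depth * tokens_per_doc
    billable = tokens * (1 - cached_fraction)
    return billable * price_per_1m_input / 1_000_000
```

The model makes the knob explicit: cost is linear in depth, so a latency or spend budget translates directly into a maximum candidate count you can afford to pass downstream.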
Case study: Improving recall in hybrid email search
We will share findings from Superhuman’s email search system, where systematic ablation experiments across retrieval depth, vector-keyword weighting, recency bias, and filtering strategies revealed that the largest recall gains came from loosening upstream retrieval constraints and trusting the LLM reranker to handle relevance downstream. We’ll walk through the experimental setup, the failure modes uncovered, and how the results informed changes to their production pipeline.
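For concreteness, hybrid vector–keyword weighting typically looks like a normalized score blend. Superhuman's exact fusion scheme is not specified here, so the sketch below is a common min-max variant with the two knobs the ablations vary (`alpha` for weighting, `depth` for retrieval depth) made explicit:

```python
# Illustrative hybrid fusion: min-max normalize each retriever's scores,
# then blend with a tunable weight and truncate to a retrieval depth.
# This is a generic scheme, not Superhuman's production implementation.

def minmax(scores):
    """Normalize a {doc_id: score} map into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def hybrid_merge(vector_scores, keyword_scores, alpha=0.5, depth=100):
    """Blend normalized scores; alpha=1.0 is pure vector search."""
    v, k = minmax(vector_scores), minmax(keyword_scores)
    merged = {
        d: alpha * v.get(d, 0.0) + (1 - alpha) * k.get(d, 0.0)
        for d in set(v) | set(k)
    }
    ranked = sorted(merged, key=merged.get, reverse=True)
    return ranked[:depth]  # over-retrieve; the LLM reranker filters later
```

Loosening upstream constraints then corresponds to raising `depth` (and possibly softening filters), on the theory that the reranker can discard what the looser retrieval lets through.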
What attendees will take away
This talk is structured to leave the audience with three concrete things:
1. A decision framework for when LLM rerankers are the right choice over dedicated ones, centered on how ambiguous and fast-evolving your relevance criteria are.
2. Engineering patterns for making LLM reranking production-viable, including prompt design, latency management, and output structuring.
3. Experimental evidence from a real production system that made the switch, including the methodology for running your own comparison.
