Advanced RAG: Hybrid Search + Cross-Encoder Reranking

Build a production-shape retrieval stack — dense bi-encoder plus from-scratch BM25, fused with Reciprocal Rank Fusion, then re-ordered by a BAAI cross-encoder. The exact architecture behind modern enterprise RAG.

40 min · 4 steps · 2 domains · Advanced · ncp-genl · nca-genl

What you'll learn

  1. Set up corpus + dense baseline, find a dense-failure case
  2. BM25 keyword retrieval from scratch
  3. Reciprocal Rank Fusion: combine dense + BM25
  4. Cross-encoder reranking

Prerequisites

  • Familiarity with transformer embedding models (BGE, sentence-transformers)
  • Completed a basic RAG pipeline (dense retrieval + LLM)
  • Comfortable with PyTorch tensors on CUDA

Exam domains covered

Retrieval-Augmented Generation · LLM Application Development

Skills & technologies you'll practice

This advanced-level GPU lab gives you real-world reps across:

RAG · BM25 · Reciprocal Rank Fusion · Cross-Encoder · Reranking · BGE · Hybrid Search · Retrieval

What you'll build in this hybrid retrieval + reranking lab

Dense-only retrieval breaks on rare proper nouns, acronyms, and product codes — the exact queries that show up most in enterprise RAG. Hybrid search fused with cross-encoder reranking is the architecture every serious production system converges to, and in 40 minutes you'll build it end-to-end. You'll leave with a working two-stage retrieval stack (dense bi-encoder + BM25 fused via Reciprocal Rank Fusion, then reranked by a cross-encoder), a from-scratch BM25 implementation that makes the Robertson–Spärck Jones IDF formula no longer feel like a black box, and a clear mental model of why bi-encoder vs. cross-encoder is the trade-off that governs every modern RAG system. You'll also see the score gap between the correct doc and its runner-up widen after reranking — a visceral demo of what the cross-encoder is actually buying you over cheaper ranking methods.

The technical substance is the two asymmetries that force the two-stage shape. First asymmetry: bi-encoders like BAAI/bge-small-en-v1.5 encode query and passage independently, so passage vectors can be precomputed offline and every query costs one matmul — fast enough for a million-document corpus. Cross-encoders like BAAI/bge-reranker-base concatenate (query, passage) and run full attention across both, so every query token attends to every passage token — far more accurate, completely uncacheable, and linear in candidate count. You can't run a cross-encoder over a million docs; you can't trust a bi-encoder alone on rare-term queries. Second asymmetry: dense cosine scores live in [-1, 1] while BM25 scales with IDF and document length into the 20s. RRF sidesteps this with 1 / (k + rank), k=60 — rank-based fusion that's calibration-free, monotonic, and the default in modern retrieval papers because any normalization you pick (min-max, z-score) is a brittle hyperparameter that drifts with the corpus. You'll also see the practical tuning lever: top-5 reranking gives visible precision gains at essentially zero latency cost, top-100 is where you start hurting the end-user SLA.
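The fusion step itself is tiny. A minimal sketch of RRF with k=60, assuming each retriever hands you a best-first list of doc ids; the doc ids and rankings below are toy data, not the lab's corpus:

```python
def rrf_fuse(rankings, k=60):
    """Fuse several best-first ranked lists of doc ids via Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); raw retriever scores are discarded
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense_top = ["d3", "d1", "d7"]   # toy dense bi-encoder ranking
bm25_top  = ["d9", "d3", "d1"]   # toy BM25 ranking
fused = rrf_fuse([dense_top, bm25_top])  # → ["d3", "d1", "d9", "d7"]
```

Note that "d3" wins by appearing near the top of both lists, even though neither retriever ranked it with a comparable score scale — that is the whole point of rank-based fusion.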

You should already be comfortable with transformer embeddings (BGE, sentence-transformers), have built at least one basic RAG pipeline (the rag-pipeline lab is the warm-up if not), and know your way around PyTorch CUDA tensors. The sandbox is a real NVIDIA GPU pod we provision per session with BGE, the BAAI reranker checkpoint, and dependencies preinstalled. Checks run strictly against real retrieval outcomes — L2-normalized embeddings paired with a failure-case dict, BM25 actually retrieving the rare-keyword doc the dense encoder missed, RRF producing the correct top-1, and the reranker preserving the correct doc at top-1 with a widened score gap over the runner-up.

Frequently asked questions

Why build hybrid + reranking when a strong dense encoder like BGE gets most queries right?

Because 'most' isn't good enough for enterprise RAG. BGE-small is excellent on semantic paraphrase but still fragile on rare proper nouns, acronyms, product codes, and exact-match keyword queries — the exact patterns BM25 was designed for. Hybrid via RRF costs you one extra sparse retriever (milliseconds, no GPU) and in practice reliably beats either component alone. Reranking on top adds joint query-passage attention to the top-K and tightens precision where it matters most — the ranked list you actually pass to the LLM.

Why Reciprocal Rank Fusion instead of score averaging or min-max normalization?

Because dense cosine scores and BM25 scores live in incompatible distributions — cosine is bounded in roughly [-1, 1] while BM25 scales with IDF and document length and can easily exceed 20. Any normalization you pick (min-max, z-score, linear scaling) is a brittle hyperparameter that drifts as the corpus changes. RRF throws scores away entirely and fuses ranks with 1 / (k + rank), k=60. It is calibration-free, monotonic, and what every modern retrieval paper defaults to.

Why rerank only the top-5 and not the top-100?

Cost. A cross-encoder must run the full transformer over every (query, passage) pair with no caching — latency scales linearly with the candidate count. Top-5 gives you visible precision gains at essentially zero extra wall-clock budget; top-100 is where you start hurting the end-user. In production you tune K to your SLA: 10-50 is typical, 100+ only when recall matters more than latency and you can batch aggressively on the GPU.
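The shape of that stage is easy to sketch. Here a toy word-overlap scorer stands in for the real cross-encoder forward pass (in the lab that would be BAAI/bge-reranker-base); the structural point — you score only the top-K fused candidates, so cost is linear in K — is the same either way:

```python
def score_pair(query, passage):
    # Stand-in for a cross-encoder forward pass over the joint (query, passage)
    # sequence; here just token overlap so the sketch runs without a GPU.
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def rerank(query, candidates, top_k=5):
    pool = candidates[:top_k]   # wall-clock cost scales linearly with top_k
    return sorted(pool, key=lambda psg: score_pair(query, psg), reverse=True)

top = rerank(
    "bm25 idf formula",
    ["dense vector retrieval", "the bm25 idf formula explained", "cooking pasta"],
    top_k=3,
)
# top[0] → "the bm25 idf formula explained"
```

Swapping the toy scorer for a real model changes one function; the K-vs-latency trade-off discussed above lives entirely in the `top_k` slice.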

Does BM25 really need to be written from scratch — isn't there a library?

There are several (rank_bm25, pyserini, Lucene bindings) and you absolutely use them in production. The lab implements it by hand because the formula — IDF weighting, term-frequency saturation, length normalization — is the entire reason BM25 beats naive keyword matching, and you should be able to read it off a page. Once you've written the 30-line version, swapping in a fast library is a trivial optimization, not a conceptual step.
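For a feel of what that ~30-line version looks like, here is a minimal sketch using the smoothed Robertson–Spärck Jones IDF with the standard k1/b defaults. The whitespace tokenizer is an illustrative simplification — the lab's implementation may differ in detail:

```python
import math
from collections import Counter

class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = [d.lower().split() for d in docs]   # naive tokenizer
        self.k1, self.b = k1, b
        self.N = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.N
        self.tfs = [Counter(d) for d in self.docs]
        df = Counter()
        for d in self.docs:
            df.update(set(d))
        # Smoothed RSJ IDF; the +1 inside the log keeps weights positive
        self.idf = {t: math.log((self.N - n + 0.5) / (n + 0.5) + 1)
                    for t, n in df.items()}

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.docs[i])
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            f = tf[t]
            # Term-frequency saturation (k1) + document-length normalization (b)
            denom = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            s += self.idf[t] * f * (self.k1 + 1) / denom
        return s

    def rank(self, query):
        return sorted(range(self.N), key=lambda i: self.score(query, i), reverse=True)
```

All three ingredients the paragraph names are visible on the page: `self.idf` is the IDF weighting, the `f * (k1 + 1) / denom` shape saturates term frequency, and the `dl / avgdl` term penalizes long documents.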

What if my dense encoder already gets the rare-keyword query right?

The check script handles that case: BGE-small is strong enough that on a tiny corpus it sometimes succeeds on exactly the query you designed to break it. The pipeline still demonstrates the full hybrid + rerank path either way. At real corpus scale (hundreds of thousands of passages) and with naturally adversarial queries, dense-only regresses frequently enough that hybrid remains the production default — this lab gives you the instrumentation to see it.

How is the cross-encoder different from just a bigger bi-encoder?

Architecture, not size. A bi-encoder encodes query and passage independently and compares fixed vectors — attention never flows between them. A cross-encoder concatenates query and passage into a single sequence, so every query token attends to every passage token through all layers. That joint attention captures reasoning about negation, quantifiers, and co-reference that no amount of embedding-dimensionality can express. Cost: you cannot precompute passage vectors, so you only run it on a short candidate list.
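A toy cost model makes the asymmetry concrete. The unit here ("one transformer forward pass") and the function names are illustrative, not output from a real profiler:

```python
def bi_encoder_cost(num_docs, num_queries):
    # Passage encodes happen once, offline, and are cached as vectors;
    # each query pays one encode, then scoring is a single matmul.
    offline = num_docs
    online = num_queries
    return offline, online

def cross_encoder_cost(num_docs, num_queries):
    # Nothing can be precomputed: every query re-runs the model over
    # every candidate passage as a joint sequence.
    return 0, num_queries * num_docs

# 1M docs, 100 queries: 100 online passes vs 100M — hence rerank-only-the-top-K
bi = bi_encoder_cost(1_000_000, 100)       # (1_000_000, 100)
cross = cross_encoder_cost(1_000_000, 100) # (0, 100_000_000)
```

The arithmetic is why the two stages compose rather than compete: the bi-encoder amortizes its cost offline to shrink the candidate list, and the cross-encoder spends its per-pair budget only where the joint attention pays off.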