Advanced RAG: Hybrid Search + Cross-Encoder Reranking
Build a production-shape retrieval stack — dense bi-encoder plus from-scratch BM25, fused with Reciprocal Rank Fusion, then re-ordered by a BAAI cross-encoder. The exact architecture behind modern enterprise RAG.
What you'll learn
1. Set up corpus + dense baseline, find a dense-failure case
2. BM25 keyword retrieval from scratch
3. Reciprocal Rank Fusion: combine dense + BM25
4. Cross-encoder reranking
Prerequisites
- Familiarity with transformer embedding models (BGE, sentence-transformers)
- Completed a basic RAG pipeline (dense retrieval + LLM)
- Comfortable with PyTorch tensors on CUDA
What you'll build in this hybrid retrieval + reranking lab
Dense-only retrieval breaks on rare proper nouns, acronyms, and product codes — the exact queries that show up most in enterprise RAG. Hybrid search fused with cross-encoder reranking is the architecture every serious production system converges to, and in 40 minutes you'll build it end-to-end. You'll leave with a working two-stage retrieval stack (dense bi-encoder + BM25 fused via Reciprocal Rank Fusion, then reranked by a cross-encoder), a from-scratch BM25 implementation that makes the Robertson-Sparck-Jones IDF formula no longer feel like a black box, and a clear mental model of why bi-encoder vs cross-encoder is the trade-off that governs every modern RAG system. You'll also see the score gap between the correct doc and its runner-up widen after reranking — a visceral demo of what the cross-encoder is actually buying you over cheaper ranking methods.
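The from-scratch BM25 you'll build can be sketched roughly like this. Everything here is illustrative, not the lab's exact code: the whitespace tokenizer, the k1 = 1.5 / b = 0.75 defaults, and the sample corpus are all assumptions.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with Okapi BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # Document frequency per term, then the smoothed Robertson-Sparck-Jones
    # IDF (the +1 inside the log keeps scores non-negative, Lucene-style).
    df = Counter()
    for d in docs_tokens:
        for term in set(d):
            df[term] += 1
    idf = {t: math.log((N - n + 0.5) / (n + 0.5) + 1) for t, n in df.items()}
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            # Term-frequency saturation (k1) + document-length normalization (b)
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf[t] * num / den
        scores.append(s)
    return scores

# Toy corpus: a rare product code a dense encoder would plausibly fumble
docs = [t.lower().split() for t in [
    "the XJ-900 throttle unit ships in Q3",
    "general overview of engine components",
    "maintenance schedule for all units",
]]
print(bm25_scores("xj-900 throttle".split(), docs))
```

Because "xj-900" appears in exactly one document, its IDF is high and BM25 ranks that document first, which is precisely the rare-term behavior the dense baseline misses.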
The technical substance is the two asymmetries that force the two-stage shape. First asymmetry: bi-encoders like BAAI/bge-small-en-v1.5 encode query and passage independently, so passage vectors precompute offline and every query is one matmul — fast enough for a million-document corpus. Cross-encoders like BAAI/bge-reranker-base concatenate (query, passage) and run full attention across both, so every query token attends to every passage token — far more accurate, completely uncacheable, and linear in candidate count. You can't run a cross-encoder over a million docs; you can't trust a bi-encoder alone on rare-term queries. Second asymmetry: dense cosine scores live in [-1, 1] while BM25 scales with IDF and document length into the 20s. RRF sidesteps this with 1 / (k + rank), k=60 — rank-based fusion that's calibration-free, monotonic, and the default in modern retrieval papers because any normalization you pick (min-max, z-score) is a brittle hyperparameter that drifts with the corpus. You'll also see the practical tuning lever: top-5 reranking gives visible precision gains at essentially zero latency cost, top-100 is where you start hurting the end-user SLA.
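RRF itself is only a few lines. Here is a minimal sketch of the fusion step described above; the doc IDs are made up purely for illustration:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each ranked list contributes 1/(k + rank)
    per doc, so only positions matter -- raw scores never need calibrating."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d2", "d7", "d1"]   # dense bi-encoder top-3 (illustrative IDs)
bm25  = ["d7", "d3", "d2"]   # BM25 top-3
print(rrf_fuse([dense, bm25]))
```

Note that d7, ranked second by dense and first by BM25, edges out d2 (first and third): consistent mid-list agreement beats one strong placement, which is exactly the behavior you want from a calibration-free fusion.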
You should already be comfortable with transformer embeddings (BGE, sentence-transformers), have built at least one basic RAG pipeline (the rag-pipeline lab is the warm-up if not), and know your way around PyTorch CUDA tensors. The sandbox is a real NVIDIA GPU pod we provision per session, with BGE, the BAAI reranker checkpoint, and dependencies preinstalled. Checks run strictly against real retrieval outcomes: L2-normalized embeddings paired with a failure-case dict, BM25 actually retrieving the rare-keyword doc the dense encoder missed, RRF producing the correct top-1, and the reranker preserving the correct doc at top-1 with a widened score gap over the runner-up.
Frequently asked questions
Why build hybrid + reranking when a strong dense encoder like BGE gets most queries right?
Because the queries that matter most in enterprise RAG are exactly the ones dense encoders miss: rare proper nouns, acronyms, and product codes. BM25 catches those exact-match cases, RRF folds them back into the candidate list, and the cross-encoder restores precision at the top. A strong bi-encoder alone can't be trusted on rare-term queries.
Why Reciprocal Rank Fusion instead of score averaging or min-max normalization?
1 / (k + rank), with k = 60. It is calibration-free, monotonic, and the default in modern retrieval papers, because any score normalization you pick (min-max, z-score) is a brittle hyperparameter that drifts with the corpus.
Why rerank only the top-5 and not the top-100?
The cross-encoder runs full attention over every (query, passage) pair with no caching, so latency scales linearly with the candidate count. Top-5 gives you visible precision gains at essentially zero extra wall-clock budget; top-100 is where you start hurting the end-user. In production you tune K to your SLA: 10-50 is typical, 100+ only when recall matters more than latency and you can batch aggressively on the GPU.
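The top-K trade-off can be seen in the shape of the reranking loop itself: cost is exactly K scoring passes over the fused candidates, nothing amortizes. A minimal sketch, where `toy_score` is a purely illustrative stand-in for a real cross-encoder forward pass (the lab uses BAAI/bge-reranker-base):

```python
def rerank(query, candidates, score_pair, top_k=5):
    """Cross-encoder-style reranking: score each (query, passage) pair.
    Cost is exactly top_k calls to score_pair -- nothing can be cached,
    because the score depends jointly on query and passage."""
    pool = candidates[:top_k]  # only the fused top-K reaches the reranker
    scored = [(score_pair(query, p), p) for p in pool]
    scored.sort(reverse=True)
    return [p for _, p in scored]

# Illustrative stand-in for a cross-encoder: raw token overlap.
def toy_score(query, passage):
    return len(set(query.split()) & set(passage.split()))

cands = ["a b c", "a x", "x y z"]  # pretend this is the RRF-fused list
print(rerank("a b q", cands, toy_score, top_k=3))
```

Swapping `toy_score` for a real model changes nothing structural: K stays the single lever trading precision against the latency budget.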