Retrieval-Augmented Generation (RAG) Pipeline with Local Models
Build an end-to-end RAG pipeline on a single GPU: BGE embeddings, L2-normalized vector retrieval by dot product, and a local generator that answers with and without retrieved context so you can see exactly what retrieval changes.
What you'll learn
1. Build a corpus + load the embedding model
2. Chunk the documents, embed the chunks
3. Retrieval: top-K nearest chunks
4. Generate with and without retrieval context
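The chunking strategy itself isn't specified above; as a minimal sketch of step 2, here is a fixed-size word window with overlap (`chunk_words`, `chunk_size`, and `overlap` are illustrative names, not the lab's API):

```python
def chunk_words(text, chunk_size=64, overlap=16):
    """Split text into overlapping word windows.
    Hypothetical helper: the lab's actual chunker may split differently."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(words):
            break
    return chunks

# 150 synthetic "words" produce 3 overlapping chunks with these settings.
doc = " ".join(f"w{i}" for i in range(150))
chunks = chunk_words(doc)
print(len(chunks))
```

Overlap trades index size for recall: a sentence split across a window boundary still appears whole in at least one chunk.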
Prerequisites
- Python + PyTorch basics
- Familiarity with transformer embeddings and tokenization
- Basic linear algebra (dot product, cosine similarity)
What you'll build in this end-to-end RAG lab
RAG is the production must-know for anyone adding LLMs to a real product — retrieval is the mechanism that lets a frozen model talk about your data, this week's data, or data too large to fit in any context window. In 45 minutes you'll stand up a complete Retrieval-Augmented Generation pipeline on a single GPU with no vector database and no orchestration framework, just the primitives underneath every production stack: a BGE bi-encoder, an L2-normalized passage index that lives as one CUDA tensor, dot-product top-K retrieval via a single matmul, and a local causal LM that answers the same question with and without retrieved context so you can see exactly what retrieval changes. You'll walk away with a clear mental model of why dot product equals cosine similarity on unit-norm vectors, why every FAISS/Milvus/pgvector index is an engineering optimization over this exact primitive, and a diagnostic framework for the four places RAG pipelines break (chunker, embedder, retriever, generator).
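Stripped of the actual models, that core fits in a few lines of PyTorch. The sketch below substitutes random unit vectors for BGE embeddings and plain string templates for the generator prompts; `index`, `rag_prompt`, and the other names are illustrative rather than the lab's variables, and tensors stay on CPU here where the lab would use `.cuda()`:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for the embedded corpus: in the lab each row is a BGE embedding.
chunks = [f"passage {i} ..." for i in range(8)]
index = F.normalize(torch.randn(len(chunks), 32), dim=-1)  # (N, D), unit rows
q_vec = F.normalize(torch.randn(1, 32), dim=-1)            # (1, D)

# Retrieval is one matmul plus top-K: on unit-norm vectors the raw scores
# are cosine similarities, so they sort correctly as-is.
scores = q_vec @ index.T                  # (1, N)
top = torch.topk(scores, k=3, dim=-1)
context = "\n".join(chunks[i] for i in top.indices[0].tolist())

# Same question, two prompts: diffing the two answers shows exactly what
# retrieval changed.
question = "What does the corpus say about X?"
rag_prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
no_rag_prompt = f"Question: {question}\nAnswer:"
print(rag_prompt != no_rag_prompt)
```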
The technical substance is where each primitive comes from and why. BGE was trained with a dedicated CLS objective, so the [CLS] token's final hidden state is the sentence embedding — mean pooling silently underperforms even though it's the default for sentence-BERT style encoders. L2-normalization isn't decoration: it makes a @ b.T produce values in [-1, 1] that equal cos(a, b) exactly, which means torch.topk sorts correctly on raw scores without a separate norm division. Step 4's grader requires rag_answer != no_rag_answer so you can't get credit for retrieval that didn't actually condition generation — and when they do differ, the reflection asks you to diagnose whether retrieval changed the answer by grounding it or just by perturbing it, which is the real question. You'll also see the operational points engineers learn the hard way: high cosine similarity can still surface irrelevant passages when query and corpus are paraphrased away from each other, a bad retriever gives the model confident-looking wrong context to hallucinate from, and 'RAG fixes hallucinations' is only true when recall@k and faithfulness are measured separately.
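The dot-product-equals-cosine claim is easy to check numerically:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
a = torch.randn(4, 16)
b = torch.randn(4, 16)

# Cosine similarity computed the explicit way...
cos = F.cosine_similarity(a, b, dim=-1)

# ...matches a plain elementwise dot product once both sides are
# L2-normalized, which is why the retriever needs no norm division.
a_n = F.normalize(a, dim=-1)
b_n = F.normalize(b, dim=-1)
dot = (a_n * b_n).sum(dim=-1)

print(torch.allclose(cos, dot, atol=1e-6))
```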
Prerequisites are Python plus PyTorch basics, familiarity with transformer tokenization, and enough linear algebra to know what a dot product is. The sandbox is a real NVIDIA GPU pod we provision per session with BGE, the generator model, tokenizers, and CUDA preinstalled. Checks are strict about correctness — L2-normalized embeddings (unit norm within 1e-3), passage vectors on GPU with dimensions matching the chunk count, top-K ordering descending with the top-1 for a Llama-3-tokens query actually retrieving a relevant chunk, and the final answer grader failing if rag_answer == no_rag_answer. Once you have this baseline working, the Advanced RAG lab adds BM25, Reciprocal Rank Fusion for hybrid search, and cross-encoder reranking on top.
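Those checks translate to assertions roughly like the following (a sketch, not the lab's actual grader; CPU tensors here, where the real checker would also assert the index tensor is on the GPU):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
chunks = [f"chunk {i}" for i in range(8)]
emb = F.normalize(torch.randn(len(chunks), 32), dim=-1)  # stand-in embeddings

# Unit norm within 1e-3.
norms = emb.norm(dim=-1)
assert torch.allclose(norms, torch.ones_like(norms), atol=1e-3)

# One vector per chunk. (The real check also requires emb.is_cuda.)
assert emb.shape[0] == len(chunks)

# Top-K scores come back descending; torch.topk sorts by default.
q = F.normalize(torch.randn(1, 32), dim=-1)
vals, idx = torch.topk(q @ emb.T, k=3)
assert torch.all(vals[0, :-1] >= vals[0, 1:])

# RAG and no-RAG answers must differ for credit (placeholder strings here).
rag_answer, no_rag_answer = "grounded answer", "generic answer"
assert rag_answer != no_rag_answer
print("all checks passed")
```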
Frequently asked questions
Why use CLS pooling with BGE instead of mean pooling?
The [CLS] token's final hidden state is what the contrastive loss shaped into a sentence embedding. Mean pooling over all token states is common for other encoders (sentence-BERT's bert-base-nli variants use it), but for BGE specifically it underperforms because those token states weren't optimized for pooling. Always check the model card: swapping pooling strategies between encoders is one of the silent ways to ship a subtly broken retriever.
Why L2-normalize the embeddings before retrieval?
Because on unit-norm vectors, a @ b.T produces values in [-1, 1] that equal cos(a, b) exactly. You get cosine similarity from a single matmul — no separate norm division, no numerical drift, and torch.topk sorts correctly on the raw scores. Skipping normalization doesn't break retrieval, but it forces you to divide by norms at query time and silently makes short passages look more similar than they should.
How is this lab different from the Advanced RAG lab?
This lab builds the baseline from raw primitives; the Advanced RAG lab then adds BM25, Reciprocal Rank Fusion for hybrid search, and cross-encoder reranking on top of it.
Why does the lab not use a vector database like FAISS, Milvus, or pgvector?
For a corpus small enough to live in one tensor, a plain @ matmul is faster and clearer than any ANN index. You want to understand that retrieval is ultimately a distance computation over a matrix — then the vector databases, graph indices, quantization tricks, and sharding schemes that ship at production scale are recognizable as engineering optimizations over this exact primitive, not mysterious black boxes.
What happens if rag_answer equals no_rag_answer?
The final answer grader fails the step, because identical outputs mean retrieval didn't actually condition generation; one common cause is a max_new_tokens so low that both answers come out generic. The reflection step pushes you to separate the possibilities: chunker bug, embedder bug, retriever bug, or generator bug, each with a different fix.
Can I swap BGE for a different embedding model?
Yes. Try intfloat/e5-small-v2 or sentence-transformers/all-MiniLM-L6-v2 and watch whether retrieval quality changes on your specific corpus. You'll need to match the pooling strategy to the model (BGE uses CLS; E5 and MiniLM use mean) and re-check that the output vectors are L2-normalized. The pipeline code otherwise doesn't change, which is the whole point of treating the embedder as a pluggable component.
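A small pooling dispatcher makes the pluggable-embedder pattern concrete. The registry below reflects the respective model cards as I read them (BGE uses CLS pooling; E5 and MiniLM use mean), so verify against each card before relying on it; `pool` and `POOLING` are illustrative names, and the demo uses stand-in tensors rather than real encoder outputs:

```python
import torch

def pool(last_hidden_state, attention_mask, strategy):
    """Collapse per-token states (B, T, H) into one sentence vector (B, H)."""
    if strategy == "cls":
        return last_hidden_state[:, 0]               # first ([CLS]) position
    if strategy == "mean":
        mask = attention_mask.unsqueeze(-1).float()  # (B, T, 1)
        return (last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    raise ValueError(f"unknown pooling strategy: {strategy}")

# Hypothetical registry; check each model card before trusting this mapping.
POOLING = {
    "BAAI/bge-small-en-v1.5": "cls",
    "intfloat/e5-small-v2": "mean",
    "sentence-transformers/all-MiniLM-L6-v2": "mean",
}

torch.manual_seed(0)
h = torch.randn(2, 5, 8)                  # (batch, tokens, hidden) stand-in
mask = torch.ones(2, 5, dtype=torch.long)
mask[1, 3:] = 0                           # second sequence is padded

cls_vec = pool(h, mask, POOLING["BAAI/bge-small-en-v1.5"])
mean_vec = pool(h, mask, POOLING["intfloat/e5-small-v2"])
print(cls_vec.shape, mean_vec.shape)
```

Either way, L2-normalize the pooled vectors afterwards so the matmul-retrieval step keeps returning cosine similarities.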