Retrieval-Augmented Generation (RAG) Pipeline with Local Models
GPU sandbox · jupyter
Beta

Build an end-to-end RAG pipeline on a single GPU: BGE embeddings, L2-normalized vector retrieval by dot product, and a local generator that answers with and without retrieved context so you can see exactly what retrieval changes.

45 min · 4 steps · 2 domains · Intermediate · ncp-genl · nca-genl

What you'll learn

  1. Build a corpus + load the embedding model
  2. Chunk the documents, embed the chunks
  3. Retrieval: top-K nearest chunks
  4. Generate with and without retrieval context

Prerequisites

  • Python + PyTorch basics
  • Familiarity with transformer embeddings and tokenization
  • Basic linear algebra (dot product, cosine similarity)

Exam domains covered

Retrieval-Augmented Generation & Vector Search · LLM Application Development

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

RAG · Embeddings · BGE · Vector Search · Cosine Similarity · Retrieval · LLM · Grounding

What you'll build in this end-to-end RAG lab

RAG is the production must-know for anyone adding LLMs to a real product — retrieval is the mechanism that lets a frozen model talk about your data, this week's data, or data too large to fit in any context window. In 45 minutes you'll stand up a complete Retrieval-Augmented Generation pipeline on a single GPU with no vector database and no orchestration framework, just the primitives underneath every production stack: a BGE bi-encoder, an L2-normalized passage index that lives as one CUDA tensor, dot-product top-K retrieval via a single matmul, and a local causal LM that answers the same question with and without retrieved context so you can see exactly what retrieval changes. You'll walk away with a clear mental model of why dot product equals cosine similarity on unit-norm vectors, why every FAISS/Milvus/pgvector index is an engineering optimization over this exact primitive, and a diagnostic framework for the four places RAG pipelines break (chunker, embedder, retriever, generator).

The technical substance is where each primitive comes from and why. BGE was trained with a dedicated CLS objective, so the [CLS] token's final hidden state is the sentence embedding — mean pooling silently underperforms even though it's the default for sentence-BERT style encoders. L2-normalization isn't decoration: it makes a @ b.T produce values in [-1, 1] that equal cos(a, b) exactly, which means torch.topk sorts correctly on raw scores without a separate norm division. Step 4's grader requires rag_answer != no_rag_answer so you can't get credit for retrieval that didn't actually condition generation — and when they do differ, the reflection asks you to diagnose whether retrieval changed the answer by grounding it or just by perturbing it, which is the real question. You'll also see the operational points engineers learn the hard way: high cosine similarity can still surface irrelevant passages when query and corpus are paraphrased away from each other, a bad retriever gives the model confident-looking wrong context to hallucinate from, and 'RAG fixes hallucinations' is only true when recall@k and faithfulness are measured separately.

Prerequisites are Python plus PyTorch basics, familiarity with transformer tokenization, and enough linear algebra to know what a dot product is. The sandbox is a real NVIDIA GPU pod we provision per session with BGE, the generator model, tokenizers, and CUDA preinstalled. Checks are strict about correctness — L2-normalized embeddings (unit norm within 1e-3), passage vectors on GPU with dimensions matching the chunk count, top-K ordering descending with the top-1 result for a query about Llama-3 tokens actually retrieving a relevant chunk, and the final answer grader failing if rag_answer == no_rag_answer. Once you have this baseline working, the Advanced RAG lab adds BM25, Reciprocal Rank Fusion for hybrid search, and cross-encoder reranking on top.
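Those strict checks can be sketched as plain assertions. A minimal sketch with random stand-in vectors (the variable names `passage_emb`, `query_emb`, and the sizes are illustrative, not the lab's actual graded names):

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins for the lab's state: a small L2-normalized
# "passage index" tensor and one normalized query embedding.
torch.manual_seed(0)
num_chunks, dim, k = 8, 16, 3
passage_emb = F.normalize(torch.randn(num_chunks, dim), dim=-1)
query_emb = F.normalize(torch.randn(1, dim), dim=-1)
scores, idx = torch.topk(query_emb @ passage_emb.T, k=k, dim=-1)

# Check 1: every passage vector has unit norm within 1e-3.
assert torch.allclose(passage_emb.norm(dim=-1), torch.ones(num_chunks), atol=1e-3)
# Check 2: index rows match the chunk count.
assert passage_emb.shape[0] == num_chunks
# Check 3: top-K scores come back sorted descending.
assert torch.all(scores[0, :-1] >= scores[0, 1:])
```

On the real pod the index tensor would live on `cuda`; the assertions are device-agnostic.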

Frequently asked questions

Why use CLS pooling with BGE instead of mean pooling?

BGE was trained with a dedicated CLS objective — the [CLS] token's final hidden state is what the contrastive loss shaped into a sentence embedding. Mean pooling over all token states is common for other encoders (sentence-BERT's bert-base-nli variants use it), but for BGE specifically it underperforms because those token states weren't optimized for pooling. Always check the model card: swapping pooling strategies between encoders is one of the silent ways to ship a subtly broken retriever.
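The two pooling strategies differ only in how the encoder's final hidden states are reduced. A minimal sketch on a dummy `last_hidden_state` tensor (shapes assumed, no model download; in the lab this tensor would come from `model(**inputs).last_hidden_state`):

```python
import torch

# Dummy encoder output: batch of 2 sequences, 5 tokens, hidden size 8.
last_hidden_state = torch.randn(2, 5, 8)
attention_mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])

# CLS pooling (what BGE was trained for): take position 0.
cls_emb = last_hidden_state[:, 0]

# Mean pooling (sentence-BERT style): average real tokens, mask out padding.
mask = attention_mask.unsqueeze(-1).float()
mean_emb = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

# Same output shape, different vectors -- which is why silently swapping
# the strategy between encoders ships a subtly broken retriever.
assert cls_emb.shape == mean_emb.shape == (2, 8)
```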

Why L2-normalize the embeddings before retrieval?

After normalization every vector has unit length, so a @ b.T produces values in [-1, 1] that equal cos(a, b) exactly. You get cosine similarity from a single matmul — no separate norm division, no numerical drift, and torch.topk sorts correctly on the raw scores. Skipping normalization doesn't break retrieval but forces you to divide by norms at query time and silently makes short passages look more similar than they should.
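That claim is easy to verify numerically. A sketch with random stand-in embeddings (sizes illustrative) showing that the single matmul on unit-norm vectors reproduces explicit cosine similarity, so `topk` on raw scores is already a correct nearest-neighbor sort:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
passages = F.normalize(torch.randn(100, 64), dim=-1)  # 100 chunk embeddings
query = F.normalize(torch.randn(1, 64), dim=-1)       # one query embedding

# One matmul gives the similarity to every passage, values in [-1, 1]...
scores = query @ passages.T                            # shape (1, 100)
# ...and matches the explicit cosine computation term for term.
assert torch.allclose(scores[0], F.cosine_similarity(query, passages), atol=1e-5)

# topk on the raw scores: no separate norm division needed.
top_scores, top_idx = torch.topk(scores, k=5, dim=-1)
```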

How is this lab different from the Advanced RAG lab?

This one is the end-to-end baseline: dense retrieval, top-K, generation with and without context — the minimum viable pipeline, focused on correctness and on showing you that retrieval actually changes the answer. Advanced RAG adds BM25, Reciprocal Rank Fusion for hybrid search, and a cross-encoder reranker on top — the two-stage retrieval pattern used in production enterprise RAG. Do this one first; take that one when you want to see where the recall and precision gaps come from.

Why does the lab not use a vector database like FAISS, Milvus, or pgvector?

Because the corpus fits in a single CUDA tensor, and at that scale a single dense matmul is faster and clearer than any ANN index. You want to understand that retrieval is ultimately a distance computation over a matrix — then the vector databases, graph indices, quantization tricks, and sharding schemes that ship at production scale are recognizable as engineering optimizations over this exact primitive, not mysterious black boxes.

What happens if rag_answer equals no_rag_answer?

The grader fails Step 4 and tells you retrieval had no effect. The usual causes: the prompt template ignores the injected context (common with chat-formatted generators where the system message gets stripped), the top-K chunks didn't contain the answer so the model fell back to priors, or max_new_tokens is so low both answers are generic. The reflection step pushes you to separate these: chunker bug, embedder bug, retriever bug, or generator bug, each with a different fix.
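The first failure mode, a template that never interpolates the retrieved chunks, is the easiest to check by building both prompts explicitly. A hedged sketch (the template wording and function name are illustrative, not the lab's):

```python
def build_prompt(question, context_chunks=None):
    """Illustrative template: the RAG prompt must actually interpolate the
    retrieved chunks, or generation is conditioned on nothing new."""
    if context_chunks:
        context = "\n\n".join(context_chunks)
        return (f"Use only this context to answer.\n\n"
                f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    return f"Question: {question}\nAnswer:"

question = "How many tokens was Llama 3 pretrained on?"
no_rag_prompt = build_prompt(question)
rag_prompt = build_prompt(question, ["Llama 3 was pretrained on over 15T tokens."])

# If these two prompts were identical, rag_answer == no_rag_answer is guaranteed.
assert rag_prompt != no_rag_prompt
assert "15T" in rag_prompt
```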

Can I swap BGE for a different embedding model?

Yes, and it's an instructive exercise. Try intfloat/e5-small-v2 or sentence-transformers/all-MiniLM-L6-v2 and watch whether retrieval quality changes on your specific corpus. You'll need to match the pooling strategy to the model (BGE uses CLS pooling; E5 and MiniLM use mean pooling) and re-check that output vectors are L2-normalized. The pipeline code otherwise doesn't change, which is the whole point of treating the embedder as a pluggable component.
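One way to keep the embedder pluggable is to dispatch the pooling strategy by model name. A sketch under the assumption that the mapping below matches each model card (always verify against the card before relying on it), run here on dummy tensors rather than real encoder outputs:

```python
import torch
import torch.nn.functional as F

# Assumed pooling choice per model card -- double-check before swapping.
POOLING = {
    "BAAI/bge-small-en-v1.5": "cls",
    "sentence-transformers/all-MiniLM-L6-v2": "mean",
}

def pool(last_hidden_state, attention_mask, strategy):
    if strategy == "cls":
        emb = last_hidden_state[:, 0]
    elif strategy == "mean":
        mask = attention_mask.unsqueeze(-1).float()
        emb = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    else:
        raise ValueError(f"unknown pooling strategy: {strategy}")
    # Re-normalize so downstream dot-product retrieval stays cosine.
    return F.normalize(emb, dim=-1)

# Dummy encoder output standing in for model(**inputs).last_hidden_state.
h = torch.randn(2, 4, 8)
m = torch.ones(2, 4, dtype=torch.long)
for name, strategy in POOLING.items():
    v = pool(h, m, strategy)
    assert torch.allclose(v.norm(dim=-1), torch.ones(2), atol=1e-5)
```

The rest of the pipeline only ever sees unit-norm vectors, so swapping the model behind `pool` leaves retrieval and generation untouched.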