Retrieval-Augmented Generation (RAG) Pipeline with Local Models
Build an end-to-end RAG pipeline on a single GPU: BGE embeddings, L2-normalized vector retrieval by dot product, and a local generator that answers with and without retrieved context so you can see exactly what retrieval changes.
What you'll learn
1. Build a corpus + load the embedding model
2. Chunk the documents, embed the chunks
3. Retrieval: top-K nearest chunks
4. Generate with and without retrieval context
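The chunking strategy itself isn't specified above; as a minimal sketch of step 2, here is a fixed-size word window with overlap (`chunk_words`, `chunk_size`, and `overlap` are illustrative names, not the lab's API):

```python
def chunk_words(text, chunk_size=64, overlap=16):
    """Split text into overlapping word windows.
    Hypothetical helper: the lab's actual chunker may split differently."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(words):
            break
    return chunks

# 150 synthetic "words" produce 3 overlapping chunks with these settings.
doc = " ".join(f"w{i}" for i in range(150))
chunks = chunk_words(doc)
print(len(chunks))
```

Overlap trades index size for recall: a sentence split across a window boundary still appears whole in at least one chunk.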
Prerequisites
- Python + PyTorch basics
- Familiarity with transformer embeddings and tokenization
- Basic linear algebra (dot product, cosine similarity)
What you'll build in this end-to-end RAG lab
RAG is the production must-know for anyone adding LLMs to a real product — retrieval is the mechanism that lets a frozen model talk about your data, this week's data, or data too large to fit in any context window. In 45 minutes you'll stand up a complete Retrieval-Augmented Generation pipeline on a single GPU with no vector database and no orchestration framework, just the primitives underneath every production stack: a BGE bi-encoder, an L2-normalized passage index that lives as one CUDA tensor, dot-product top-K retrieval via a single matmul, and a local causal LM that answers the same question with and without retrieved context so you can see exactly what retrieval changes. You'll walk away with a clear mental model of why dot product equals cosine similarity on unit-norm vectors, why every FAISS/Milvus/pgvector index is an engineering optimization over this exact primitive, and a diagnostic framework for the four places RAG pipelines break (chunker, embedder, retriever, generator).
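Stripped of the actual models, that core fits in a few lines of PyTorch. The sketch below substitutes random unit vectors for BGE embeddings and plain string templates for the generator prompts; `index`, `rag_prompt`, and the other names are illustrative rather than the lab's variables, and tensors stay on CPU here where the lab would use `.cuda()`:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for the embedded corpus: in the lab each row is a BGE embedding.
chunks = [f"passage {i} ..." for i in range(8)]
index = F.normalize(torch.randn(len(chunks), 32), dim=-1)  # (N, D), unit rows
q_vec = F.normalize(torch.randn(1, 32), dim=-1)            # (1, D)

# Retrieval is one matmul plus top-K: on unit-norm vectors the raw scores
# are cosine similarities, so they sort correctly as-is.
scores = q_vec @ index.T                  # (1, N)
top = torch.topk(scores, k=3, dim=-1)
context = "\n".join(chunks[i] for i in top.indices[0].tolist())

# Same question, two prompts: diffing the two answers shows exactly what
# retrieval changed.
question = "What does the corpus say about X?"
rag_prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
no_rag_prompt = f"Question: {question}\nAnswer:"
print(rag_prompt != no_rag_prompt)
```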
The technical substance is where each primitive comes from and why. BGE was trained with a dedicated CLS objective, so the [CLS] token's final hidden state is the sentence embedding — mean pooling silently underperforms even though it's the default for sentence-BERT style encoders. L2-normalization isn't decoration: it makes a @ b.T produce values in [-1, 1] that equal cos(a, b) exactly, which means torch.topk sorts correctly on raw scores without a separate norm division. Step 4's grader requires rag_answer != no_rag_answer so you can't get credit for retrieval that didn't actually condition generation — and when they do differ, the reflection asks you to diagnose whether retrieval changed the answer by grounding it or just by perturbing it, which is the real question. You'll also see the operational points engineers learn the hard way: high cosine similarity can still surface irrelevant passages when query and corpus are paraphrased away from each other, a bad retriever gives the model confident-looking wrong context to hallucinate from, and 'RAG fixes hallucinations' is only true when recall@k and faithfulness are measured separately.
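The dot-product-equals-cosine claim is easy to check numerically:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
a = torch.randn(4, 16)
b = torch.randn(4, 16)

# Cosine similarity computed the explicit way...
cos = F.cosine_similarity(a, b, dim=-1)

# ...matches a plain elementwise dot product once both sides are
# L2-normalized, which is why the retriever needs no norm division.
a_n = F.normalize(a, dim=-1)
b_n = F.normalize(b, dim=-1)
dot = (a_n * b_n).sum(dim=-1)

print(torch.allclose(cos, dot, atol=1e-6))
```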
Prerequisites are Python plus PyTorch basics, familiarity with transformer tokenization, and enough linear algebra to know what a dot product is. The sandbox is a real NVIDIA GPU pod we provision per session with BGE, the generator model, tokenizers, and CUDA preinstalled. Checks are strict about correctness — L2-normalized embeddings (unit norm within 1e-3), passage vectors on GPU with dimensions matching the chunk count, top-K ordering descending with the top-1 for a Llama-3-tokens query actually retrieving a relevant chunk, and the final answer grader failing if rag_answer == no_rag_answer. Once you have this baseline working, the Advanced RAG lab adds BM25, Reciprocal Rank Fusion for hybrid search, and cross-encoder reranking on top.
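Those checks translate to assertions roughly like the following (a sketch, not the lab's actual grader; CPU tensors here, where the real checker would also assert the index tensor is on the GPU):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
chunks = [f"chunk {i}" for i in range(8)]
emb = F.normalize(torch.randn(len(chunks), 32), dim=-1)  # stand-in embeddings

# Unit norm within 1e-3.
norms = emb.norm(dim=-1)
assert torch.allclose(norms, torch.ones_like(norms), atol=1e-3)

# One vector per chunk. (The real check also requires emb.is_cuda.)
assert emb.shape[0] == len(chunks)

# Top-K scores come back descending; torch.topk sorts by default.
q = F.normalize(torch.randn(1, 32), dim=-1)
vals, idx = torch.topk(q @ emb.T, k=3)
assert torch.all(vals[0, :-1] >= vals[0, 1:])

# RAG and no-RAG answers must differ for credit (placeholder strings here).
rag_answer, no_rag_answer = "grounded answer", "generic answer"
assert rag_answer != no_rag_answer
print("all checks passed")
```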
Frequently asked questions
Why use CLS pooling with BGE instead of mean pooling?
The [CLS] token's final hidden state is what the contrastive loss shaped into a sentence embedding. Mean pooling over all token states is common for other encoders (sentence-BERT's bert-base-nli variants use it), but for BGE specifically it underperforms because those token states weren't optimized for pooling. Always check the model card: swapping pooling strategies between encoders is one of the silent ways to ship a subtly broken retriever.
Why L2-normalize the embeddings before retrieval?
Because on unit-norm vectors, a @ b.T produces values in [-1, 1] that equal cos(a, b) exactly. You get cosine similarity from a single matmul — no separate norm division, no numerical drift, and torch.topk sorts correctly on the raw scores. Skipping normalization doesn't break retrieval, but it forces you to divide by norms at query time and silently makes short passages look more similar than they should.
How is this lab different from the Advanced RAG lab?
This lab builds the baseline from raw primitives; the Advanced RAG lab then adds BM25, Reciprocal Rank Fusion for hybrid search, and cross-encoder reranking on top of it.
Why does the lab not use a vector database like FAISS, Milvus, or pgvector?
For a corpus small enough to live in one tensor, a plain @ matmul is faster and clearer than any ANN index. You want to understand that retrieval is ultimately a distance computation over a matrix — then the vector databases, graph indices, quantization tricks, and sharding schemes that ship at production scale are recognizable as engineering optimizations over this exact primitive, not mysterious black boxes.
What happens if rag_answer equals no_rag_answer?
The final answer grader fails the step, because identical outputs mean retrieval didn't actually condition generation; one common cause is a max_new_tokens so low that both answers come out generic. The reflection step pushes you to separate the possibilities: chunker bug, embedder bug, retriever bug, or generator bug, each with a different fix.
Can I swap BGE for a different embedding model?
Yes. Try intfloat/e5-small-v2 or sentence-transformers/all-MiniLM-L6-v2 and watch whether retrieval quality changes on your specific corpus. You'll need to match the pooling strategy to the model (BGE uses CLS; E5 and MiniLM use mean) and re-check that the output vectors are L2-normalized. The pipeline code otherwise doesn't change, which is the whole point of treating the embedder as a pluggable component.
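A small pooling dispatcher makes the pluggable-embedder pattern concrete. The registry below reflects the respective model cards as I read them (BGE uses CLS pooling; E5 and MiniLM use mean), so verify against each card before relying on it; `pool` and `POOLING` are illustrative names, and the demo uses stand-in tensors rather than real encoder outputs:

```python
import torch

def pool(last_hidden_state, attention_mask, strategy):
    """Collapse per-token states (B, T, H) into one sentence vector (B, H)."""
    if strategy == "cls":
        return last_hidden_state[:, 0]               # first ([CLS]) position
    if strategy == "mean":
        mask = attention_mask.unsqueeze(-1).float()  # (B, T, 1)
        return (last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    raise ValueError(f"unknown pooling strategy: {strategy}")

# Hypothetical registry; check each model card before trusting this mapping.
POOLING = {
    "BAAI/bge-small-en-v1.5": "cls",
    "intfloat/e5-small-v2": "mean",
    "sentence-transformers/all-MiniLM-L6-v2": "mean",
}

torch.manual_seed(0)
h = torch.randn(2, 5, 8)                  # (batch, tokens, hidden) stand-in
mask = torch.ones(2, 5, dtype=torch.long)
mask[1, 3:] = 0                           # second sequence is padded

cls_vec = pool(h, mask, POOLING["BAAI/bge-small-en-v1.5"])
mean_vec = pool(h, mask, POOLING["intfloat/e5-small-v2"])
print(cls_vec.shape, mean_vec.shape)
```

Either way, L2-normalize the pooled vectors afterwards so the matmul-retrieval step keeps returning cosine similarities.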