Evaluation & Benchmarking LLMs
Four evaluation lenses in one lab: compute real perplexity, expose BLEU's blindness to paraphrase, run side-by-side model comparisons, and build an LLM-as-judge harness with position-bias detection.
What you'll learn
1. Perplexity: the branching-factor intuition
2. BLEU & ROUGE: what they measure and what they miss
3. Side-by-side model comparison (GPT-2 vs GPT-2 Medium)
4. LLM-as-a-Judge: the modern pattern (and its pitfalls)
Prerequisites
- Can load a Hugging Face causal LM and run `.generate()`
- Understanding of cross-entropy loss and log-likelihood
- Comfortable with basic Python data structures and strings
Skills & technologies you'll practice
This intermediate-level GPU lab gives you real-world reps across perplexity computation, BLEU/ROUGE scoring, side-by-side model comparison, and LLM-as-judge evaluation.
What you'll evaluate in this LLM benchmarking lab
Shipping an LLM feature without a real evaluation stack is how teams end up debugging hallucinations in production. In 45 minutes you'll touch the four evaluation lenses that matter — perplexity, BLEU/ROUGE, side-by-side generations, and LLM-as-judge — and see concretely where each one fails, so you can triangulate across them instead of trusting any single score. You'll walk away with: a working LLM-as-judge harness that runs every comparison twice in swapped order to flag position bias; a concrete demo of why BLEU scores a paraphrase as 'different' even when the meaning is identical; the branching-factor intuition for perplexity (exp(loss) ≈ 'the model is as uncertain as if choosing among ~N equally likely next tokens'); and a defensible opinion about which subset of metrics you'd actually gate a release on — plus the dimensions you're still missing (safety, grounding, latency under load, regression tests on critical prompts).
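The branching-factor intuition is easy to verify with nothing but the standard library: if a model is exactly as uncertain as a uniform choice over N next tokens, its mean cross-entropy is ln(N), and exponentiating recovers N. A minimal sketch (the `perplexity` helper is illustrative, not part of any library):

```python
import math

def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

# Uniform uncertainty over 50 tokens -> cross-entropy ln(50) -> perplexity 50:
# "the model is as uncertain as if choosing among ~50 equally likely tokens".
print(perplexity(math.log(50)))   # → 50.0 (up to float rounding)

# A perfectly confident (and correct) model has zero loss and perplexity 1.
print(perplexity(0.0))            # → 1.0
```

Lower mean loss therefore means a smaller effective branching factor, which is all a perplexity comparison between two models is really saying.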
The technical substance is the specific failure modes each metric hides.

Perplexity: compute it the right way — model.eval(), labels=input_ids so Hugging Face shifts them internally, and ppl = math.exp(mean_cross_entropy) — but recognize that it tracks likelihood under the model's own distribution, which is nearly orthogonal to helpfulness in chat (a heavily RLHF'd model can have higher perplexity on a news corpus than its base).

BLEU/ROUGE: n-gram-overlap metrics designed for machine-translation evaluation against reference translations — they score 'The feline rests upon the rug' against 'The cat is on the mat' as almost unrelated. The grader enforces synonym_bleu < 0.3 so you actually computed BLEU across the synonym pair, not the exact-match one.

LLM-as-judge: the most useful proxy for human preference at scale, but it carries position bias (judges favor whichever response appears first or second), verbosity bias (longer answers win), and self-preference bias (a model rates its own style as better). The judge-swap pattern — run every pair twice, once as (A, B) and once as (B, A), flag position_bias_detected = True when any verdict flips on swap — is the minimum instrumentation that keeps a single-judge score from being pure noise. The production escalation path from there is randomized ordering + multiple judges + calibration against a small human-rated set on your domain.
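The judge-swap pattern above can be sketched in a few lines of plain Python. The `judge` function here is a stand-in for a real judge-model call (an assumption for illustration) and is deliberately position-biased — it always prefers whichever response it sees first — so the detector fires:

```python
def judge(first: str, second: str) -> str:
    """Stub judge standing in for a judge-model call.

    Deliberately position-biased: the first position always wins."""
    return "first"

def compare_with_swap(resp_a: str, resp_b: str) -> dict:
    # Run the pair in both orders: (A, B), then (B, A).
    forward = "A" if judge(resp_a, resp_b) == "first" else "B"
    backward = "B" if judge(resp_b, resp_a) == "first" else "A"
    return {
        "forward": forward,            # winner when A is shown first
        "backward": backward,          # winner when B is shown first
        # A verdict that flips on swap is the position-bias signal.
        "position_bias_detected": forward != backward,
    }

verdict = compare_with_swap("Paris is the capital of France.", "It's Paris.")
print(verdict["position_bias_detected"])  # → True with this biased stub
```

With a real judge, `position_bias_detected` staying False across your comparison set is the minimum evidence that the single-judge scores mean anything at all.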
You need to be able to load a Hugging Face causal LM, run .generate(), and understand cross-entropy loss at an intuitive level. The sandbox is a real NVIDIA GPU pod provisioned per session, with transformers, GPT-2 and GPT-2 Medium, BLEU/ROUGE implementations, and a judge-model configuration all preinstalled.

The checks hold measurements to sensible ranges rather than magic numbers: ppl finite with 5 < ppl < 1000 for GPT-2 on English, bleu_score in [0, 1], synonym_bleu < 0.3 (which catches a learner who accidentally BLEU'd the identical pair), generations spanning ≥2 prompts across ≥2 models with non-empty text, judge_verdicts with ≥2 comparisons, and position_bias_detected as a proper bool derived from swapped-order runs.
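To see why the synonym pair scores so low without any library, here is a deliberately simplified unigram precision — the 1-gram core of BLEU. Real BLEU adds higher-order n-grams, a geometric mean, and a brevity penalty, so treat this as illustration only:

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: the 1-gram building block of BLEU,
    stripped of higher-order n-grams and the brevity penalty."""
    cand_tokens = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    # Each candidate token can match at most as often as it appears
    # in the reference ("clipping").
    matches = sum(
        min(count, ref_counts[token])
        for token, count in Counter(cand_tokens).items()
    )
    return matches / len(cand_tokens)

reference = "the cat is on the mat"
print(unigram_precision("the cat is on the mat", reference))       # → 1.0
print(unigram_precision("the feline rests upon the rug", reference))  # → ~0.33
```

Only "the" overlaps between the paraphrase and the reference, so the score collapses even though the meaning is identical — exactly the failure mode the synonym_bleu < 0.3 check makes you demonstrate.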
Frequently asked questions
Does perplexity predict whether users will like a chat model?
Why does BLEU score a synonym pair so low?
Is LLM-as-judge reliable enough to gate a release?
The judge-swap pattern that produces position_bias_detected is the minimum production discipline. Single-run, single-judge scores are a lie detector that sometimes lies.
What do I actually do when position bias is detected?
What's still missing even after all four lenses?
What does the grader enforce on each step?
Step 1 requires ppl to be a finite scalar with 5 < ppl < 1000 — GPT-2's sane English range. Step 2 requires bleu_score in [0, 1] and synonym_bleu < 0.3, catching a learner who accidentally computed BLEU against the identical pair. Step 3 needs ≥2 prompts, each with ≥2 models producing non-empty text. Step 4 needs judge_verdicts covering ≥2 comparisons and position_bias_detected as a proper bool — meaning you actually ran both orderings rather than hard-coding the result.