Evaluation & Benchmarking LLMs
GPU sandbox · jupyter
Beta

Four evaluation lenses in one lab: compute real perplexity, expose BLEU's blindness to paraphrase, run side-by-side model comparisons, and build an LLM-as-judge harness with position-bias detection.

45 min · 4 steps · 3 domains · Intermediate · ncp-genl · nca-genl · ncp-ads

What you'll learn

  1. Perplexity: the branching-factor intuition
  2. BLEU & ROUGE: what they measure and what they miss
  3. Side-by-side model comparison (GPT-2 vs GPT-2 Medium)
  4. LLM-as-a-Judge: the modern pattern (and its pitfalls)

Prerequisites

  • Can load a Hugging Face causal LM and run `.generate()`
  • Understanding of cross-entropy loss and log-likelihood
  • Comfortable with basic Python data structures and strings

Exam domains covered

Experimentation · LLM Fundamentals & Architecture · Prompt Engineering

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

Perplexity · BLEU · ROUGE · LLM-as-Judge · Evaluation · Benchmarking · Position Bias

What you'll evaluate in this LLM benchmarking lab

Shipping an LLM feature without a real evaluation stack is how teams end up debugging hallucinations in production. In 45 minutes you'll touch the four evaluation lenses that matter (perplexity, BLEU/ROUGE, side-by-side generations, and LLM-as-judge) and see concretely where each one fails, so you can triangulate across them instead of trusting any single score. You'll walk away with:

  • a working LLM-as-judge harness that runs every comparison twice in swapped order to flag position bias
  • a concrete demo of why BLEU scores a paraphrase as 'different' even when the meaning is identical
  • the branching-factor intuition for perplexity: exp(loss) ≈ 'the model is as uncertain as if choosing from ~N equally-likely next tokens'
  • a defensible opinion about which subset of metrics you'd actually gate a release on, plus the dimensions you're still missing (safety, grounding, latency under load, regression tests on critical prompts)

The technical substance is the specific failure modes each metric hides.

Perplexity: compute it the right way (model.eval(), labels=input_ids so Hugging Face shifts them internally, and ppl = math.exp(mean_cross_entropy)), but recognize that it tracks likelihood under the model's own distribution, which is nearly orthogonal to helpfulness in chat: a heavily RLHF'd model can have higher perplexity on a news corpus than its base.

BLEU/ROUGE: n-gram overlap metrics designed for machine-translation evaluation against reference translations. They score 'The feline rests upon the rug' against 'The cat is on the mat' as almost unrelated. The grader enforces synonym_bleu < 0.3 so you actually computed BLEU across the synonym pair, not the exact-match one.

LLM-as-judge: the most useful proxy for human preference at scale, but it carries position bias (judges favor whichever response appeared first or second), verbosity bias (longer answers win), and self-preference bias (a model rates its own style as better). The judge-swap pattern (run every pair twice, once as (A, B), once as (B, A), and flag position_bias_detected = True when any verdict flips on swap) is the minimum instrumentation that keeps a single-judge score from being pure noise. The production escalation path from there is randomized ordering + multiple judges + calibration against a small human-rated set on your domain.
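The branching-factor reading of perplexity can be sketched in a few lines of plain Python, no model required: a model that is uniformly uncertain over N next tokens has mean cross-entropy ln(N), so exp of the mean loss recovers N. The in-lab Hugging Face computation (model.eval(), labels=input_ids) reduces to the same exp(mean_cross_entropy) formula; the numbers below are illustrative, not lab values.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that is perfectly uncertain over 50 candidate next tokens:
# every token gets probability 1/50, so each log-prob is log(1/50).
uniform = [math.log(1 / 50)] * 10
print(perplexity(uniform))  # ≈ 50: as uncertain as choosing among 50 tokens

# A sharper model that puts probability 0.5 on each actual next token:
sharp = [math.log(0.5)] * 10
print(perplexity(sharp))    # ≈ 2: an effective branching factor of 2
```

This is why exp(loss) is read as a branching factor: it converts an average log-loss back into "how many equally-likely choices would feel this uncertain."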

You need to be able to load a Hugging Face causal LM, run .generate(), and understand cross-entropy loss at an intuitive level. The sandbox is a real NVIDIA GPU pod provisioned per session, with transformers, GPT-2 + GPT-2 Medium, BLEU/ROUGE implementations, and a judge-model configuration all preinstalled. Checks hold measurements to sensible ranges rather than magic numbers: ppl finite and in 5 < ppl < 1000 for GPT-2 on English, bleu_score in [0, 1], synonym_bleu < 0.3 (which catches a learner who accidentally BLEU'd the identical pair), generations spanning ≥2 prompts across ≥2 models with non-empty text, judge_verdicts containing ≥2 comparisons, and position_bias_detected a proper bool derived from swapped-order runs.

Frequently asked questions

Does perplexity predict whether users will like a chat model?

No. Perplexity measures how well the model's own distribution assigns probability to held-out text — a useful signal during pretraining, and nearly orthogonal to helpfulness after fine-tuning. A heavily-RLHF'd chat model can have higher perplexity on a news corpus than its base model because the RLHF shifted the distribution toward instruction-following. Use perplexity to catch gross training collapse; use task-level metrics (accuracy, pass@1, win-rate vs a reference, grounded-answer rate) as the real release gate.

Why does BLEU score a synonym pair so low?

Because BLEU is literally counting matching n-grams between your output and the reference, with a brevity penalty. 'The feline rests upon the rug' shares almost no n-grams with 'The cat is on the mat' despite meaning the same thing. BLEU was designed for machine-translation evaluation against a small set of reference translations, where lexical overlap was a reasonable proxy; it fails on open-ended generation where valid outputs diverge lexically. ROUGE has the same limitation (it's n-gram recall instead of precision). The Step 2 demo makes this concrete.
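The n-gram-counting mechanism is easy to see in miniature. The sketch below implements only clipped n-gram precision, the core quantity BLEU averages across orders (full BLEU adds the geometric mean over 1..4-grams and the brevity penalty; the lab uses a real BLEU implementation, not this toy):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also appear in the reference,
    with each n-gram's count clipped to its count in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    if not cand:
        return 0.0
    matches = sum(min(count, ref[g]) for g, count in cand.items())
    return matches / sum(cand.values())

reference  = "the cat is on the mat".split()
paraphrase = "the feline rests upon the rug".split()

print(clipped_precision(paraphrase, reference, 1))  # ≈ 0.33: only 'the' matches
print(clipped_precision(paraphrase, reference, 2))  # 0.0: no shared bigrams
print(clipped_precision(reference, reference, 1))   # 1.0: exact match
```

The paraphrase keeps the meaning but shares almost no surface n-grams, so every overlap-based score collapses; that is the Step 2 demo in four lines.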

Is LLM-as-judge reliable enough to gate a release?

With caveats. It's the most useful proxy for human preference at scale, but you have to treat the score as a noisy signal: randomize A/B ordering to suppress position bias, control for length to suppress verbosity bias, use multiple judge models to suppress self-preference bias, and calibrate against a small human-rated set to verify the judge tracks humans on your domain. The Step 4 pattern — run every pair twice in swapped order and flag position_bias_detected — is the minimum production discipline. Single-run single-judge scores are a lie detector that sometimes lies.

What do I actually do when position bias is detected?

Three options, increasing in rigor: randomize A/B order per query and aggregate across many comparisons (simplest); run every comparison in both orderings and only count decisive cases where both orderings agree (what the lab pattern builds toward); or use multiple judge models and majority-vote. The failure mode to avoid is a single judge + fixed ordering + a small eval set — that gives you a plausible-looking score that's largely noise.
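The both-orderings pattern is a small amount of code. In this sketch, judge_fn is a hypothetical stand-in for the real judge-model call (in the lab it would prompt the configured judge model); everything else is the swap-and-compare logic itself:

```python
def judged_winner(judge_fn, prompt, resp_a, resp_b):
    """Run the judge in both orderings; flag position bias if the verdict flips.

    judge_fn(prompt, first, second) -> "first" | "second" | "tie"
    (a hypothetical stand-in for the real judge-model call).
    """
    v1 = judge_fn(prompt, resp_a, resp_b)  # ordering (A, B)
    v2 = judge_fn(prompt, resp_b, resp_a)  # ordering (B, A)
    # Map positional verdicts back to the underlying responses.
    w1 = {"first": "A", "second": "B", "tie": "tie"}[v1]
    w2 = {"first": "B", "second": "A", "tie": "tie"}[v2]
    position_bias_detected = w1 != w2      # verdict flipped on swap
    winner = "undecided" if position_bias_detected else w1
    return winner, position_bias_detected

# A pathologically biased stub judge: always prefers whatever came first.
biased = lambda prompt, first, second: "first"
print(judged_winner(biased, "q", "answer A", "answer B"))
# → ('undecided', True)

# A consistent (if verbosity-biased) stub judge: prefers the longer response.
by_length = lambda p, first, second: "first" if len(first) > len(second) else "second"
print(judged_winner(by_length, "q", "a much longer answer", "short"))
# → ('A', False)
```

Only the decisive cases, where both orderings agree, count toward the win-rate; flipped verdicts are exactly the position-bias signal the grader expects you to surface.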

What's still missing even after all four lenses?

Safety and harm evaluation (adversarial prompts, jailbreak resistance, demographic bias probes), factual grounding against a reference corpus (hallucination rate), latency and cost at production load, regression tests on critical user prompts that must never change, and real user feedback signals. The four lenses in this lab are the quality gates; the full release gate stack also needs safety, infra, and user-signal layers. The reflection at the end asks exactly this question — what are you still not measuring.

What does the grader enforce on each step?

Step 1 requires ppl to be a finite scalar in 5 < ppl < 1000 — GPT-2's sane English range. Step 2 requires bleu_score in [0,1] and synonym_bleu < 0.3, catching a learner who accidentally computed BLEU against the identical pair. Step 3 needs ≥2 prompts, each with ≥2 models producing non-empty text. Step 4 needs judge_verdicts covering ≥2 comparisons and position_bias_detected as a proper bool — meaning you actually ran both orderings rather than hard-coding the result.
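The step-by-step checks above amount to a handful of range and shape assertions. This is an illustrative restatement with dummy stand-in values, not the grader's actual implementation:

```python
import math

# Dummy values standing in for what a learner's notebook would produce.
ppl = 29.4
bleu_score, synonym_bleu = 0.71, 0.12
generations = {
    "prompt 1": {"gpt2": "a completion", "gpt2-medium": "another completion"},
    "prompt 2": {"gpt2": "some text", "gpt2-medium": "more text"},
}
judge_verdicts = [("A", "B", "A"), ("A", "B", "B")]
position_bias_detected = True

assert math.isfinite(ppl) and 5 < ppl < 1000            # Step 1: sane GPT-2 range
assert 0 <= bleu_score <= 1 and synonym_bleu < 0.3      # Step 2: real synonym pair
assert len(generations) >= 2 and all(                   # Step 3: ≥2 prompts, ≥2 models
    len(models) >= 2 and all(text.strip() for text in models.values())
    for models in generations.values()
)
assert len(judge_verdicts) >= 2                         # Step 4: ≥2 comparisons
assert isinstance(position_bias_detected, bool)         # derived, not hard-coded
```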