Evaluation & Benchmarking LLMs
GPU sandbox · jupyter
Beta

Four evaluation lenses in one lab: compute real perplexity, expose BLEU's blindness to paraphrase, run side-by-side model comparisons, and build an LLM-as-judge harness with position-bias detection.

45 min · 4 steps · 3 domains · Intermediate · ncp-genl · nca-genl · ncp-ads

What you'll learn

  1. Perplexity: the branching-factor intuition
  2. BLEU & ROUGE: what they measure and what they miss
  3. Side-by-side model comparison (GPT-2 vs GPT-2 Medium)
  4. LLM-as-a-Judge: the modern pattern (and its pitfalls)

Prerequisites

  • Can load a Hugging Face causal LM and run `.generate()`
  • Understanding of cross-entropy loss and log-likelihood
  • Comfortable with basic Python data structures and strings

Exam domains covered

Experimentation · LLM Fundamentals & Architecture · Prompt Engineering

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

Perplexity · BLEU · ROUGE · LLM-as-Judge · Evaluation · Benchmarking · Position Bias

What you'll evaluate in this LLM benchmarking lab

Shipping an LLM feature without a real evaluation stack is how teams end up debugging hallucinations in production. In 45 minutes you'll touch the four evaluation lenses that matter (perplexity, BLEU/ROUGE, side-by-side generations, and LLM-as-judge) and see concretely where each one fails, so you can triangulate across them instead of trusting any single score. You'll walk away with:

  • a working LLM-as-judge harness that runs every comparison twice in swapped order to flag position bias
  • a concrete demo of why BLEU scores a paraphrase as 'different' even when the meaning is identical
  • the branching-factor intuition for perplexity: exp(loss) ≈ 'the model is as uncertain as if choosing from ~N equally-likely next tokens'
  • a defensible opinion about which subset of metrics you'd actually gate a release on, plus the dimensions you're still missing (safety, grounding, latency under load, regression tests on critical prompts)

The technical substance is the specific failure modes each metric hides.

Perplexity: compute it the right way (model.eval(), labels=input_ids so Hugging Face shifts them internally, and ppl = math.exp(mean_cross_entropy)), but recognize that it tracks likelihood under the model's own distribution, which is nearly orthogonal to helpfulness in chat: a heavily RLHF'd model can have higher perplexity on a news corpus than its base.

BLEU/ROUGE: n-gram overlap metrics designed for machine-translation evaluation against reference translations. They score 'The feline rests upon the rug' against 'The cat is on the mat' as almost unrelated. The grader enforces synonym_bleu < 0.3 so you actually computed BLEU across the synonym pair, not the exact-match one.

LLM-as-judge: the most useful proxy for human preference at scale, but it carries position bias (judges favor whichever response appeared first or second), verbosity bias (longer answers win), and self-preference bias (a model rates its own style as better). The judge-swap pattern (run every pair twice, once as (A, B), once as (B, A), and flag position_bias_detected = True when any verdict flips on swap) is the minimum instrumentation that keeps a single-judge score from being pure noise. The production escalation path from there is randomized ordering + multiple judges + calibration against a small human-rated set on your domain.
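The branching-factor reading of perplexity can be sketched in a few lines of plain Python, no model required: a model that is uniformly uncertain over N next tokens has mean cross-entropy ln(N), so exp of the mean loss recovers N. The in-lab Hugging Face computation (model.eval(), labels=input_ids) reduces to the same exp(mean_cross_entropy) formula; the numbers below are illustrative, not lab values.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that is perfectly uncertain over 50 candidate next tokens:
# every token gets probability 1/50, so each log-prob is log(1/50).
uniform = [math.log(1 / 50)] * 10
print(perplexity(uniform))  # ≈ 50: as uncertain as choosing among 50 tokens

# A sharper model that puts probability 0.5 on each actual next token:
sharp = [math.log(0.5)] * 10
print(perplexity(sharp))    # ≈ 2: an effective branching factor of 2
```

This is why exp(loss) is read as a branching factor: it converts an average log-loss back into "how many equally-likely choices would feel this uncertain."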

You need to be able to load a Hugging Face causal LM, run .generate(), and understand cross-entropy loss at an intuitive level. The sandbox is a real NVIDIA GPU pod provisioned per session, with transformers, GPT-2 + GPT-2 Medium, BLEU/ROUGE implementations, and a judge-model configuration all preinstalled. Checks hold measurements to sensible ranges rather than magic numbers: ppl finite and in 5 < ppl < 1000 for GPT-2 on English, bleu_score in [0, 1], synonym_bleu < 0.3 (which catches a learner who accidentally BLEU'd the identical pair), generations spanning ≥2 prompts across ≥2 models with non-empty text, judge_verdicts containing ≥2 comparisons, and position_bias_detected a proper bool derived from swapped-order runs.

Frequently asked questions

Does perplexity predict whether users will like a chat model?

No. Perplexity measures how well the model's own distribution assigns probability to held-out text — a useful signal during pretraining, and nearly orthogonal to helpfulness after fine-tuning. A heavily-RLHF'd chat model can have higher perplexity on a news corpus than its base model because the RLHF shifted the distribution toward instruction-following. Use perplexity to catch gross training collapse; use task-level metrics (accuracy, pass@1, win-rate vs a reference, grounded-answer rate) as the real release gate.

Why does BLEU score a synonym pair so low?

Because BLEU is literally counting matching n-grams between your output and the reference, with a brevity penalty. 'The feline rests upon the rug' shares almost no n-grams with 'The cat is on the mat' despite meaning the same thing. BLEU was designed for machine-translation evaluation against a small set of reference translations, where lexical overlap was a reasonable proxy; it fails on open-ended generation where valid outputs diverge lexically. ROUGE has the same limitation (it's n-gram recall instead of precision). The Step 2 demo makes this concrete.
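The n-gram-counting mechanism is easy to see in miniature. The sketch below implements only clipped n-gram precision, the core quantity BLEU averages across orders (full BLEU adds the geometric mean over 1..4-grams and the brevity penalty; the lab uses a real BLEU implementation, not this toy):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also appear in the reference,
    with each n-gram's count clipped to its count in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    if not cand:
        return 0.0
    matches = sum(min(count, ref[g]) for g, count in cand.items())
    return matches / sum(cand.values())

reference  = "the cat is on the mat".split()
paraphrase = "the feline rests upon the rug".split()

print(clipped_precision(paraphrase, reference, 1))  # ≈ 0.33: only 'the' matches
print(clipped_precision(paraphrase, reference, 2))  # 0.0: no shared bigrams
print(clipped_precision(reference, reference, 1))   # 1.0: exact match
```

The paraphrase keeps the meaning but shares almost no surface n-grams, so every overlap-based score collapses; that is the Step 2 demo in four lines.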

Is LLM-as-judge reliable enough to gate a release?

With caveats. It's the most useful proxy for human preference at scale, but you have to treat the score as a noisy signal: randomize A/B ordering to suppress position bias, control for length to suppress verbosity bias, use multiple judge models to suppress self-preference bias, and calibrate against a small human-rated set to verify the judge tracks humans on your domain. The Step 4 pattern — run every pair twice in swapped order and flag position_bias_detected — is the minimum production discipline. Single-run single-judge scores are a lie detector that sometimes lies.

What do I actually do when position bias is detected?

Three options, increasing in rigor: randomize A/B order per query and aggregate across many comparisons (simplest); run every comparison in both orderings and only count decisive cases where both orderings agree (what the lab pattern builds toward); or use multiple judge models and majority-vote. The failure mode to avoid is a single judge + fixed ordering + a small eval set — that gives you a plausible-looking score that's largely noise.
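The both-orderings pattern is a small amount of code. In this sketch, judge_fn is a hypothetical stand-in for the real judge-model call (in the lab it would prompt the configured judge model); everything else is the swap-and-compare logic itself:

```python
def judged_winner(judge_fn, prompt, resp_a, resp_b):
    """Run the judge in both orderings; flag position bias if the verdict flips.

    judge_fn(prompt, first, second) -> "first" | "second" | "tie"
    (a hypothetical stand-in for the real judge-model call).
    """
    v1 = judge_fn(prompt, resp_a, resp_b)  # ordering (A, B)
    v2 = judge_fn(prompt, resp_b, resp_a)  # ordering (B, A)
    # Map positional verdicts back to the underlying responses.
    w1 = {"first": "A", "second": "B", "tie": "tie"}[v1]
    w2 = {"first": "B", "second": "A", "tie": "tie"}[v2]
    position_bias_detected = w1 != w2      # verdict flipped on swap
    winner = "undecided" if position_bias_detected else w1
    return winner, position_bias_detected

# A pathologically biased stub judge: always prefers whatever came first.
biased = lambda prompt, first, second: "first"
print(judged_winner(biased, "q", "answer A", "answer B"))
# → ('undecided', True)

# A consistent (if verbosity-biased) stub judge: prefers the longer response.
by_length = lambda p, first, second: "first" if len(first) > len(second) else "second"
print(judged_winner(by_length, "q", "a much longer answer", "short"))
# → ('A', False)
```

Only the decisive cases, where both orderings agree, count toward the win-rate; flipped verdicts are exactly the position-bias signal the grader expects you to surface.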

What's still missing even after all four lenses?

Safety and harm evaluation (adversarial prompts, jailbreak resistance, demographic bias probes), factual grounding against a reference corpus (hallucination rate), latency and cost at production load, regression tests on critical user prompts that must never change, and real user feedback signals. The four lenses in this lab are the quality gates; the full release gate stack also needs safety, infra, and user-signal layers. The reflection at the end asks exactly this question — what are you still not measuring.

What does the grader enforce on each step?

Step 1 requires ppl to be a finite scalar in 5 < ppl < 1000 — GPT-2's sane English range. Step 2 requires bleu_score in [0,1] and synonym_bleu < 0.3, catching a learner who accidentally computed BLEU against the identical pair. Step 3 needs ≥2 prompts, each with ≥2 models producing non-empty text. Step 4 needs judge_verdicts covering ≥2 comparisons and position_bias_detected as a proper bool — meaning you actually ran both orderings rather than hard-coding the result.
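The step-by-step checks above amount to a handful of range and shape assertions. This is an illustrative restatement with dummy stand-in values, not the grader's actual implementation:

```python
import math

# Dummy values standing in for what a learner's notebook would produce.
ppl = 29.4
bleu_score, synonym_bleu = 0.71, 0.12
generations = {
    "prompt 1": {"gpt2": "a completion", "gpt2-medium": "another completion"},
    "prompt 2": {"gpt2": "some text", "gpt2-medium": "more text"},
}
judge_verdicts = [("A", "B", "A"), ("A", "B", "B")]
position_bias_detected = True

assert math.isfinite(ppl) and 5 < ppl < 1000            # Step 1: sane GPT-2 range
assert 0 <= bleu_score <= 1 and synonym_bleu < 0.3      # Step 2: real synonym pair
assert len(generations) >= 2 and all(                   # Step 3: ≥2 prompts, ≥2 models
    len(models) >= 2 and all(text.strip() for text in models.values())
    for models in generations.values()
)
assert len(judge_verdicts) >= 2                         # Step 4: ≥2 comparisons
assert isinstance(position_bias_detected, bool)         # derived, not hard-coded
```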