Build & submit task · Beta · Intermediate

Build an LLM Evaluation Harness

An LLM you can't measure is one you can't trust. Build an evaluation harness that scores model outputs with a deterministic metric and an LLM-as-judge that mitigates position bias, then prints a report with per-example and aggregate scores. Submit a single script or notebook for instant, rubric-based feedback.

Est. time: 3 hrs · Rubric criteria: 6 · Pass score: 65%

What you'll learn

Skills you'll have real reps in after shipping this.

Deterministic metrics

Exact-match and keyword scoring give cheap, reproducible signal.

LLM-as-judge

Use a model to grade open-ended outputs against a reference with a clear rubric.

Position-bias mitigation

LLM judges often favor whichever option they see first; randomize the order, or swap it and average, to control for this.

Reporting

Aggregate per-example scores into a summary stakeholders can act on.

See how it works

Position bias in LLM judges

Figure: position bias in an LLM judge (swap the order, keep agreements)

A shown 1st    B shown 1st    verdict
chose B        chose B        consistent → B
chose A        chose A        consistent → A
chose A        chose B        position bias → discard
chose B        chose B        consistent → B
chose B        chose A        position bias → discard
chose A        chose A        consistent → A
chose B        chose B        consistent → B
chose A        chose B        position bias → discard

Trust only the verdicts that survive a swap. Run a comparison once and the judge may simply prefer whichever answer it saw first. Run it again with the order swapped and you can tell signal from bias: if the verdict flips with position, it was bias and you throw it out; if it names the same model both times, it is a real preference. The win rate over those consistent verdicts is trustworthy, while the naive single-order rate quietly bakes the bias in. Same judge, same answers, and only the discipline changed.

LLM judges systematically favor whichever answer they see first. Swapping order and averaging, or randomizing, is how you get an honest score.
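The swap-and-keep-agreements rule can be sketched in a few lines of Python. This is a minimal illustration, not the task's reference solution: the names `PAIRS`, `consistent_verdicts`, and `win_rate` are hypothetical, and the eight verdict pairs simply mirror the rows of the figure above.

```python
from collections import Counter

# Each pair is (verdict with A shown first, verdict with B shown first).
# Verdicts name the winning model, not the position. These eight pairs
# mirror the rows of the figure; they are illustrative, not real results.
PAIRS = [
    ("B", "B"), ("A", "A"), ("A", "B"), ("B", "B"),
    ("B", "A"), ("A", "A"), ("B", "B"), ("A", "B"),
]


def consistent_verdicts(pairs):
    """Keep a verdict only when both orderings name the same model."""
    return [first for first, second in pairs if first == second]


def win_rate(verdicts, model):
    """Fraction of verdicts won by `model`; None if there are no verdicts."""
    if not verdicts:
        return None
    return Counter(verdicts)[model] / len(verdicts)


kept = consistent_verdicts(PAIRS)
naive = [first for first, _ in PAIRS]  # single-order run, A always shown first

print(f"kept {len(kept)} of {len(PAIRS)} comparisons")           # kept 5 of 8 comparisons
print(f"B win rate, consistent only: {win_rate(kept, 'B'):.2f}")  # 0.60
print(f"B win rate, naive single order: {win_rate(naive, 'B'):.2f}")  # 0.50
```

On this toy data the naive single-order rate and the consistent-only rate disagree, which is the effect the paragraph above describes. Discarding flipped verdicts shrinks the sample, so it is worth reporting how many comparisons were kept.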

What an eval actually measures

the four eval shapes

LLM-as-judge

A strong model grades the output against a rubric. Flexible and scalable, but biased (position, length, self-preference); de-bias it before trusting it.

Cheap-and-broad, then expensive-and-true. The four ways to score an LLM trade cost against how well they capture real quality. Exact-match is free but only fits closed-answer tasks. A classifier scales a specific property cheaply. An LLM judge is flexible and scalable but biased. Humans are the truth and the bottleneck. You do not pick one; you build a pyramid: cheap graders run on everything, a judge on a large sample, and humans on a small calibrated slice that keeps the cheaper layers honest.

Deterministic metrics give cheap reproducible signal; an LLM judge handles open-ended quality. A real harness uses both.
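As a rough sketch of the deterministic side, exact-match and keyword scoring might look like the following. The function names and the normalization rules (lowercasing, stripping punctuation) are assumptions made for illustration; the task brief may specify something different, and this keyword scorer only handles single-word keywords.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def exact_match(output: str, reference: str) -> float:
    """1.0 if the normalized output equals the normalized reference, else 0.0."""
    return 1.0 if normalize(output) == normalize(reference) else 0.0


def keyword_score(output: str, keywords: list[str]) -> float:
    """Fraction of single-word keywords present as whole words in the output."""
    if not keywords:
        return 0.0
    tokens = set(normalize(output).split())
    hits = sum(1 for kw in keywords if normalize(kw) in tokens)
    return hits / len(keywords)


print(exact_match("Paris.", " paris"))        # 1.0
print(exact_match("Paris, France", "Paris"))  # 0.0
print(keyword_score("The judge swaps the order.", ["swaps", "order", "average"]))
```

Both functions return the same score for the same input on every run, which is what makes them cheap to rerun after each prompt or model change.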

The scenario

Your team keeps changing the prompt and swapping models, and every change is a coin flip: did quality go up or down? Nobody knows, because there is no harness. Decisions are made on vibes and the last demo that happened to look good.

You have been asked to build the missing piece: a small but honest evaluation harness. Given a model and a labeled dataset, it scores outputs with a deterministic metric and an LLM-as-judge, and it does the judge correctly, accounting for the position bias that quietly corrupts naive LLM grading.

Your role

You are an AI Engineer building an evaluation harness. Your goal is a single module that scores model outputs reproducibly, combines a deterministic metric with a bias-aware LLM-as-judge, and prints a clear per-example and aggregate report.


Learning resources

OpenAI evals & judging guidance

Patterns for evaluating model outputs.

platform.openai.com

Anthropic: building evals

Designing and running evaluations.

docs.anthropic.com

Judging LLM-as-a-Judge (MT-Bench)

Position bias and other LLM-judge failure modes.

arxiv.org

What you'll build in this evaluation task

This is a build-and-submit task. You build the piece most teams skip: an evaluation harness. Given a model and a small labeled dataset, it scores outputs with a deterministic metric and an LLM-as-judge, and it runs the judge correctly by controlling for the position bias that quietly corrupts naive LLM grading. The deliverable is one Python file that turns prompt and model changes from guesswork into measurement.

The lesson is that judging is itself an engineering problem. A deterministic metric is cheap and reproducible but blind to open-ended quality. An LLM judge handles nuance but tends to favor whichever answer it sees first. A real harness combines them, mitigates the judge's bias by swapping order and averaging, and then aggregates everything into a report you can act on.
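One way to picture swap-and-average is the sketch below. It assumes a judge that returns a preference for the first answer shown, as a number between 0 and 1; that signature, the `QUALITY` table, and the stub judge with a fixed first-position bonus are all hypothetical stand-ins for a real model call.

```python
from typing import Callable

# Assumed signature: judge(question, first, second) returns the preference
# for the FIRST answer shown, in [0, 1].
Judge = Callable[[str, str, str], float]

# Hypothetical hidden quality of each answer, used only by the stub judge.
QUALITY = {"Paris": 1.0, "Lyon": 0.0}


def biased_stub_judge(question: str, first: str, second: str) -> float:
    """Stub judge: true preference plus a fixed bonus for the first position."""
    true_pref = 0.5 + 0.25 * (QUALITY[first] - QUALITY[second])
    return true_pref + 0.25  # the +0.25 models position bias


def debiased_preference(judge: Judge, question: str, a: str, b: str) -> float:
    """Preference for `a` over `b`, averaged over both presentation orders."""
    a_first = judge(question, a, b)  # preference for a when a is shown first
    b_first = judge(question, b, a)  # preference for b when b is shown first
    return (a_first + (1.0 - b_first)) / 2.0


q = "What is the capital of France?"
print(biased_stub_judge(q, "Paris", "Lyon"))                         # 1.0 (inflated)
print(debiased_preference(biased_stub_judge, q, "Paris", "Lyon"))   # 0.75
```

In this stub the bias is purely additive, so averaging the two orders cancels it exactly. A real judge's bias is unlikely to be that clean, which is why the text says averaging removes most of the effect rather than all of it.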

Grading is rubric-based and explainable. Your submission is scored against weighted criteria (provider integration, labeled dataset, deterministic metric, LLM-as-judge, position-bias mitigation, and the report) and returns per-criterion feedback with evidence quoted from your code. The pass threshold is 65 percent and you can resubmit.

Frequently asked questions

Do I need a real model to run this?

You need a model for the generation and the judge, but a small inexpensive one is perfect, and you can wire a swappable stub so the harness runs offline during development. The rubric rewards the harness design and the bias mitigation, not which model you used.
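A swappable stub can be as simple as treating the model as any function from prompt to completion. The sketch below assumes that shape; `Model`, `make_stub_model`, and `generate_all` are hypothetical names, and a real provider client would be wrapped in a function with the same signature.

```python
from typing import Callable, Dict, List

# Assumption: any callable mapping a prompt to a completion can be the model.
Model = Callable[[str], str]


def make_stub_model(canned: Dict[str, str], default: str = "") -> Model:
    """Return an offline model that replays canned completions."""
    def stub(prompt: str) -> str:
        return canned.get(prompt, default)
    return stub


def generate_all(model: Model, prompts: List[str]) -> List[str]:
    """The harness depends only on the Model signature, not on a provider."""
    return [model(p) for p in prompts]


stub = make_stub_model({"Capital of France?": "Paris"}, default="I don't know")
print(generate_all(stub, ["Capital of France?", "Capital of Peru?"]))
# ['Paris', "I don't know"]
```

Because the harness never imports a provider SDK directly, swapping the stub for a real model is a one-line change at the call site.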

What is position bias and why does it matter?

When an LLM judges two responses, it tends to prefer whichever appears first, regardless of quality. If you do not control for it, your eval scores reflect ordering as much as quality. Swapping the order and averaging, or randomizing, removes most of the effect.
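The randomizing variant might look like the sketch below. It assumes a judge that answers "first" or "second"; that return convention and the two stub judges are hypothetical. Randomizing does not fix any single comparison, but it keeps the bias from consistently helping one model across the dataset.

```python
import random
from typing import Callable

# Assumed signature: judge(question, first, second) returns "first" or "second".
PositionalJudge = Callable[[str, str, str], str]


def judge_randomized(judge: PositionalJudge, question: str,
                     a: str, b: str, rng: random.Random) -> str:
    """Show a and b in random order; map the positional verdict back to 'a' or 'b'."""
    a_is_first = rng.random() < 0.5
    first, second = (a, b) if a_is_first else (b, a)
    picked_first = judge(question, first, second) == "first"
    return "a" if picked_first == a_is_first else "b"


def always_first(question: str, first: str, second: str) -> str:
    """Worst-case stub: pure position bias, no signal."""
    return "first"


def prefers_x(question: str, first: str, second: str) -> str:
    """Unbiased stub: always picks the answer 'x', wherever it appears."""
    return "first" if first == "x" else "second"


rng = random.Random(0)  # seeded so the run is reproducible
biased = [judge_randomized(always_first, "q", "x", "y", rng) for _ in range(1000)]
honest = [judge_randomized(prefers_x, "q", "x", "y", rng) for _ in range(1000)]
print(biased.count("a") / len(biased))  # close to 0.5: the bias washes out
print(honest.count("a") / len(honest))  # 1.0: a real preference survives
```

Seeding the random generator keeps the harness reproducible, which matters if the report is compared across runs.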

How big should the dataset be?

Tiny. Three to five labeled examples are enough to demonstrate both metrics, the judge, the bias mitigation, and the report, while keeping the run fast and cheap.

What counts as a complete submission?

A single .py or .ipynb with a labeled dataset, a deterministic metric, an LLM-as-judge that mitigates position bias, and a printed report with per-example and aggregate scores, demonstrated end to end.
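For the report itself, a minimal sketch of the per-example and aggregate output could look like this. The `Result` fields, the column layout, and the three placeholder scores are assumptions for illustration; they are not results from a real run or the rubric's required format.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Result:
    example_id: str
    exact_match: float  # deterministic metric, 0.0 or 1.0
    judge_score: float  # bias-mitigated judge score in [0, 1]


def format_report(results: List[Result]) -> str:
    """Render per-example rows followed by an aggregate (mean) row."""
    lines = [f"{'id':<8}{'exact':>8}{'judge':>8}"]
    for r in results:
        lines.append(f"{r.example_id:<8}{r.exact_match:>8.2f}{r.judge_score:>8.2f}")
    lines.append("-" * 24)
    lines.append(
        f"{'mean':<8}{mean(r.exact_match for r in results):>8.2f}"
        f"{mean(r.judge_score for r in results):>8.2f}"
    )
    return "\n".join(lines)


# Placeholder scores for three hypothetical examples.
RESULTS = [
    Result("ex1", 1.0, 0.75),
    Result("ex2", 0.0, 0.50),
    Result("ex3", 1.0, 1.00),
]
print(format_report(RESULTS))
```

`statistics.mean` raises an error on an empty list, so a fuller harness would guard against running the report with no results.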