Build an LLM Evaluation Harness
An LLM you can't measure is one you can't trust. Build an evaluation harness that scores model outputs with a deterministic metric and an LLM-as-judge that mitigates position bias, then prints a report with per-example and aggregate scores. Submit a single script or notebook for instant, rubric-based feedback.
Est. time: 3 hrs
Outcomes: 4
Rubric criteria: 6
Pass score: 65%
What you'll learn
Skills you'll have real reps in after shipping this.
See how it works
Position bias in LLM judges
LLM judges tend to favor an answer because of where it appears in the prompt, often the one they see first. Swapping the order and averaging the two verdicts, or randomizing the order, is how you get an honest score.
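A minimal sketch of the swap-and-average idea, in Python. The `Judge` signature, `debiased_preference`, and `biased_stub_judge` are hypothetical names invented for this illustration; the stub stands in for a real LLM call and has a deliberate first-slot bonus so the effect of averaging is visible.

```python
from typing import Callable

# Hypothetical judge signature (an assumption for this sketch):
# (question, answer_shown_first, answer_shown_second) -> preference
# for the first-shown answer, as a float in [0, 1].
Judge = Callable[[str, str, str], float]


def debiased_preference(judge: Judge, question: str,
                        answer_a: str, answer_b: str) -> float:
    """Return the preference for answer_a, averaged over both orderings."""
    a_first = judge(question, answer_a, answer_b)  # preference for A, A shown first
    b_first = judge(question, answer_b, answer_a)  # preference for B, B shown first
    # Convert the second call to "preference for A" before averaging.
    return (a_first + (1.0 - b_first)) / 2.0


def biased_stub_judge(question: str, first: str, second: str) -> float:
    """Stand-in for an LLM call: prefers the longer answer, plus a
    fixed 0.2 bonus for whichever answer occupies the first slot."""
    if len(first) == len(second):
        base = 0.5
    elif len(first) > len(second):
        base = 0.7
    else:
        base = 0.3
    return min(1.0, base + 0.2)


if __name__ == "__main__":
    naive = biased_stub_judge("q", "same", "same")
    fair = debiased_preference(biased_stub_judge, "q", "same", "same")
    print(f"naive={naive:.2f} debiased={fair:.2f}")
```

With two identical answers, the single-pass stub still leans toward the first slot, while the order-averaged score comes out at roughly 0.5. A real judge's bias will not be a constant offset, so averaging reduces the bias rather than provably removing it.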
What an eval actually measures
Deterministic metrics give a cheap, reproducible signal; an LLM judge handles open-ended quality. A real harness uses both.
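As an illustration of the deterministic side, here is a small sketch of two common metrics: normalized exact match and token-level F1. The task does not prescribe a particular metric, so the choice of these two and the function names `normalize`, `exact_match`, and `token_f1` are assumptions made for this example.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("Paris.", " paris"))
    print(round(token_f1("The capital is Paris", "Paris"), 2))
```

The same inputs always produce the same scores, which is what makes a metric like this useful as a regression signal. It cannot tell whether a longer, differently worded answer is actually better, and that is the gap the judge is meant to cover.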
The scenario
Your team keeps changing the prompt and swapping models, and every change is a coin flip: did quality go up or down? Nobody knows, because there is no harness. Decisions are made on vibes and the last demo that happened to look good.
You have been asked to build the missing piece: a small but honest evaluation harness. Given a model and a labeled dataset, it scores outputs with a deterministic metric and an LLM-as-judge, and it runs the judge correctly, accounting for the position bias that quietly corrupts naive LLM grading.
Your role
You are an AI Engineer building an evaluation harness. Your goal is a single module that scores model outputs reproducibly, combines a deterministic metric with a bias-aware LLM-as-judge, and prints a clear per-example and aggregate report.
Learning resources
What you'll build in this evaluation task
This is a build-and-submit task. You build the piece most teams skip: an evaluation harness. Given a model and a small labeled dataset, it scores outputs with a deterministic metric and an LLM-as-judge, and it runs the judge correctly by controlling for the position bias that quietly corrupts naive LLM grading. The deliverable is one Python file that turns prompt and model changes from guesswork into measurement.
The lesson is that judging is itself an engineering problem. A deterministic metric is cheap and reproducible but blind to open-ended quality. An LLM judge handles nuance but is sensitive to answer order, often favoring whichever answer it sees first. A real harness combines them and mitigates the judge's bias by swapping order and averaging, then aggregates everything into a report you can act on.
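The aggregation step described above might look something like the following sketch. `ExampleResult` and `build_report` are hypothetical names, and the plain-text table layout is an assumption; the scores are taken as already computed by the deterministic metric and the order-averaged judge.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ExampleResult:
    example_id: str
    metric_score: float  # deterministic metric, assumed to be in [0, 1]
    judge_score: float   # order-averaged judge score, assumed to be in [0, 1]


def build_report(results: list[ExampleResult]) -> str:
    """Format per-example rows followed by one row of aggregate means."""
    if not results:
        return "no results"
    lines = [f"{'id':<8}{'metric':>8}{'judge':>8}"]
    for r in results:
        lines.append(
            f"{r.example_id:<8}{r.metric_score:>8.2f}{r.judge_score:>8.2f}"
        )
    metric_mean = mean(r.metric_score for r in results)
    judge_mean = mean(r.judge_score for r in results)
    lines.append(f"{'mean':<8}{metric_mean:>8.2f}{judge_mean:>8.2f}")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        ExampleResult("q1", 1.0, 0.75),
        ExampleResult("q2", 0.0, 0.25),
    ]
    print(build_report(demo))
```

Keeping the two scores in separate columns, rather than blending them into one number, makes it easier to see when the metric and the judge disagree on an example.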
Grading is rubric-based and explainable. Your submission is scored against six weighted criteria (provider integration, labeled dataset, deterministic metric, LLM-as-judge, position-bias mitigation, and the report), and the grader returns per-criterion feedback with evidence quoted from your code. The pass threshold is 65 percent, and you can resubmit.