Build & submit taskBetaintermediate

Engineer a Claude Decision Prompt and Prove It with Eval

Turn a vague classification prompt into a precise one: explicit criteria that cut false positives, 2-4 targeted few-shot examples, an unbiased LLM-as-judge with position-bias mitigation and multi-instance review, all measured against a labeled set with the Batch API. Submit a single script or notebook for instant, rubric-based feedback.

3 hrs

Est. time

Outcomes

Rubric criteria

65%

Pass score

What you'll learn

Skills you'll have real reps in after shipping this.

Explicit criteria beat vague guidance

Precise, testable criteria cut false positives far more than telling the model to use its judgment.

Targeted few-shot

2-4 examples on the confusing cases teach the boundary better than many easy examples.

Unbiased judging

An LLM-as-judge favors position and verbosity. Swapping order and running multiple instances removes that bias.

Measure it

A labeled set and a false-positive rate turn 'the prompt feels better' into a number you can defend.

See how it works

Why judges need order-swapping

position bias in an LLM judge

swap the order, keep agreements

A shown 1stB shown 1stverdict

chose Bchose Bconsistent → B

chose Achose Aconsistent → A

chose Achose Bposition bias → discard

chose Bchose Bconsistent → B

chose Bchose Aposition bias → discard

chose Achose Aconsistent → A

chose Bchose Bconsistent → B

chose Achose Bposition bias → discard

Trust only the verdicts that survive a swap. Run a comparison once and the judge may simply prefer whichever answer it saw first. Run it again with the order swapped and you can tell signal from bias: if the verdict flips with position, it was bias and you throw it out; if it names the same model both times, it is a real preference. The win rate over those consistent verdicts is trustworthy, while the naive single-order rate quietly bakes the bias in. Same judge, same answers, and only the discipline changed.

An LLM-as-judge tends to favor whichever answer it sees first. Evaluating both orders and requiring a win both ways removes the position bias.

Targeted few-shot

Few-shot shifts the boundary

systemClassify the message as billing, account, or technical. Reply with only the label.

(no examples, the model only has the instruction)

I can't sign in even after resetting my password.technical ✗

The export button does nothing when I click it.technical

Cancel my subscription and refund the last charge.account ✗

Examples teach the convention. With no examples the model guesses "technical" for a login problem. A couple of labeled user→assistant turns pin the boundary (password → account), and the held-out cases snap to the right label.

A few examples on the confusing boundary cases teach the decision far better than many easy ones, and cut false positives.

The scenario

A content-moderation prompt your team wrote says 'flag anything inappropriate.' It flags half the harmless messages and misses some real ones, and nobody can say how good it actually is. When they tried an LLM-as-judge to grade it, the judge favored whichever answer was shown first.

You are going to fix the prompt and prove it. Replace vague guidance with explicit criteria, add a few targeted examples, and build an evaluation that is itself unbiased: multiple instances, swapped order, and a labeled set so you can show the false-positive rate actually dropped.

Your role

You are a Claude solutions architect responsible for prompt quality. Your deliverable is one module that engineers a precise decision prompt and measures it honestly with an unbiased LLM-as-judge.

Start the task to unlock the full brief

You'll get the step-by-step requirements, setup commands, the 7-criterion grading rubric, tips, and the ability to submit your solution for instant AI grading.

Free to start · submit when you're ready

Learning resources

Anthropic: prompt engineering

Explicit instructions and examples.

docs.anthropic.com

Anthropic Message Batches API

Running the evaluation efficiently in bulk.

docs.anthropic.com

Reducing latency and improving quality

Building evaluations for prompts.

docs.anthropic.com

What you'll build in this prompt evaluation task

This is a build-and-submit task, not a guided lab. You take a vague decision prompt and make it precise, then prove the improvement with an evaluation that is itself unbiased. The deliverable is one Python module that engineers explicit criteria and few-shot examples and measures the false-positive rate against a labeled set.

The techniques here are the ones the exam keeps returning to: explicit criteria beat vague guidance for cutting false positives, 2-4 targeted examples beat a pile of easy ones, and an LLM-as-judge has to be debiased (swapped order, multiple instances) before its scores mean anything. You run the whole evaluation efficiently with the Batch API and report the numbers.

Grading is rubric-based and explainable. Your submission is scored against weighted criteria (SDK integration, explicit criteria, few-shot, the judge, bias mitigation, batch evaluation, and the reported improvement) with per-criterion feedback quoted from your code. The pass threshold is 65 percent and you can resubmit. These are the prompt-engineering skills the Claude Certified Architect exam tests.

Frequently asked questions

How is this different from the structured-output task?

That task extracts typed data with tool_use and a schema. This one is about prompt quality for decisions: explicit criteria versus vague guidance, targeted few-shot, and an unbiased LLM-as-judge that proves the false-positive rate dropped. Different slice of the prompt-engineering domain.

Why mitigate position bias in the judge?

LLM judges favor whichever answer appears first and tend to prefer longer answers. If you do not swap the order and aggregate multiple runs, your evaluation measures the bias rather than the prompt, and you cannot trust the result.

How big does the labeled set need to be?

Small is fine. A dozen examples with known labels is enough to show the false-positive rate moving between the vague baseline and the precise prompt. The point is to measure, not to build a benchmark.

What counts as a complete submission?

A single .py or .ipynb on the Anthropic SDK with a vague baseline and a precise prompt (explicit criteria plus 2-4 targeted examples), a criteria-based LLM-as-judge with position-bias mitigation and multiple instances, a Batch API evaluation over a labeled set, and reported baseline-vs-precise numbers.