Engineer a Claude Decision Prompt and Prove It with Eval
Turn a vague classification prompt into a precise one: explicit criteria that cut false positives, 2-4 targeted few-shot examples, an unbiased LLM-as-judge with position-bias mitigation and multi-instance review, all measured against a labeled set with the Batch API. Submit a single script or notebook for instant, rubric-based feedback.
3 hrs
Est. time
4
Outcomes
7
Rubric criteria
65%
Pass score
What you'll learn
Skills you'll have real reps in after shipping this.
See how it works
Why judges need order-swapping
An LLM-as-judge tends to favor whichever answer it sees first. Evaluating both orders and requiring a win both ways removes the position bias.
Targeted few-shot
A few examples on the confusing boundary cases teach the decision far better than many easy ones, and cut false positives.
The scenario
A content-moderation prompt your team wrote says 'flag anything inappropriate.' It flags half the harmless messages and misses some real ones, and nobody can say how good it actually is. When they tried an LLM-as-judge to grade it, the judge favored whichever answer was shown first.
You are going to fix the prompt and prove it. Replace vague guidance with explicit criteria, add a few targeted examples, and build an evaluation that is itself unbiased: multiple instances, swapped order, and a labeled set so you can show the false-positive rate actually dropped.
Your role
You are a Claude solutions architect responsible for prompt quality. Your deliverable is one module that engineers a precise decision prompt and measures it honestly with an unbiased LLM-as-judge.
Start the task to unlock the full brief
You'll get the step-by-step requirements, setup commands, the 7-criterion grading rubric, tips, and the ability to submit your solution for instant AI grading.
Free to start · submit when you're ready
Learning resources
What you'll build in this prompt evaluation task
This is a build-and-submit task, not a guided lab. You take a vague decision prompt and make it precise, then prove the improvement with an evaluation that is itself unbiased. The deliverable is one Python module that engineers explicit criteria and few-shot examples and measures the false-positive rate against a labeled set.
The techniques here are the ones the exam keeps returning to: explicit criteria beat vague guidance for cutting false positives, 2-4 targeted examples beat a pile of easy ones, and an LLM-as-judge has to be debiased (swapped order, multiple instances) before its scores mean anything. You run the whole evaluation efficiently with the Batch API and report the numbers.
Grading is rubric-based and explainable. Your submission is scored against weighted criteria (SDK integration, explicit criteria, few-shot, the judge, bias mitigation, batch evaluation, and the reported improvement) with per-criterion feedback quoted from your code. The pass threshold is 65 percent and you can resubmit. These are the prompt-engineering skills the Claude Certified Architect exam tests.