Build & submit task · Beta · Intermediate

Build a Headless Claude Code Review Bot for CI

Automate code review with Claude in CI: run it headlessly against a diff, return a structured pass/fail verdict the pipeline can parse, review from an independent session rather than letting the authoring session review itself, and gate the build on the result. Submit a single script for instant, rubric-based feedback.

3 hrs

Est. time

Outcomes

Rubric criteria

65%

Pass score

What you'll learn

Skills you'll have real reps in after shipping this.

Headless Claude in CI

claude -p with --output-format json (or the Agent SDK) gives the pipeline a parseable result instead of prose to scrape.

Structured verdicts

CI can act on a JSON verdict with a pass/fail decision and severity-tagged issues. It cannot act on a wall of text.

Independent review

A fresh context reviewing a change catches what the session that wrote it will not. Self-review rubber-stamps.

Build gating

Exiting non-zero on a failing verdict turns the review into a real gate, not advisory output.
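The gating idea can be sketched in a few lines of Python. This is a minimal sketch, not the task's reference solution: the verdict shape (a `decision` field plus `issues` carrying a `severity`) and the severity names are assumptions made for illustration.

```python
import sys

# Hypothetical severity scale; the task brief may define its own.
BLOCKING_SEVERITIES = {"critical", "high"}


def exit_code_for(verdict: dict) -> int:
    """Return 0 when the build may proceed, 1 when it must be blocked.

    Assumes a verdict shaped like:
    {"decision": "pass" | "fail",
     "issues": [{"severity": "...", "message": "..."}]}
    """
    if verdict.get("decision") != "pass":
        # Anything other than an explicit pass fails closed.
        return 1
    # A "pass" that still lists a blocking issue is treated as a fail.
    for issue in verdict.get("issues", []):
        if issue.get("severity") in BLOCKING_SEVERITIES:
            return 1
    return 0


def gate(verdict: dict) -> None:
    """Exit the process with the gating code so the CI runner sees it."""
    sys.exit(exit_code_for(verdict))
```

Failing closed on a missing or malformed decision is a design choice: a review that could not be understood should not let the build through.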

The scenario

Reviews on your team are inconsistent and slow, and an earlier attempt at automating them piped Claude's prose output into grep, so the pipeline could never reliably tell pass from fail. Worse, the same session that wrote a change was asked to review it, so it rubber-stamped its own work.

You are going to build a real review bot: it runs Claude headlessly against the diff, returns a structured verdict the CI can act on, reviews from a fresh independent context, and blocks the build when it finds a real problem.

Your role

You are a Claude solutions architect automating code review. Your deliverable is one script that runs Claude headlessly on a change, produces a structured verdict, reviews independently, and gates the build.


Learning resources

Claude Code headless / CI

Running Claude Code with -p and structured output.

docs.anthropic.com

Claude Agent SDK

Driving Claude programmatically for automation.

docs.anthropic.com

Claude Code best practices

Independent review and automation patterns.

anthropic.com

What you'll build in this Claude review bot task

This is a build-and-submit task, not a guided lab. You build a real code-review bot on Claude: it runs headlessly against a diff in CI, returns a structured pass/fail verdict the pipeline can act on, reviews from an independent context, and blocks the build when it finds a genuine problem. The deliverable is one script you could wire into a pipeline.
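One way to picture the structured verdict is as a small validated data type. The sketch below assumes a hypothetical schema (`decision`, `issues`, `severity`, `message`) and a hypothetical four-level severity scale; neither comes from the task brief. It rejects anything that is not well-formed JSON in that shape, so prose never reaches the pipeline.

```python
import json
from dataclasses import dataclass

ALLOWED_DECISIONS = {"pass", "fail"}
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}  # assumed scale


@dataclass(frozen=True)
class Issue:
    severity: str
    message: str


@dataclass(frozen=True)
class Verdict:
    decision: str
    issues: tuple


def parse_verdict(text: str) -> Verdict:
    """Parse the model's verdict; raise ValueError on anything unexpected."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"verdict is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")

    decision = data.get("decision")
    if not isinstance(decision, str) or decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unexpected decision: {decision!r}")

    raw_issues = data.get("issues", [])
    if not isinstance(raw_issues, list):
        raise ValueError("issues must be a list")

    issues = []
    for raw in raw_issues:
        if not isinstance(raw, dict):
            raise ValueError("each issue must be a JSON object")
        severity = raw.get("severity")
        if not isinstance(severity, str) or severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"unexpected severity: {severity!r}")
        issues.append(Issue(severity=severity, message=str(raw.get("message", ""))))
    return Verdict(decision=decision, issues=tuple(issues))
```

Raising on a malformed verdict, rather than guessing, keeps the pipeline from treating an unparseable answer as a pass.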

The patterns here are the ones that make AI review trustworthy instead of theatrical. You stop scraping prose and demand a structured verdict, you feed the actual change rather than a summary, you review from a fresh session so the bot is not grading its own homework, and you gate the build on severity so the review has teeth.
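Feeding the actual change might look like the following sketch. It assumes the job runs inside a git checkout and that `origin/main` is the base branch. The size budget, the `<diff>` delimiters, and the verdict keys named in the prompt are all hypothetical choices, not requirements from the task.

```python
import subprocess

MAX_DIFF_CHARS = 200_000  # hypothetical budget; tune to your context limits


def collect_diff(base_ref: str = "origin/main") -> str:
    """Return the unified diff of HEAD against its merge base with base_ref."""
    completed = subprocess.run(
        ["git", "diff", f"{base_ref}...HEAD"],
        capture_output=True,
        text=True,
        check=True,  # a failed git call should fail the job, not yield an empty review
    )
    return completed.stdout


def build_review_prompt(diff: str) -> str:
    """Wrap the raw diff in a prompt that asks for a JSON-only verdict."""
    if not diff.strip():
        raise ValueError("empty diff: nothing to review")
    if len(diff) > MAX_DIFF_CHARS:
        raise ValueError("diff too large to review in one pass")
    return (
        "Review the following unified diff. Respond with only a JSON object "
        'with keys "decision" ("pass" or "fail") and "issues" '
        '(a list of objects with "severity" and "message").\n\n'
        "<diff>\n" + diff + "\n</diff>"
    )
```

Passing the diff verbatim, instead of a description of it, means the reviewer sees exactly the lines that will be merged.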

Grading is rubric-based and explainable. Your submission is scored against six weighted criteria (headless invocation, the structured verdict, independent review, build gating, real-change input, and team standards) with per-criterion feedback quoted from your code. The pass threshold is 65% and you can resubmit. These are the Claude Code workflow-automation skills the Claude Certified Architect exam tests.

Frequently asked questions

Claude Code CLI or the Agent SDK?

Either. You can shell out to claude -p with --output-format json, or drive Claude programmatically with the Agent SDK. The rubric rewards the headless invocation, the structured verdict, and the gating, not which path you took.
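For the CLI path, shelling out could look roughly like this. It is a sketch under assumptions: it expects the `claude` binary on PATH, and it assumes the JSON envelope exposes the final text under a `result` key and signals failure with `is_error`, which you should confirm against the current CLI documentation. The `runner` parameter is a hypothetical seam added here so the function can be exercised without the CLI installed.

```python
import json
import subprocess


def run_headless(prompt: str, stdin_text: str = "", runner=subprocess.run) -> str:
    """Run `claude -p` once and return the model's final text.

    The prompt goes on the command line; bulky input such as a diff
    can be supplied on stdin via stdin_text.
    """
    completed = runner(
        ["claude", "-p", prompt, "--output-format", "json"],
        input=stdin_text,
        capture_output=True,
        text=True,
    )
    if completed.returncode != 0:
        raise RuntimeError(
            f"claude exited with {completed.returncode}: {completed.stderr.strip()}"
        )
    envelope = json.loads(completed.stdout)
    # Assumed envelope fields; verify against the CLI docs for your version.
    if envelope.get("is_error"):
        raise RuntimeError("claude reported an error in its JSON envelope")
    return envelope["result"]
```

The returned string is still the model's own text, so it needs a second parse and validation step before the pipeline trusts it as a verdict.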

Why independent review instead of self-review?

The session that wrote a change carries its own blind spots into the review and tends to rubber-stamp the result. A fresh context whose only job is to review the diff against the standards catches issues the author missed. Making the review independent is the point.
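One way to keep the review independent on the CLI path is to start a new headless session for every review and refuse any flag that would reattach to an earlier conversation. The sketch below assumes the CLI's `--continue`/`-c` and `--resume`/`-r` flags are the only ways a prior session gets reused. The function name and the guard itself are illustrative, not part of any official API.

```python
# Flags that reattach to an existing conversation instead of starting fresh.
SESSION_REUSE_FLAGS = {"-c", "--continue", "-r", "--resume"}


def build_review_command(prompt: str, extra_args=()) -> list:
    """Build a `claude -p` command that always starts a fresh session."""
    for arg in extra_args:
        flag = arg.split("=", 1)[0]  # also catches the --resume=<id> form
        if flag in SESSION_REUSE_FLAGS:
            raise ValueError(
                f"{flag} would reuse an earlier session; reviews must start fresh"
            )
    return ["claude", "-p", prompt, "--output-format", "json", *extra_args]
```

The prompt matters as much as the flags: the reviewer should receive only the diff and the standards, not the authoring session's transcript or its reasoning about why the change is correct.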

Do I need a real repo and CI?

No. Two sample diffs, one passing and one failing, and a local invocation that exits non-zero on failure are enough to demonstrate the bot and its gating behavior.

What counts as a complete submission?

A single script that invokes Claude headlessly on a diff, returns a parsed JSON verdict with severity-tagged issues, runs as an independent review, gates the build via exit code, enforces standards from a committed file, and demonstrates a passing and a failing diff.
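Enforcing standards from a committed file can be sketched as below. The filename `REVIEW_STANDARDS.md`, the `<standards>` delimiters, and the instruction wording are all hypothetical. The point is only that the rules live in the repository and are loaded on every run, not hard-coded in the prompt.

```python
from pathlib import Path

STANDARDS_PATH = "REVIEW_STANDARDS.md"  # hypothetical filename, committed to the repo


def load_standards(path: str = STANDARDS_PATH) -> str:
    """Read the team standards; a missing file raises FileNotFoundError."""
    standards = Path(path).read_text(encoding="utf-8").strip()
    if not standards:
        raise ValueError(f"{path} is empty; refusing to review without standards")
    return standards


def reviewer_instructions(standards: str) -> str:
    """Compose reviewer instructions that embed the standards verbatim."""
    return (
        "You are reviewing a diff you did not write. Judge it only against "
        "the team standards below and report each violation with a severity.\n\n"
        "<standards>\n" + standards + "\n</standards>"
    )
```

Because the standards file is versioned with the code, a change to the review rules goes through the same review as any other change.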