Build & submit task · Beta · Intermediate

Build a Reliable, Long-Context Claude Agent with Crash Recovery

Build an agent that survives long context and failures: mitigate lost-in-the-middle, use a scratchpad for large work, recover from a crash via a manifest and exported state, propagate structured error context, route a stratified sample to human review, and track provenance to resolve source conflicts. Submit a single script or notebook for instant, rubric-based feedback.

Est. time: 3.5 hrs

Rubric criteria: 7

Pass score: 65%

What you'll learn

Skills you'll have real reps in after shipping this.

Lost-in-the-middle

Models attend best to the start and end of a long context. Place the important parts there, or retrieve and place only what matters.

Scratchpad strategy

Writing intermediate findings to external memory keeps the working context small and stops large jobs from degrading.

Crash recovery

A manifest of done-units and results lets a killed run resume from where it stopped instead of restarting.

Review and provenance

Stratified sampling gives reviewers a representative slice, and per-fact provenance lets you resolve conflicts and trace any answer.

See how it works

Lost in the middle

Where the fact sits in the prompt changes whether the model finds it.

Figure (interactive): recall accuracy by needle position on a 64k-token prompt. With the fact at the beginning of the prompt, the model recalls it 94% of the time. Numbers are representative of published needle-in-haystack benchmarks; the U-shape is robust across models.

A model recalls content at the start and end of a long context far better than content buried in the middle. Layout is a reliability decision.

Scratchpad over raw history

Short-term vs long-term memory

Context window (in the prompt, this run)

Holds this task's goal and state, recent tool results, and recent conversation. When full, older turns fall off the edge.

Speed: instant (already in prompt)

Size: bounded by the window

Lifetime: gone when the run ends

The window for now; a store for forever. Short-term memory is the context window: whatever is in the prompt on the current turn, instant to use but bounded in size and erased when the run ends. Long-term memory is an external store you read from and write to, durable and effectively unlimited, at the cost of a fetch (often a similarity search) to pull the relevant piece into the prompt. You reach for the window for the active task and the store for anything that must outlive the run or exceed its size. Either way, agent memory is mostly a database problem.

Writing findings to external memory and re-injecting a summary keeps the working context small, so long jobs do not degrade as they fill up.
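A minimal sketch of that pattern, assuming a JSON-lines file as the external memory. The `Scratchpad` class, its `note` and `summary` methods, and the `max_chars` budget are hypothetical names chosen for illustration, not part of the Anthropic SDK or the task brief.

```python
import json
import tempfile
from pathlib import Path


class Scratchpad:
    """Append-only findings log on disk (hypothetical helper, not an SDK class)."""

    def __init__(self, path):
        self.path = path

    def note(self, unit, finding):
        # One JSON object per line, so a partial write damages at most one record.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps({"unit": unit, "finding": finding}) + "\n")

    def summary(self, max_chars=2000):
        # Re-inject only a bounded digest, preferring the newest findings.
        if not self.path.exists():
            return ""
        lines = []
        for raw in self.path.read_text(encoding="utf-8").splitlines():
            rec = json.loads(raw)
            lines.append(f"- [{rec['unit']}] {rec['finding']}")
        kept = []
        used = 0
        for line in reversed(lines):
            if used + len(line) + 1 > max_chars:
                break
            kept.append(line)
            used += len(line) + 1
        return "\n".join(reversed(kept))


pad = Scratchpad(Path(tempfile.mkdtemp()) / "scratchpad.jsonl")
pad.note("chunk-001", "Contract term is 36 months.")
pad.note("chunk-002", "Renewal requires 90 days notice.")
digest = pad.summary()  # short text to place in the next prompt
```

The raw chunk text never has to re-enter the prompt: each step sees only `digest`, whose size is capped no matter how many units have been processed.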

The scenario

Your agent processes long documents and large codebases. It works on small inputs and then quietly gets worse on big ones: it misses facts buried in the middle of a long context, it loses its place when the process is killed halfway through, and when two sources disagree it picks one with no record of why. A reviewer has no idea which outputs to spot-check.

You are going to make it reliable. Lay out context so the important parts are not lost, keep a scratchpad for long work, checkpoint state so a crash can resume, propagate errors with enough context to debug, route a representative sample to humans, and track where every fact came from.
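As one illustration of "errors with enough context to debug", here is a small sketch. The `AgentStepError` class, its `context` fields, and the `parse_unit` step are assumptions made up for this example; the full brief may ask for a different shape.

```python
import json


class AgentStepError(Exception):
    """Hypothetical error type that carries the context a debugger needs."""

    def __init__(self, step, unit_id, cause, retryable):
        super().__init__(f"{step} failed on {unit_id}: {cause}")
        self.context = {
            "step": step,
            "unit_id": unit_id,
            "cause_type": type(cause).__name__,
            "cause": str(cause),
            "retryable": retryable,
        }


def parse_unit(unit_id, raw):
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        # Wrap instead of swallowing; `from exc` keeps the original traceback.
        raise AgentStepError("parse", unit_id, exc, retryable=False) from exc


def run(units):
    results, errors = {}, []
    for unit_id, raw in units.items():
        try:
            results[unit_id] = parse_unit(unit_id, raw)
        except AgentStepError as err:
            errors.append(err.context)  # structured record, not a bare string
    return results, errors


results, errors = run({"u1": '{"ok": true}', "u2": "{not json"})
```

One bad unit does not stop the run, and the caller receives a record naming the step, the unit, and whether a retry is worthwhile.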

Your role

You are a Claude solutions architect responsible for reliability. Your deliverable is one module that processes a long input dependably: lost-in-the-middle mitigation, a scratchpad, crash-safe state, structured error propagation, stratified human review, and provenance with conflict resolution.

Start the task to unlock the full brief

You'll get the step-by-step requirements, setup commands, the 7-criterion grading rubric, tips, and the ability to submit your solution for instant AI grading.

Free to start · submit when you're ready


Learning resources

Anthropic API: messages

The Messages API and context construction.

docs.anthropic.com

Long context tips

Working with long context windows on Claude.

docs.anthropic.com

Lost in the Middle (Liu et al.)

Why position in a long context affects what the model uses.

arxiv.org

What you'll build in this context-and-reliability task

This is a build-and-submit task, not a guided lab. You make a Claude agent reliable on the inputs that break naive ones: long documents, large codebases, and runs that get interrupted. The deliverable is one Python module that lays out context to avoid lost-in-the-middle, keeps a scratchpad, checkpoints state for crash recovery, and tracks where every fact came from.

The work here is what separates a demo from something you can leave running. You place important context where the model actually uses it, you offload intermediate findings to a scratchpad so the working context stays small, you checkpoint to a manifest so a killed run resumes instead of restarting, you propagate structured error context, you route a representative stratified sample to human review, and you record provenance so source conflicts are resolved with a rule and a trail.
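To make "a rule and a trail" concrete, here is a hedged sketch of provenance with conflict resolution. The `Fact` fields, the integer `priority` rank, and the "highest source priority" rule are all assumptions for illustration; recency or source type would work as rules too.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    source: str    # where the claim came from
    locator: str   # position inside the source, e.g. a page or message
    priority: int  # assumed hand-assigned trust rank; higher wins


def resolve(facts):
    """Pick one value per key and record what was rejected and why."""
    by_key = {}
    for fact in facts:
        by_key.setdefault(fact.key, []).append(fact)
    resolved = {}
    for key, candidates in by_key.items():
        # sorted() is stable, so equal priorities keep first-seen order.
        ranked = sorted(candidates, key=lambda f: f.priority, reverse=True)
        winner, losers = ranked[0], ranked[1:]
        resolved[key] = {
            "value": winner.value,
            "source": winner.source,
            "locator": winner.locator,
            "rule": "highest source priority",
            "rejected": [
                {"value": f.value, "source": f.source, "locator": f.locator}
                for f in losers
                if f.value != winner.value
            ],
        }
    return resolved


facts = [
    Fact("term_months", "24", "email-thread.txt", "msg 3", priority=1),
    Fact("term_months", "36", "signed-contract.pdf", "p. 4", priority=2),
]
resolved = resolve(facts)
```

Every answer can then be traced to a source and locator, and each conflict leaves a record of the losing value and the rule that decided it.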

Grading is rubric-based and explainable. Your submission is scored against weighted criteria (SDK integration, lost-in-the-middle, scratchpad, crash recovery, error propagation, stratified review, and provenance) with per-criterion feedback quoted from your code. The pass threshold is 65 percent and you can resubmit. These are the context-management and reliability skills the Claude Certified Architect exam tests.

Frequently asked questions

What is lost-in-the-middle and how do I mitigate it?

Models recall information at the beginning and end of a long context far better than information in the middle. You mitigate it by placing the question and the most relevant passages at the edges, or by retrieving only what matters and placing it deliberately, rather than dumping everything in original order.
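The edge-placement idea can be sketched as follows. The `build_prompt` function, the `(score, text)` passage format, and the `<document>` wrapper are illustrative assumptions; the relevance scores are presumed to come from an upstream ranking step that is not shown.

```python
def build_prompt(question, passages, top_k=2):
    """Lay out a long prompt so the highest-scoring passages sit at the edges.

    `passages` is a list of (score, text) pairs from an assumed upstream ranker.
    """
    ranked = sorted(passages, key=lambda p: p[0], reverse=True)
    top, rest = ranked[:top_k], ranked[top_k:]
    # Alternate the top passages between the head and the tail of the block,
    # so the best one opens it and the second best closes it.
    head = [text for _, text in top[0::2]]
    tail = [text for _, text in top[1::2]][::-1]
    middle = [text for _, text in rest]
    ordered = head + middle + tail
    docs = "\n\n".join(
        f'<document index="{i}">\n{text}\n</document>'
        for i, text in enumerate(ordered, start=1)
    )
    # State the question first and repeat it last, so it sits at both edges.
    return f"Question: {question}\n\n{docs}\n\nQuestion (repeated): {question}"


passages = [
    (0.2, "filler A"),
    (0.9, "The term is 36 months."),
    (0.1, "filler B"),
    (0.7, "Renewal needs 90 days notice."),
]
prompt = build_prompt("How long is the term?", passages)
```

Low-scoring passages still appear, but only in the middle. A stricter variant would drop them entirely, which corresponds to the "retrieve only what matters" option.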

What does crash recovery need?

A manifest file that records which work units are done and their results. On startup the agent loads the manifest and skips completed units, so a process killed halfway resumes instead of restarting. The task asks you to demonstrate a simulated crash and a resume.
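A minimal sketch of a manifest with a simulated crash and a resume, assuming a single JSON file as the manifest. The function names, the `crash_after` switch, and the `SimulatedCrash` exception are hypothetical and exist only to make the demo self-contained.

```python
import json
import os
import tempfile
from pathlib import Path


class SimulatedCrash(Exception):
    """Stands in for the process being killed mid-run."""


def load_manifest(path):
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"done": {}}


def save_manifest(path, manifest):
    # Write to a temp file, then rename over the real one. os.replace is
    # atomic on POSIX, so a crash should not leave a half-written manifest.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(manifest), encoding="utf-8")
    os.replace(tmp, path)


def run(units, path, process, crash_after=None):
    manifest = load_manifest(path)
    processed_now = []
    for unit in units:
        if unit in manifest["done"]:
            continue  # finished in an earlier run, skip it
        if crash_after is not None and len(processed_now) >= crash_after:
            raise SimulatedCrash(f"killed before {unit}")
        manifest["done"][unit] = process(unit)
        save_manifest(path, manifest)  # checkpoint after every unit
        processed_now.append(unit)
    return manifest, processed_now


path = Path(tempfile.mkdtemp()) / "manifest.json"
units = ["a", "b", "c", "d"]
try:
    run(units, path, str.upper, crash_after=2)
except SimulatedCrash:
    pass  # the first run "died" after two units
manifest, resumed = run(units, path, str.upper)
```

The second call processes only the units the first run never reached, because everything already in the manifest is skipped on load.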

Why stratified sampling for human review?

Reviewing the first N outputs over-samples whatever comes first. Stratified sampling draws across categories or confidence bands, so reviewers see a representative slice of the whole distribution and catch issues that a head-of-list sample would miss.
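The difference from a head-of-list sample can be sketched as follows. The `stratified_sample` helper, the confidence bands and their cut-offs, and the equal per-stratum allocation are assumptions for illustration; proportional allocation is an equally valid choice.

```python
import random
from collections import defaultdict


def stratified_sample(items, stratum_of, per_stratum, seed=0):
    rng = random.Random(seed)  # fixed seed so the review set is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    sample = []
    for name in sorted(strata):
        bucket = strata[name]
        # Small strata contribute everything they have.
        sample.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return sample


def confidence_band(output):
    c = output["confidence"]
    return "low" if c < 0.5 else "mid" if c < 0.8 else "high"


outputs = [
    {"id": i, "confidence": c}
    for i, c in enumerate([0.95, 0.91, 0.97, 0.99, 0.92, 0.93, 0.65, 0.70, 0.30, 0.45])
]
review = stratified_sample(outputs, confidence_band, per_stratum=2)
```

Here the first six outputs are all high-confidence, so a head-of-list sample of six would contain no low- or mid-confidence items, while `review` draws two from each band.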

What counts as a complete submission?

A single .py or .ipynb on the Anthropic SDK that mitigates lost-in-the-middle, uses a scratchpad, checkpoints to a manifest with a demonstrated crash and resume, propagates structured errors, routes a stratified review sample, and records provenance with conflict resolution.