Build & submit task · Beta · Advanced

Build a Contextual Output-Encoding and Allow-List Mediator

Treat model output as untrusted input to every sink it reaches. Build a per-sink output mediator that applies the correct contextual defense before model text touches an interpreter: allow-list and validate the destination for HTTP egress (the http_fetch SSRF sink), and pass the argument in an argv array, run under least privilege with no shell, for the system command (the provision_note command-injection sink). Wire it in front of the provided deliberately vulnerable tool agent, then use the provided SSRF + command-injection proof-of-concept chain as a pass/fail oracle: the exploit must come back blocked at both sinks while a benign control still fetches and provisions. The deliverable is the mediator. Submit a project (or a single script or notebook) for instant, rubric-based feedback.

Est. time: 3 hrs

Pass score: 65%

What you'll learn

Skills you'll have real reps in after shipping this.

Model output is untrusted input to every sink

Every place model text becomes an HTTP request URL or a command argument is a sink, and each sink has its own correct control. The defense is to apply that control at the boundary instead of trusting the text on the way out.

Context decides the control

One escape does not fit all. HTTP egress needs an allow-list plus resolved-IP validation, and a system command needs its argument passed as inert argv data with no shell. A single mediator dispatches the right control per sink.
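The per-sink dispatch can be sketched as a small registry. This is a minimal illustration, not the starter kit's interface: the Mediator class, the Blocked exception, the two stand-in controls, and the provision-note argv layout are all hypothetical names chosen for the example. The one design choice it shows is failing closed, so a sink with no registered control never receives model text.

```python
from typing import Any, Callable, Dict


class Blocked(Exception):
    """Raised when a control rejects model output, or no control exists."""


class Mediator:
    """Routes each tool call to the control registered for its sink."""

    def __init__(self) -> None:
        self._controls: Dict[str, Callable[[str], Any]] = {}

    def register(self, sink: str, control: Callable[[str], Any]) -> None:
        self._controls[sink] = control

    def mediate(self, sink: str, model_text: str) -> Any:
        control = self._controls.get(sink)
        if control is None:
            # Fail closed: a sink without a control never sees model text.
            raise Blocked(f"no control registered for sink {sink!r}")
        return control(model_text)


# Stand-in controls that only illustrate the dispatch (hypothetical).
# Real controls would do the full egress validation and argv handling.
def http_control(url: str) -> str:
    if not url.startswith("https://"):
        raise Blocked("scheme not allowed")
    return url


def command_control(note: str) -> list:
    # The note stays one inert argv element; the layout is an assumption.
    return ["provision-note", "--text", note]


mediator = Mediator()
mediator.register("http_fetch", http_control)
mediator.register("provision_note", command_control)
```

Calling mediator.mediate("provision_note", text) returns an argv list in which the model text is a single element, while a call for any unregistered sink raises Blocked.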

Allow-list and resolve before you fetch

An SSRF defense that only blocks the literal string 127.0.0.1 fails on decimal IPs (2130706433), hex (0x7f000001), [::1], a userinfo@host trick, DNS rebinding, and redirects. Validate the scheme and host against an allow-list, resolve the host to an IP, and reject private, loopback, link-local, and metadata ranges with the ipaddress module before any request leaves. Because rebinding and redirects change the destination after that check, connect to the address you validated and re-validate (or refuse) every redirect hop.
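The checks described above could look roughly like the following sketch. It assumes an HTTPS-only policy and a single allow-listed host, status.example.com, which is a placeholder; check_url, EgressBlocked, and the injectable resolver parameter are hypothetical names. The function only validates and returns the vetted address. It does not perform the fetch, pin the connection, or handle redirects.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"https"}
ALLOWED_HOSTS = {"status.example.com"}  # hypothetical allow-list entry


class EgressBlocked(Exception):
    """Raised when a model-supplied URL fails an egress check."""


def system_resolver(host, port):
    """Return every address the host resolves to right now."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def is_forbidden(ip):
    """True for addresses this sink must never reach."""
    mapped = getattr(ip, "ipv4_mapped", None)  # unwrap ::ffff:a.b.c.d
    if mapped is not None:
        ip = mapped
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_reserved or ip.is_multicast or ip.is_unspecified)


def check_url(url, resolver=system_resolver):
    """Validate a URL and return a vetted IP string, or raise."""
    try:
        parts = urlsplit(url)
        host, port = parts.hostname, parts.port
    except ValueError as exc:
        raise EgressBlocked("malformed URL") from exc
    if parts.scheme not in ALLOWED_SCHEMES:
        raise EgressBlocked("scheme not allowed")
    # hostname drops userinfo, so a@b is judged by b, the real target.
    if host not in ALLOWED_HOSTS:
        raise EgressBlocked("host not on allow-list")
    addresses = resolver(host, port or 443)
    if not addresses:
        raise EgressBlocked("host did not resolve")
    for text in addresses:
        try:
            ip = ipaddress.ip_address(text)
        except ValueError as exc:
            raise EgressBlocked("unparseable address") from exc
        if is_forbidden(ip):
            raise EgressBlocked("resolved to a forbidden range")
    return addresses[0]
```

Because the host must match the allow-list exactly, numeric forms such as 2130706433, 0x7f000001, and [::1] are rejected before any resolution happens. The resolved-IP check then covers the case where an allow-listed name points at an internal address. Passing a fake resolver keeps tests offline.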

The durable fix lives at the sink

Model-side classifiers and deny-lists each fail independently because an attacker can rephrase around them. A sink-side mediator holds regardless of what the planted ticket talked the model into emitting, which is why allow-listing, resolved-IP validation, and an argv array run without a shell are the defense-in-depth controls that ship.

The scenario

You are the platform engineer who owns the boundary where an LLM application hands model output to its interpreters. It is the same app a red team already broke: a customer-support tool agent that copies the model's tool arguments into an HTTP fetch and a system command. Neither sink treats model output as untrusted, so a single planted instruction in a support ticket turns into SSRF against the internal metadata service, chained into command injection that exfiltrates the value it read. Your job is the fix, and it has to hold.

Your security lead does not want a model-side classifier that an attacker can phrase around. She wants a mediator at each sink that allow-lists, validates, and runs commands without a shell by construction, so model output cannot inject regardless of what the model was talked into saying. The acceptance test is concrete: the provided exploit chain must come back blocked at both sinks, and a benign fetch and a benign provision note must still work with no regression. That mediator, proven against the exploit oracle, is this task.

Your role

You are a defensive security engineer hardening the output boundary of an LLM application. Your goal is a project whose center of gravity is the control: a per-sink output mediator that applies allow-listing and egress validation (scheme, host, IP-literal, and resolved-IP checks) for the HTTP fetch sink, and a least-privilege argv-array handler run without a shell for the system-command sink. You then re-run a provided SSRF + command-injection exploit chain as a pass/fail oracle against a benign control set, with tests and evidence showing the exploit is blocked at both sinks while normal functionality is preserved.

Start the task to unlock the full brief

You'll get the step-by-step requirements, setup commands, the 7-criterion grading rubric, tips, and the ability to submit your solution for instant AI grading.


Learning resources

OWASP LLM05:2025 Improper Output Handling

The primary taxonomy entry: insufficient validation, sanitization, and handling of model output before it reaches downstream components.

genai.owasp.org

OWASP Top 10 for LLM Applications 2025

Full list including LLM06 Excessive Agency, the enabler that turns an output-handling flaw into real impact at the sink.

genai.owasp.org

OWASP Cheat Sheet: Server Side Request Forgery Prevention

Allow-listing, scheme restriction, and resolved-IP validation for the http_fetch egress sink.

cheatsheetseries.owasp.org

OWASP Cheat Sheet: OS Command Injection Defense

Why an argv array run without a shell is the primary defense for the provision_note command sink.

cheatsheetseries.owasp.org

MITRE ATLAS

Map the blocked classes to the relevant execution and exfiltration techniques and to mitigations (verify the current technique IDs).

atlas.mitre.org

NIST SP 800-53 Rev. 5 (SI-10 Information Input Validation)

The control-family reference for input validation and sanitization at trust boundaries.

csrc.nist.gov

What this task is

This is a build-and-submit defensive-security task, not a quiz about output handling. You produce a project whose deliverable is a control: a per-sink output mediator that treats model output as untrusted input to every interpreter it reaches. The mediator applies an allow-list with resolved-IP validation for the HTTP fetch sink (the SSRF gate) and an argv array run under least privilege with no shell for the system-command sink (the command-injection gate). A provided SSRF plus command-injection proof-of-concept chain is the pass/fail oracle: the exploit must come back blocked at both sinks, while a benign control still fetches the approved status route and records a provision note.

Insecure output handling (OWASP LLM05:2025, with LLM06 excessive agency as the enabler) is the mechanism behind real incidents where model output flowed unsanitized into a sink, including the EchoLeak zero-click exploit that encoded a user's data into a URL the client auto-loaded. The skill this task builds is the defensive counterpart: instead of phrasing an attack, you build the boundary that makes the provided attack inert by construction. A sink-side mediator holds regardless of what the planted ticket talked the model into emitting, which is why it is the durable defense-in-depth control rather than a model-side classifier an attacker can rephrase around.

Grading is rubric-based and explainable. Your submission is scored against weighted criteria (the per-sink mediator present and correct, a tested and bypass-resistant mediator that re-runs the provided exploit and blocks it while keeping benign use working, benign functionality preserved, an unmediated-versus-mediated causation contrast, a runnable hermetic harness, and a remediation rationale with standards mapping) with per-criterion feedback. The pass threshold is 65 percent and you can resubmit.

Frequently asked questions

Do I need a paid API key?

No. The whole task runs on the Python standard library: urllib, ipaddress, and socket for the egress validation, subprocess for the argv-array command sink, and http.server for the local metadata SSRF stand-in. The starter kit ships the deliberately vulnerable target and a working exploit, so you build the control and verify it offline with no key. The rubric rewards the mediator and the blocked oracle, not which model you used.
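For orientation, a local metadata stand-in built on http.server might look like the sketch below. The starter kit's own stand-in may differ: FAKE_SECRET, MetadataHandler, start_stand_in, and fetch_unmediated are hypothetical names, and the server here answers every GET path with the same canary value. It binds to loopback on an ephemeral port, so nothing leaves the machine.

```python
import http.server
import threading
import urllib.request

FAKE_SECRET = b"FAKE-METADATA-TOKEN"  # hypothetical canary, not a real secret


class MetadataHandler(http.server.BaseHTTPRequestHandler):
    """Answers every GET with the canary, like a naive metadata service."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(FAKE_SECRET)))
        self.end_headers()
        self.wfile.write(FAKE_SECRET)

    def log_message(self, format, *args):
        pass  # keep harness output quiet


def start_stand_in():
    """Start the stand-in on 127.0.0.1 with an OS-assigned port."""
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), MetadataHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_unmediated(url):
    """The vulnerable pattern: fetch whatever URL the model supplied."""
    # An empty ProxyHandler ignores proxy environment variables.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        return response.read()
```

A harness can read the assigned port from server.server_address[1], aim the exploit at it, and treat any response containing the canary as proof that the SSRF landed. Call server.shutdown() and server.server_close() when the run finishes.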

What is the actual deliverable?

The control. You build a per-sink output mediator that applies the correct defense before model output reaches an interpreter: a host allow-list plus an SSRF guard (scheme, IP-literal, and resolved-IP checks) for the HTTP fetch sink, and an argv array run with shell=False for the system-command sink. The provided SSRF plus command-injection exploit chain is only the pass/fail oracle that proves the mediator works.
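A minimal sketch of the command-sink side follows, assuming a stand-in command. The real provisioning command belongs to the starter kit, so NOTE_COMMAND here is a child Python process that writes its single argument back, and provision_note and MAX_NOTE_LENGTH are hypothetical. What it shows is that the model text travels as one argv element and no shell ever parses it.

```python
import subprocess
import sys

# Hypothetical stand-in for the real provisioning command: a child Python
# process that writes its one argument back to stdout unchanged.
NOTE_COMMAND = [sys.executable, "-c",
                "import sys; sys.stdout.write(sys.argv[1])"]
MAX_NOTE_LENGTH = 500  # assumed limit, chosen for illustration


def provision_note(note: str) -> str:
    """Record a note by passing it as a single inert argv element."""
    if len(note) > MAX_NOTE_LENGTH or "\x00" in note:
        raise ValueError("note rejected")
    completed = subprocess.run(
        NOTE_COMMAND + [note],  # a list of arguments, never a command string
        shell=False,            # no shell, so ; | $() and backticks are data
        capture_output=True,
        text=True,
        timeout=5,
        check=True,             # a failing command raises instead of passing
    )
    return completed.stdout
```

A payload such as "reset; cat /etc/passwd" comes back as the literal note text because nothing interprets the semicolon. With a real binary, also consider placing "--" before the note if the program supports it, so a note that begins with a dash is not read as an option. Least privilege (a restricted user, a minimal environment) is a separate layer this sketch does not implement.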

How is the mediator judged to work?

The provided exploit must be re-run through your mediator and shown blocked at both sinks (no SSRF reaches the internal metadata service, no injected command drops the secret), with an unmediated-versus-mediated contrast proving the control is the cause. At the same time a benign control must still fetch the approved status route and record a provision note, so you do not pass by simply blocking everything. The mediator has to be both safe and functional, and tested.
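The unmediated-versus-mediated contrast can be illustrated for the command sink with a side-effect oracle. This is a sketch under assumptions, not the provided exploit: it needs a POSIX system with echo and touch available, and the handler and oracle names are hypothetical. The same payload runs through both paths, and the only difference between them is the control.

```python
import os
import subprocess
import tempfile


def unmediated_note(note):
    """Vulnerable pattern: model text is spliced into a shell command line."""
    done = subprocess.run("echo " + note, shell=True,
                          capture_output=True, text=True, timeout=5)
    return done.stdout


def mediated_note(note):
    """Mediated pattern: model text is a single argv element, no shell."""
    done = subprocess.run(["echo", note], shell=False,
                          capture_output=True, text=True, timeout=5)
    return done.stdout


def injection_fired(handler):
    """Send an injection payload and report whether its side effect happened."""
    with tempfile.TemporaryDirectory() as workdir:
        marker = os.path.join(workdir, "pwned")
        handler("ticket 1234; touch " + marker)
        return os.path.exists(marker)


def benign_works(handler):
    """A normal note must still come through unchanged."""
    return handler("ticket 1234 closed").strip() == "ticket 1234 closed"


def oracle(handler):
    """Pass only if the exploit is inert and the benign note still works."""
    return (not injection_fired(handler)) and benign_works(handler)
```

Under these assumptions the unmediated handler creates the marker file and the mediated one does not, which is the causation evidence. A handler that blocks everything also fails the oracle, because the benign note no longer comes through.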

Why a sink-side mediator instead of a model-side filter?

Because a model-side classifier or string deny-list can be rephrased around, and incidents like EchoLeak showed input-side defenses failing independently. A sink-side mediator that allow-lists, resolves and validates the IP, and runs commands without a shell by construction holds regardless of what the model output says, which makes it the durable defense-in-depth control the platform team can ship.