Build & submit taskBetaadvanced

Chain an Output-Handling Flaw to a Real Sink, Then Ship the Fix

Build a self-contained proof-of-concept that exploits insecure output handling: a planted indirect instruction makes a model's output flow into a real sink (an HTTP client, a SQL engine, a shell, or a markdown renderer) and produce concrete impact (SSRF reads an internal secret, a command or query runs, or a record exfiltrates through a rendered image). Add a benign baseline, apply a sink-side remediation, re-run the exploit to prove it is blocked with no regression, and write the finding the way a pentester would. Submit a single script or notebook for instant, rubric-based feedback.

3 hrs

Est. time

Outcomes

Rubric criteria

65%

Pass score

What you'll learn

Skills you'll have real reps in after shipping this.

Model output is somebody's input

Every place model text becomes markup, a shell command, a SQL string, an HTTP request, or another tool's arguments is a sink. The bug is trusting that text on the way out.

Shape the arguments, do not jailbreak the model

You influence the arguments a model passes to a tool it already calls. Each action looks legitimate, so aligned models comply and refusal rates stay low.

Rendering is a request

An auto-loaded markdown image turns model output into an attacker-controlled outbound GET, the EchoLeak zero-click exfil channel.

The fix lives at the sink

EchoLeak's input classifier, link redaction, and CSP each failed independently. The durable controls are sink-side: allow-list, parameterize, least-privilege tools, and a safe-markdown subset.

The scenario

You're on a red-team engagement against an internal LLM system. It is either a tool-using agent (DV-ToolAgent-shaped: it fetches URLs, queries a database, and runs a transform helper) or a RAG assistant whose chat UI renders the model's markdown. The rules of engagement: you cannot touch the weights and you cannot social-engineer staff. You can plant one piece of content the system later reads, the same foothold an attacker gets by filing a support ticket, editing a wiki page, or sending an email the assistant summarizes.

Your lead does not want 'the model said something weird.' She wants a reproducible proof-of-concept that drives model output into a real interpreter, a clear statement of impact, and a remediation the platform team can ship and that you have proven blocks your own exploit. That deliverable, an output-handling chain plus a sink-side fix, is this task.

Your role

You are an offensive security engineer auditing an LLM application's output handling. Your goal is a single, self-contained file that proves an output-handling exploit end to end: a planted indirect instruction flows through model output into a real sink and produces concrete impact, then a sink-side remediation defeats your own proof-of-concept without breaking normal use.

Start the task to unlock the full brief

You'll get the step-by-step requirements, setup commands, the 7-criterion grading rubric, tips, and the ability to submit your solution for instant AI grading.

Free to start · submit when you're ready

Learning resources

OWASP LLM05:2025 Improper Output Handling

The primary taxonomy entry for this task.

genai.owasp.org

OWASP LLM06:2025 Excessive Agency

Why over-broad tools turn an output-handling flaw into real impact.

genai.owasp.org

MITRE ATLAS

Map your chain to the execution and exfiltration tactics (verify the current technique IDs).

atlas.mitre.org

EchoLeak (CVE-2025-32711)

The real-world zero-click render-exfil exploit this task is modeled on.

securityweek.com

What this task is

This is a build-and-submit offensive-security task, not a quiz about output handling. You produce a single file that chains an output-handling flaw to a real sink end to end: a planted indirect instruction drives model output into an HTTP client, a SQL engine, a shell, a code interpreter, or a markdown renderer, and produces concrete impact such as SSRF reading an internal secret or a record exfiltrating through a rendered image. You add a benign baseline, apply a sink-side remediation, and re-run your own exploit to prove it is blocked with no regression.

Insecure output handling (OWASP LLM05:2025, with LLM06 excessive agency as the enabler) is the mechanism behind real incidents like EchoLeak, the first zero-click exploit against a production LLM system, which encoded a user's own data into a markdown image URL the client auto-loaded. The skill this task builds is the one that separates a real AI red teamer from someone who can make a chatbot say something rude: shape the arguments a model passes to a tool, drive the output into an interpreter, turn it into impact, and report it so the sink-side fix actually ships.

Grading is rubric-based and explainable. Your submission is scored against weighted criteria (runnable PoC, the output sink identified, demonstrated impact, a benign baseline, a proven remediation, a minimal sink-side fix, and the written finding) with per-criterion feedback. The pass threshold is 65 percent and you can resubmit.

Frequently asked questions

Do I need a paid API key?

No. The starter kit runs offline on the standard library alone with a deterministic naive-model stub, so you can finish with no key. You can also target any model you can reach, or build your own harness. The rubric rewards the chain and the fix, not which model you used.

What counts as an output sink?

Any interpreter the model's output flows into: an HTTP client (SSRF), a SQL engine, a shell or code interpreter, or a markdown renderer that auto-loads images. The bug is the application trusting model output and passing it to that interpreter unsanitized.

How is impact graded?

You must demonstrate a real outcome with evidence in your run output: a secret exfiltrated to a local listener, an SSRF reaching an internal/metadata stand-in, a command or SQL statement running, or a record dropped, plus a benign baseline showing the same input is harmless without the payload.

Why does the task require a sink-side remediation?

Because EchoLeak proved model-side defenses (input classifiers, link redaction, CSP) each failed independently. A finding is only useful if it ships a durable fix at the sink: an allow-list, a parameterized query, a removed capability, or a sanitizer, and you re-run your own exploit against it to prove it holds.