Chain an Output-Handling Flaw to a Real Sink, Then Ship the Fix
Build a self-contained proof-of-concept that exploits insecure output handling: a planted indirect instruction makes a model's output flow into a real sink (an HTTP client, a SQL engine, a shell, or a markdown renderer) and produce concrete impact (SSRF reads an internal secret, a command or query runs, or a record exfiltrates through a rendered image). Add a benign baseline, apply a sink-side remediation, re-run the exploit to prove it is blocked with no regression, and write the finding the way a pentester would. Submit a single script or notebook for instant, rubric-based feedback.
3 hrs
Est. time
4
Outcomes
7
Rubric criteria
65%
Pass score
What you'll learn
Skills you'll have real reps in after shipping this.
The scenario
You're on a red-team engagement against an internal LLM system. It is either a tool-using agent (DV-ToolAgent-shaped: it fetches URLs, queries a database, and runs a transform helper) or a RAG assistant whose chat UI renders the model's markdown. The rules of engagement: you cannot touch the weights and you cannot social-engineer staff. You can plant one piece of content the system later reads, the same foothold an attacker gets by filing a support ticket, editing a wiki page, or sending an email the assistant summarizes.
Your lead does not want 'the model said something weird.' She wants a reproducible proof-of-concept that drives model output into a real interpreter, a clear statement of impact, and a remediation the platform team can ship and that you have proven blocks your own exploit. That deliverable, an output-handling chain plus a sink-side fix, is this task.
Your role
You are an offensive security engineer auditing an LLM application's output handling. Your goal is a single, self-contained file that proves an output-handling exploit end to end: a planted indirect instruction flows through model output into a real sink and produces concrete impact, then a sink-side remediation defeats your own proof-of-concept without breaking normal use.
Start the task to unlock the full brief
You'll get the step-by-step requirements, setup commands, the 7-criterion grading rubric, tips, and the ability to submit your solution for instant AI grading.
Free to start · submit when you're ready
Learning resources
What this task is
This is a build-and-submit offensive-security task, not a quiz about output handling. You produce a single file that chains an output-handling flaw to a real sink end to end: a planted indirect instruction drives model output into an HTTP client, a SQL engine, a shell, a code interpreter, or a markdown renderer, and produces concrete impact such as SSRF reading an internal secret or a record exfiltrating through a rendered image. You add a benign baseline, apply a sink-side remediation, and re-run your own exploit to prove it is blocked with no regression.
Insecure output handling (OWASP LLM05:2025, with LLM06 excessive agency as the enabler) is the mechanism behind real incidents like EchoLeak, the first zero-click exploit against a production LLM system, which encoded a user's own data into a markdown image URL the client auto-loaded. The skill this task builds is the one that separates a real AI red teamer from someone who can make a chatbot say something rude: shape the arguments a model passes to a tool, drive the output into an interpreter, turn it into impact, and report it so the sink-side fix actually ships.
Grading is rubric-based and explainable. Your submission is scored against weighted criteria (runnable PoC, the output sink identified, demonstrated impact, a benign baseline, a proven remediation, a minimal sink-side fix, and the written finding) with per-criterion feedback. The pass threshold is 65 percent and you can resubmit.