Build & submit taskBetaadvanced

Write Up an Indirect Prompt-Injection Finding

Build a self-contained proof-of-concept that exploits an LLM application through indirect prompt injection: deliver your payload via retrieved or external data, demonstrate a concrete impact (exfiltration or an unauthorized action), then write the finding the way a pentester would, with a severity rating and a remediation you prove blocks your own exploit. Submit a single script or notebook for instant, rubric-based feedback.

3 hrs

Est. time

Outcomes

Rubric criteria

65%

Pass score

What you'll learn

Skills you'll have real reps in after shipping this.

Indirect vs direct injection

The delivery channel is the whole story. A payload in retrieved data is a zero-click attack the victim never sees.

Impact-first reporting

A finding is judged on demonstrated impact, not on making the model misbehave in the abstract.

Output handling as an exfil channel

Rendered images, tool calls, and executed code turn model output into attacker-controlled actions.

Remediation you can prove

A fix only counts when you re-run the exploit against it and show it fails.

The scenario

You're on a red-team engagement against an internal LLM assistant: a small Retrieval-Augmented Generation (RAG) app that answers staff questions from a document store. The rules of engagement are simple. You cannot social-engineer employees and you cannot touch the model weights. You can get one document into the knowledge base, the same foothold an attacker gets by filing a support ticket, editing a wiki page, or sending an email the assistant later summarizes.

Your lead wants more than 'the chatbot said something weird.' She wants a reproducible proof-of-concept, a clear statement of impact, and a remediation the platform team can ship. That deliverable, an exploit plus a fix, is this task.

Your role

You are an offensive security engineer auditing an LLM application. Your goal is a single, self-contained file that proves an indirect prompt-injection exploit end to end, states its impact and severity like a professional finding, and demonstrates a remediation that defeats your own proof-of-concept.

Start the task to unlock the full brief

You'll get the step-by-step requirements, setup commands, the 6-criterion grading rubric, tips, and the ability to submit your solution for instant AI grading.

Free to start · submit when you're ready

Learning resources

OWASP LLM01: Prompt Injection

The taxonomy entry covering direct and indirect injection.

genai.owasp.org

MITRE ATLAS: AML.T0051 LLM Prompt Injection

The technique and its indirect sub-technique.

atlas.mitre.org

EchoLeak (CVE-2025-32711)

The real-world zero-click exploit this finding pattern is modeled on.

securityweek.com

What this task is

This is a build-and-submit offensive-security task, not a quiz about prompt injection. You produce a single file that proves an indirect prompt-injection exploit end to end: a small target, a payload delivered through retrieved or external data rather than the victim's prompt, a demonstrated impact such as data exfiltration, a benign baseline for contrast, and a remediation you re-run the exploit against to prove it holds.

Indirect prompt injection (OWASP LLM01, MITRE ATLAS AML.T0051) is the mechanism behind real-world incidents like EchoLeak, the first zero-click exploit against a production LLM system. The skill this task builds is the one that separates a real AI red teamer from someone who can make a chatbot say something rude: deliver a payload the victim never sees, turn it into concrete impact, and report it so the fix actually ships.

Grading is rubric-based and explainable. Your submission is scored against weighted criteria (runnable PoC, indirect delivery, demonstrated impact, benign baseline, proven remediation, and the written finding) with per-criterion feedback. The pass threshold is 65 percent and you can resubmit.

Frequently asked questions

Do I need a paid API key?

No. You can target any hosted model you have access to, or simulate a naive model deterministically and say so. The rubric rewards the exploit and the report, not which model you used.

What counts as 'indirect' injection?

The payload must reach the model through data it retrieves or ingests, a document, a tool result, a web page, an email, rather than through the prompt the victim typed. That is what makes it a zero-click attack.

How is impact graded?

You must demonstrate a real outcome, such as exfiltrating a secret the model can see or triggering an unauthorized action, with evidence in your run output, plus a benign baseline showing the same input is harmless without the payload.

Why does the task require a remediation?

Because a finding is only useful if it ships a fix. You implement a remediation and re-run your own exploit against it to prove it is blocked, the same standard a real engagement holds you to.