Write Up an Indirect Prompt-Injection Finding
Build a self-contained proof-of-concept that exploits an LLM application through indirect prompt injection: deliver your payload via retrieved or external data, demonstrate a concrete impact (exfiltration or an unauthorized action), then write the finding the way a pentester would, with a severity rating and a remediation you prove blocks your own exploit. Submit a single script or notebook for instant, rubric-based feedback.
3 hrs
Est. time
4
Outcomes
6
Rubric criteria
65%
Pass score
What you'll learn
Skills you'll have real reps in after shipping this.
The scenario
You're on a red-team engagement against an internal LLM assistant: a small Retrieval-Augmented Generation (RAG) app that answers staff questions from a document store. The rules of engagement are simple. You cannot social-engineer employees and you cannot touch the model weights. You can get one document into the knowledge base, the same foothold an attacker gets by filing a support ticket, editing a wiki page, or sending an email the assistant later summarizes.
Your lead wants more than 'the chatbot said something weird.' She wants a reproducible proof-of-concept, a clear statement of impact, and a remediation the platform team can ship. That deliverable, an exploit plus a fix, is this task.
Your role
You are an offensive security engineer auditing an LLM application. Your goal is a single, self-contained file that proves an indirect prompt-injection exploit end to end, states its impact and severity like a professional finding, and demonstrates a remediation that defeats your own proof-of-concept.
Start the task to unlock the full brief
You'll get the step-by-step requirements, setup commands, the 6-criterion grading rubric, tips, and the ability to submit your solution for instant AI grading.
Free to start · submit when you're ready
Learning resources
What this task is
This is a build-and-submit offensive-security task, not a quiz about prompt injection. You produce a single file that proves an indirect prompt-injection exploit end to end: a small target, a payload delivered through retrieved or external data rather than the victim's prompt, a demonstrated impact such as data exfiltration, a benign baseline for contrast, and a remediation you re-run the exploit against to prove it holds.
Indirect prompt injection (OWASP LLM01, MITRE ATLAS AML.T0051) is the mechanism behind real-world incidents like EchoLeak, the first zero-click exploit against a production LLM system. The skill this task builds is the one that separates a real AI red teamer from someone who can make a chatbot say something rude: deliver a payload the victim never sees, turn it into concrete impact, and report it so the fix actually ships.
Grading is rubric-based and explainable. Your submission is scored against weighted criteria (runnable PoC, indirect delivery, demonstrated impact, benign baseline, proven remediation, and the written finding) with per-criterion feedback. The pass threshold is 65 percent and you can resubmit.