System-Prompt Extraction: Recover a RAG Assistant's Hidden Instructions
Red-team Aria, a real Retrieval-Augmented Generation support assistant: a Milvus vector store, NVIDIA embeddings, and an LLM that answers from one shared context window. Confirm a hidden system prompt with an embedded secret exists, recover it through ordinary chat by direct echo, climb the extraction ladder when a stronger refusal posture resists, defeat a naive output filter with encoding-egress, measure the extraction Attack-Success-Rate across the techniques, then ship the durable fix (minimize the secret out of the prompt) and verify extraction yields nothing.
Hands-on labs require Pro · $29.99/mo · cancel anytime
What you'll learn
- 1Recon: map the assistant and confirm a hidden prompt with a secretYou are red-teaming Aria, ACME Cloud's Tier-1 customer support assistant. It
- 2Extract: direct echo of the system promptA system prompt is plain conditioning text. It shares one context window with
- 3Escalate: climb the ladder past a hardened refusalAfter an extraction incident, a real team adds a stronger refusal line to the
- 4Encoding-egress: defeat a naive output filterThe team's next move is an output filter. This step runs with
- 5Measure: extraction ASR across the techniquesYou have three techniques: direct echo, escalation past a hardened refusal, and
- 6Harden: minimize the secret out of the promptYou measured it: direct echo, escalation past a refusal preamble, and
- 7Verify: extraction now yields nothingYou minimized the secret out of the prompt. Now prove the fix with the same
Prerequisites
- Comfortable reading Python
- Basic HTTP and markdown
- No ML background required
Exam domains covered
Skills & technologies you'll practice
This intermediate-level ai/ml lab gives you real-world reps across:
What you'll do in this lab
This is a hands-on offensive-security lab built on a real RAG stack: a Milvus vector store, NVIDIA embeddings, and an LLM answer step. You attack Aria, a working support assistant whose behavior is shaped by a hidden system prompt. A system prompt is plain conditioning text that shares one context window with whatever the user types, so there is no trust boundary between the assistant's instructions and your input. You will recover that hidden instruction text through ordinary chat, measuring how much of the prompt you can reconstruct from the model's own replies.
You build an extraction ladder hands-on, one technique at a time: a direct echo request, a completion and framing-stripping escalation when a hardened refusal posture starts refusing, and encoding-egress (asking the model to emit its instructions Base64-encoded) to slip past a naive cleartext output filter. Because an aligned model is only partially reliable, you then build a small harness that measures the extraction Attack-Success-Rate across the three techniques, with a per-technique breakdown and a structured audit log. You will see why a refusal line in the prompt and a substring filter on the output both raise the bar without making the prompt a security control. Then you flip to defense and ship the durable fix: minimize the secret so the prompt holds nothing worth stealing, and verify the full battery now recovers nothing. Maps to OWASP LLM07:2025 System Prompt Leakage and MITRE ATLAS AML.T0056.