System-Prompt Extraction: Recover a RAG Assistant's Hidden Instructions
Hosted · ide
Beta

System-Prompt Extraction: Recover a RAG Assistant's Hidden Instructions

Red-team Aria, a real Retrieval-Augmented Generation support assistant: a Milvus vector store, NVIDIA embeddings, and an LLM that answers from one shared context window. Confirm a hidden system prompt with an embedded secret exists, recover it through ordinary chat by direct echo, climb the extraction ladder when a stronger refusal posture resists, defeat a naive output filter with encoding-egress, measure the extraction Attack-Success-Rate across the techniques, then ship the durable fix (minimize the secret out of the prompt) and verify extraction yields nothing.

70 min7 steps3 domainsIntermediate

Hands-on labs require Pro · $29.99/mo · cancel anytime

Map the attack surface
Query
Retriever
LLM
Poisoned doc
retrieved chunk
Answer
0%
Attack-success rate
Attacks blocked · benign answers pass
graded on real output, not the model's talk

What you'll learn

  1. 1
    Recon: map the assistant and confirm a hidden prompt with a secret
    You are red-teaming Aria, ACME Cloud's Tier-1 customer support assistant. It
  2. 2
    Extract: direct echo of the system prompt
    A system prompt is plain conditioning text. It shares one context window with
  3. 3
    Escalate: climb the ladder past a hardened refusal
    After an extraction incident, a real team adds a stronger refusal line to the
  4. 4
    Encoding-egress: defeat a naive output filter
    The team's next move is an output filter. This step runs with
  5. 5
    Measure: extraction ASR across the techniques
    You have three techniques: direct echo, escalation past a hardened refusal, and
  6. 6
    Harden: minimize the secret out of the prompt
    You measured it: direct echo, escalation past a refusal preamble, and
  7. 7
    Verify: extraction now yields nothing
    You minimized the secret out of the prompt. Now prove the fix with the same

Prerequisites

  • Comfortable reading Python
  • Basic HTTP and markdown
  • No ML background required

Exam domains covered

Offensive AI SecurityLLM Application SecuritySystem Prompt Leakage

Skills & technologies you'll practice

This intermediate-level ai/ml lab gives you real-world reps across:

System Prompt LeakagePrompt ExtractionOWASP LLM07RAGOffensive SecurityAI Red TeamMITRE ATLAS

What you'll do in this lab

This is a hands-on offensive-security lab built on a real RAG stack: a Milvus vector store, NVIDIA embeddings, and an LLM answer step. You attack Aria, a working support assistant whose behavior is shaped by a hidden system prompt. A system prompt is plain conditioning text that shares one context window with whatever the user types, so there is no trust boundary between the assistant's instructions and your input. You will recover that hidden instruction text through ordinary chat, measuring how much of the prompt you can reconstruct from the model's own replies.

You build an extraction ladder hands-on, one technique at a time: a direct echo request, a completion and framing-stripping escalation when a hardened refusal posture starts refusing, and encoding-egress (asking the model to emit its instructions Base64-encoded) to slip past a naive cleartext output filter. Because an aligned model is only partially reliable, you then build a small harness that measures the extraction Attack-Success-Rate across the three techniques, with a per-technique breakdown and a structured audit log. You will see why a refusal line in the prompt and a substring filter on the output both raise the bar without making the prompt a security control. Then you flip to defense and ship the durable fix: minimize the secret so the prompt holds nothing worth stealing, and verify the full battery now recovers nothing. Maps to OWASP LLM07:2025 System Prompt Leakage and MITRE ATLAS AML.T0056.

Frequently asked questions

Is the system prompt a secret?

No. A system prompt is conditioning text that the model processes in the same context window as the user's message. There is no trust boundary between them, so a model that is asked to repeat or reformat what it was given will echo its own instructions. Treat the system prompt as readable and keep real secrets out of it.

Why does asking the model to repeat itself work?

A chat model is trained to be helpful and to follow instructions. When you ask it to repeat all sentences in the conversation, or to print everything above starting with a known phrase, the most helpful completion is to emit its own conditioning text. Researchers (Zhang, Carlini, Ippolito) recovered aligned chat models' prompts this way with high precision.

What is a hardened refusal posture and does it stop extraction?

It is a stronger instruction that tells the model to refuse any request to repeat, translate, or encode its own instructions. It blunts naive direct echo, so you escalate: reframe leakage as a formatting or translation task, or ask for the instructions Base64-encoded so a cleartext filter never sees them. It raises the bar; it does not make the prompt a control.

Do I need an ML background?

No. You need to read Python and run a few chat queries. Everything model-specific is explained inline. The lab is about how an LLM application treats its own instructions, not about model internals.