Defend: Secret Isolation for a RAG Assistant
Hosted · ide
Beta

Defend: Secret Isolation for a RAG Assistant

Harden the same RAG support assistant that the extraction lab broke, in small sequential steps. A live signing key, an internal build id, and a canary token are baked into the system prompt, so the secret is exposed by construction: it shares one context window with the customer's message. Stand the service up, reproduce the exposure (the secret is present in the model's context), and watch a naive cleartext output filter fall to encoding-egress. Then build the durable control one mechanism per step: a vault boundary that holds the secret out of the model's context, a seeded canary tripwire, and a fail-closed decoding leak detector that matches the secret and its Base64/ROT13/hex forms. Verify the secret is unrecoverable and benign answers are intact, then resist a re-planted, re-encoded exfil battery.

80 min8 steps3 domainsAdvanced

Hands-on labs require Pro · $29.99/mo · cancel anytime

Map the attack surface
Query
Retriever
LLM
Poisoned doc
retrieved chunk
Answer
0%
Attack-success rate
Attacks blocked · benign answers pass
graded on real output, not the model's talk

What you'll learn

  1. 1
    Stand up Aria and trace one benign request
    You own Aria, ACME Cloud's Tier-1 support assistant, after a red team
  2. 2
    Reproduce: the secret is exposed by construction
    The problem is in dvrag.py. At startup it reads support_secrets.env and bakes
  3. 3
    Naive fix bypassed: a cleartext filter falls to encoding-egress
    The team's first reaction is the obvious one: scrub the output. They shipped
  4. 4
    Mechanism 1: the vault boundary (secret out of the model context)
    Time to build the durable control. It has three mechanisms, and you build them one
  5. 5
    Mechanism 2: a seeded canary tripwire
    The secret is out of the context now, so prompt extraction recovers a clean prompt.
  6. 6
    Mechanism 3: a fail-closed decoding leak detector
    You have the vault boundary (mechanism 1) and the canary (mechanism 2). The last
  7. 7
    Verify: secret unrecoverable, leaks fail-closed, benign answers intact
    You built three mechanisms: the vault boundary keeps the secret out of the context,
  8. 8
    Resist: re-planted and encoded exfil attempts all blocked
    A control that only blocks the exact payload you tested is not a control. This step

Prerequisites

  • Comfortable reading and editing Python
  • Basic HTTP, markdown, and Base64/ROT13 encoding
  • Helpful to have seen a prompt-extraction attack first

Exam domains covered

Defensive AI SecurityLLM Application SecuritySensitive Information Disclosure

Skills & technologies you'll practice

This advanced-level ai/ml lab gives you real-world reps across:

Sensitive Information DisclosureSystem Prompt LeakageSecret IsolationCanary TokenOutput FilteringOWASP LLM02OWASP LLM07RAGDefensive SecurityAI Red TeamMITRE ATLAS

What you'll do in this lab

This is a hands-on defensive-security lab built on a real RAG stack: a Milvus vector store, NVIDIA embeddings, and an LLM answer step. You harden Aria, a working support assistant whose system prompt carries real secret material: a live signing key, an internal build identifier, and a canary token. Because the system prompt and the customer's message share one context window, a prompt-extraction request reads the secret straight back. You start by reproducing that leak with the attacker's own exploit, so the control you build is measured against a real bypass and not a toy one.

You ship the obvious fix first, a cleartext output filter, and watch encoding-egress slip past it when the model emits the secret Base64-encoded. Then you build the durable control by hand: secret isolation that keeps the key and canary out of the model's context entirely behind a tool/vault boundary, a unique canary token as an unambiguous tripwire, and a canonicalizing output leak detector that decodes Base64, ROT13, and hex before it blocks. The final step re-plants a fresh exploit each run and confirms it is blocked while a normal answer still works. Maps to OWASP LLM02:2025 Sensitive Information Disclosure and LLM07:2025 System Prompt Leakage, and MITRE ATLAS AML.T0056 / AML.T0057.

Frequently asked questions

Why move the secret out of the system prompt instead of telling the model to keep it?

A system prompt is conditioning text that the model processes in the same context window as the user's message. There is no trust boundary between them, so a model asked to repeat or reformat what it was given will echo its own instructions. You cannot reliably instruct a model to keep a secret it can read. Keeping the secret out of the context entirely, behind a tool or vault boundary, is the control that holds.

What is a canary token and how does it help defend an LLM app?

A canary is a unique string that is never needed to answer a customer and never placed in the prompt. Because it has no legitimate reason to appear in a reply, its presence in model output (in any encoding) is unambiguous proof that a secret path leaked. It gives an output detector a high-signal trigger and follows the deception-token pattern in MITRE ATLAS AML.T0024.

Why does a cleartext output filter fail, and what replaces it?

A substring filter only matches cleartext, so it never sees output the model encoded for egress: ask for the secret Base64-encoded or ROT13 and the literal string never appears. The fix is a canonicalizing detector that decodes Base64, ROT13, and hex candidates first, then checks the decoded form against the canary and known secret terms. It is defense in depth on top of keeping the secret out of the context.

Will hardening break normal answers?

It should not. Secret isolation removes material the model never needed to answer customers, and the output detector only fires on genuine secret terms, so benign replies pass through untouched. The final step verifies exactly this: a re-planted exploit is blocked and a normal customer question still gets a real answer.