Question 1

Why move the secret out of the system prompt instead of telling the model to keep it?

Accepted Answer

A system prompt is conditioning text that the model processes in the same
context window as the user's message. There is no trust boundary between
them, so a model asked to repeat or reformat what it was given will echo its
own instructions. You cannot reliably instruct a model to keep a secret it
can read. Keeping the secret out of the context entirely, behind a tool or
vault boundary, is the control that holds.

Question 2

What is a canary token and how does it help defend an LLM app?

Accepted Answer

A canary is a unique string that is never needed to answer a customer and
never placed in the prompt. Because it has no legitimate reason to appear in
a reply, its presence in model output (in any encoding) is unambiguous proof
that a secret path leaked. It gives an output detector a high-signal trigger
and follows the deception-token pattern in MITRE ATLAS AML.T0024.

Question 3

Why does a cleartext output filter fail, and what replaces it?

Accepted Answer

A substring filter only matches cleartext, so it never sees output the model
encoded for egress: ask for the secret Base64-encoded or ROT13 and the
literal string never appears. The fix is a canonicalizing detector that
decodes Base64, ROT13, and hex candidates first, then checks the decoded form
against the canary and known secret terms. It is defense in depth on top of
keeping the secret out of the context.

Question 4

Will hardening break normal answers?

Accepted Answer

It should not. Secret isolation removes material the model never needed to
answer customers, and the output detector only fires on genuine secret terms,
so benign replies pass through untouched. The final step verifies exactly
this: a re-planted exploit is blocked and a normal customer question still
gets a real answer.

Defend: Secret Isolation for a RAG Assistant

What you'll learn

Prerequisites

Exam domains covered

Skills & technologies you'll practice

What you'll do in this lab

Frequently asked questions

Why move the secret out of the system prompt instead of telling the model to keep it?

What is a canary token and how does it help defend an LLM app?

Why does a cleartext output filter fail, and what replaces it?

Will hardening break normal answers?