Question 1

Is the system prompt a secret?

Accepted Answer

No. A system prompt is conditioning text that the model processes in the
same context window as the user's message. There is no trust boundary
between them, so a model that is asked to repeat or reformat what it was
given will echo its own instructions. Treat the system prompt as readable
and keep real secrets out of it.

Question 2

Why does asking the model to repeat itself work?

Accepted Answer

A chat model is trained to be helpful and to follow instructions. When you
ask it to repeat all sentences in the conversation, or to print everything
above starting with a known phrase, the most helpful completion is to emit
its own conditioning text. Researchers (Zhang, Carlini, Ippolito) recovered
aligned chat models' prompts this way with high precision.

Question 3

What is a hardened refusal posture and does it stop extraction?

Accepted Answer

It is a stronger instruction that tells the model to refuse any request to
repeat, translate, or encode its own instructions. It blunts naive direct
echo, so you escalate: reframe leakage as a formatting or translation task,
or ask for the instructions Base64-encoded so a cleartext filter never sees
them. It raises the bar; it does not make the prompt a control.

Question 4

Do I need an ML background?

Accepted Answer

No. You need to read Python and run a few chat queries. Everything
model-specific is explained inline. The lab is about how an LLM application
treats its own instructions, not about model internals.

System-Prompt Extraction: Recover a RAG Assistant's Hidden Instructions

What you'll learn

Prerequisites

Exam domains covered

Skills & technologies you'll practice

What you'll do in this lab

Frequently asked questions

Is the system prompt a secret?

Why does asking the model to repeat itself work?

What is a hardened refusal posture and does it stop extraction?

Do I need an ML background?