Defend a RAG Assistant: Build a Guardrail Layer and an Attack-Success-Rate CI Gate
You inherit DV-RAG-Support with a working EchoLeak-style exploit, and you defend it in small sequential steps. Stand the assistant up and trace one benign request, then reproduce the leak: an indirect prompt injection makes the model echo a customer's account record into a markdown image, and the renderer fires it as an outbound request that exfiltrates the data. Watch a naive host deny-list get bypassed by a renamed host, then build the durable guardrail one mechanism per step: an egress allow-list on the render sink so only approved hosts load, then output redaction of the sensitive record so even an approved-host request carries nothing. Verify the sink is closed with benign answers intact, then stand up an attack-success-rate (ASR) gate that runs a probe battery on every change: green while guardrails hold, non-zero the moment a regression re-opens the sink, green again when you back it out. Finish by proving fresh, renamed, and paraphrased payloads are all blocked.
Hands-on labs require Pro · $29.99/mo · cancel anytime
What you'll learn
- 1Stand up DV-RAG and trace one benign request through the guarded wrapperYou are the defender on DV-RAG-Support, ACME Cloud's customer-support
- 2Reproduce the attack: prove the EchoLeak sink fires before you defend itYou traced what normal looks like. Now reproduce the leak end to end, because a fix
- 3Watch the naive deny-list get bypassed by one renamed hostAn alert fired. The team saw the exfil callback to 127.0.0.1 and shipped the
- 4Control mechanism 1: an egress allow-list on the render sinkTime to build the durable control. It has two mechanisms, and you build them one
- 5Control mechanism 2: redact the sensitive record (and screen the input)The allow-list from Step 4 stops the callback to an untrusted host. It does not
- 6Verify: the EchoLeak sink is closed and benign answers are intactYou built the two control mechanisms one at a time: an egress allow-list on the
- 7Stand up the attack-success-rate gate and prove it is a real tripwireA fix that holds today can quietly stop holding tomorrow. The control you built in
- 8Resist bypass: fresh, renamed-host, and paraphrased attacks all blockedA control that holds against the one payload you tested is not yet a control. The
Prerequisites
- Comfortable reading and editing Python
- Completed (or understand) the Module 2 indirect prompt injection exploit on DV-RAG-Support
- No ML background required
Exam domains covered
Skills & technologies you'll practice
This advanced-level ai/ml lab gives you real-world reps across:
What you'll do in this lab
This is a hands-on defensive AI security lab. You harden DV-RAG-Support, a Retrieval-Augmented Generation (RAG) customer-support assistant that ships with a real, working exploit: an indirect prompt injection (OWASP LLM01) hidden in a knowledge-base document coaxes the model into leaking a customer's account record (OWASP LLM02 sensitive information disclosure) through an auto-rendered markdown image, the improper-output-handling sink (OWASP LLM05) behind the real EchoLeak exploit. You reproduce the leak first so you can prove your fix later.
Then you build the defense the way a platform team would, one mechanism per step. You watch a shallow deny-list fix get bypassed by swapping one hostname, the lesson that earns the durable control. You close the render sink with an egress allow-list so only approved hosts load, then redact the sensitive record fields in the output so even an approved-host request carries nothing, and add an input guardrail that treats retrieved context as untrusted data and strips embedded instructions. You verify both mechanisms hold with benign answers intact. Finally you stand up an attack-success-rate (ASR) gate: a battery of attack and benign probes that computes the share of attacks that still succeed and fails the build when ASR crosses a threshold, then you prove fresh, renamed, and paraphrased payloads are all blocked.