Defend a RAG Assistant: Build a Guardrail Layer and an Attack-Success-Rate CI Gate
Hosted · IDE
Beta

You inherit DV-RAG-Support with a working EchoLeak-style exploit, and you defend it in small sequential steps. Stand the assistant up and trace one benign request, then reproduce the leak: an indirect prompt injection makes the model echo a customer's account record into a markdown image, and the renderer fires it as an outbound request that exfiltrates the data. Watch a naive host deny-list get bypassed by a renamed host, then build the durable guardrail one mechanism per step: an egress allow-list on the render sink so only approved hosts load, then output redaction of the sensitive record so even an approved-host request carries nothing. Verify the sink is closed with benign answers intact, then stand up an attack-success-rate (ASR) gate that runs a probe battery on every change: green while guardrails hold, non-zero the moment a regression re-opens the sink, green again when you back it out. Finish by proving fresh, renamed, and paraphrased payloads are all blocked.

85 min · 8 steps · 3 domains · Advanced

[Diagram: Map the attack surface. Nodes: Query, Retriever, LLM, Answer; a poisoned doc arrives as a retrieved chunk. Attack-success rate: 0% (attacks blocked, benign answers pass), graded on real output, not the model's talk.]

What you'll learn

  1. Stand up DV-RAG and trace one benign request through the guarded wrapper
     You are the defender on DV-RAG-Support, ACME Cloud's customer-support...
  2. Reproduce the attack: prove the EchoLeak sink fires before you defend it
     You traced what normal looks like. Now reproduce the leak end to end, because a fix...
  3. Watch the naive deny-list get bypassed by one renamed host
     An alert fired. The team saw the exfil callback to 127.0.0.1 and shipped the...
  4. Control mechanism 1: an egress allow-list on the render sink
     Time to build the durable control. It has two mechanisms, and you build them one...
  5. Control mechanism 2: redact the sensitive record (and screen the input)
     The allow-list from Step 4 stops the callback to an untrusted host. It does not...
  6. Verify: the EchoLeak sink is closed and benign answers are intact
     You built the two control mechanisms one at a time: an egress allow-list on the...
  7. Stand up the attack-success-rate gate and prove it is a real tripwire
     A fix that holds today can quietly stop holding tomorrow. The control you built in...
  8. Resist bypass: fresh, renamed-host, and paraphrased attacks all blocked
     A control that holds against the one payload you tested is not yet a control. The...

Prerequisites

  • Comfortable reading and editing Python
  • Completed (or understand) the Module 2 indirect prompt injection exploit on DV-RAG-Support
  • No ML background required

Exam domains covered

Defensive AI Security · LLM Application Security · Red-Team Automation

Skills & technologies you'll practice

This advanced-level AI/ML lab gives you real-world reps across:

Guardrails · Defensive AI Security · Attack Success Rate · CI Security Gate · Indirect Prompt Injection · Improper Output Handling · OWASP LLM01 · OWASP LLM05 · AI Red Team

What you'll do in this lab

This is a hands-on defensive AI security lab. You harden DV-RAG-Support, a Retrieval-Augmented Generation (RAG) customer-support assistant that ships with a real, working exploit. An indirect prompt injection (OWASP LLM01) hidden in a knowledge-base document coaxes the model into leaking a customer's account record (OWASP LLM02, sensitive information disclosure) through an auto-rendered markdown image. That auto-render step is the improper-output-handling sink (OWASP LLM05) behind the real EchoLeak exploit. You reproduce the leak first so you can prove your fix later.

Then you build the defense the way a platform team would, one mechanism per step. You watch a shallow deny-list fix get bypassed by swapping one hostname, which is the lesson that motivates the durable control. You close the render sink with an egress allow-list so only approved hosts load, then redact the sensitive record fields in the output so even an approved-host request carries nothing. You also add an input guardrail that treats retrieved context as untrusted data and strips embedded instructions. You verify that both mechanisms hold with benign answers intact. Finally you stand up an attack-success-rate (ASR) gate: a battery of attack and benign probes that computes the share of attacks that still succeed and fails the build when ASR crosses a threshold. You finish by proving that fresh, renamed, and paraphrased payloads are all blocked.
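The redaction mechanism described above can be illustrated with a minimal sketch. It is not the lab's implementation: the function name `redact_record`, the record fields, and the `[REDACTED:...]` placeholder format are all assumptions made for illustration. The idea is that the guardrail knows which record values are sensitive and removes them from the model's output, in both plain and URL-encoded form, before anything is rendered.

```python
from typing import Dict, Iterable
from urllib.parse import quote


def redact_record(text: str, record: Dict[str, str], sensitive_fields: Iterable[str]) -> str:
    """Replace sensitive record values in model output before it is rendered.

    Hypothetical helper: field names and placeholder format are illustrative.
    """
    for field in sensitive_fields:
        value = str(record.get(field, ""))
        if not value:
            continue
        # Check the URL-encoded form first, then the plain form, because an
        # exfiltration URL usually carries the value percent-encoded.
        for form in (quote(value, safe=""), value):
            text = text.replace(form, f"[REDACTED:{field}]")
    return text


# Illustrative record; these fields are assumptions, not the lab's schema.
record = {"account_id": "ACCT-9912", "email": "jane@example.com", "plan": "Pro"}
leaky = "![s](https://static.acme.example/p.png?e=jane%40example.com&id=ACCT-9912) Your plan is Pro."
safe = redact_record(leaky, record, ["account_id", "email"])
```

Matching on known values rather than on generic patterns keeps benign text such as the plan name untouched. A real control would likely need to handle further encodings (for example Base64) that this sketch ignores.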

Frequently asked questions

What is an LLM guardrail layer?

A guardrail layer is code that wraps a model call and inspects what goes in and what comes out. An input guardrail screens the prompt and any retrieved context for injected instructions before the model sees them; an output guardrail screens the response for policy violations (here, exfiltration images and leaked record fields) before any downstream action like rendering runs. In this lab you build both around an existing vulnerable assistant rather than rewriting the model.
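As a rough sketch of that wrapper shape, assuming a model exposed as a plain callable that takes a query and a list of retrieved chunks: the names `guarded_answer`, `screen_context`, and `screen_output` are hypothetical, and the regular expressions are deliberately simplistic. The output screen here bluntly removes every markdown image, whereas the lab's output control is narrower (an allow-list plus redaction).

```python
import re
from typing import Callable, List

# Hypothetical pattern; a real input guardrail needs much broader coverage.
INSTRUCTION_RE = re.compile(
    r"(?i)\b(ignore (all |any )?(previous|prior) instructions|system prompt)\b"
)
MD_IMAGE_RE = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def screen_context(chunks: List[str]) -> List[str]:
    """Input guardrail: drop lines in retrieved chunks that look like instructions."""
    cleaned = []
    for chunk in chunks:
        kept = [line for line in chunk.splitlines() if not INSTRUCTION_RE.search(line)]
        cleaned.append("\n".join(kept))
    return cleaned


def screen_output(text: str) -> str:
    """Output guardrail: remove markdown images before any renderer sees them."""
    return MD_IMAGE_RE.sub("[image removed]", text)


def guarded_answer(
    model: Callable[[str, List[str]], str], query: str, chunks: List[str]
) -> str:
    """Wrap the model call: screen what goes in, then screen what comes out."""
    return screen_output(model(query, screen_context(chunks)))
```

The point of the shape is that the vulnerable model call stays untouched; both screens live in the wrapper, so each can be tested on its own with a stub model.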

What is an attack-success-rate (ASR) gate?

ASR is the fraction of attack attempts that achieve the attacker's goal. An ASR gate runs a fixed battery of attack probes (which must fail) and benign probes (which must still work) on every change, computes ASR, and exits non-zero when ASR rises above a set threshold. Wiring it into CI means a future edit that re-opens the leak breaks the build instead of shipping silently. It is the automation that turns "we fixed it once" into "it stays fixed."
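A minimal sketch of such a gate follows, under stated assumptions: the assistant is a callable from prompt to output text, and each probe pairs a prompt with a predicate over the real output. The names `attack_success_rate` and `run_gate` and the default threshold of zero are hypothetical, not taken from the lab.

```python
from typing import Callable, List, Tuple

# A probe is (prompt, predicate). For attack probes the predicate returns True
# when the attack SUCCEEDED; for benign probes it returns True when the answer
# is still correct. Both inspect the actual output text.
Probe = Tuple[str, Callable[[str], bool]]


def attack_success_rate(assistant: Callable[[str], str], attack_probes: List[Probe]) -> float:
    """Fraction of attack probes whose goal was achieved."""
    if not attack_probes:
        return 0.0
    hits = sum(1 for prompt, succeeded in attack_probes if succeeded(assistant(prompt)))
    return hits / len(attack_probes)


def run_gate(
    assistant: Callable[[str], str],
    attack_probes: List[Probe],
    benign_probes: List[Probe],
    threshold: float = 0.0,
) -> int:
    """Return a process exit code: 0 when the gate is green, 1 otherwise."""
    asr = attack_success_rate(assistant, attack_probes)
    benign_ok = all(check(assistant(prompt)) for prompt, check in benign_probes)
    print(f"ASR={asr:.2%} benign_ok={benign_ok} threshold={threshold:.2%}")
    return 0 if asr <= threshold and benign_ok else 1
```

In CI, an entry-point script would pass the return value of `run_gate` to `sys.exit`, so a rise in ASR or a broken benign answer fails the build. Checking benign probes in the same run guards against a "fix" that blocks attacks by breaking normal answers.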

Why does a deny-list of the attacker's host not work?

A deny-list blocks only the bad values you already know. The attacker's listener on 127.0.0.1 is the same loopback host as localhost, as 127.1, and as a decimal-encoded address, so blocking one spelling leaves the others open. An allow-list of the small set of hosts you actually trust to render flips the default to deny and closes the variants you never enumerated. You prove this in the lab by bypassing your own naive fix.
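The difference can be sketched in a few lines. This is an illustration only: the allowed host `static.acme.example`, the port, and the function names are assumptions, and the filter compares the hostname as written in the URL without resolving it.

```python
import re
from urllib.parse import urlsplit

DENIED_HOSTS = {"127.0.0.1"}             # the naive fix: block the one host you saw
ALLOWED_HOSTS = {"static.acme.example"}  # hypothetical approved render host

# Captures the alt text and the URL of a markdown image.
MD_IMAGE_RE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)[^)]*\)")


def host_of(url: str) -> str:
    """Hostname as written in the URL, lowercased; empty if unparseable."""
    try:
        return (urlsplit(url).hostname or "").lower()
    except ValueError:
        return ""


def deny_list_permits(url: str) -> bool:
    return host_of(url) not in DENIED_HOSTS


def allow_list_permits(url: str) -> bool:
    return host_of(url) in ALLOWED_HOSTS


def strip_unapproved_images(text: str) -> str:
    """Keep only markdown images whose host is on the allow-list."""
    return MD_IMAGE_RE.sub(
        lambda m: m.group(0) if allow_list_permits(m.group(2)) else "[image blocked]",
        text,
    )


# Four spellings of the same loopback listener (2130706433 is 127.0.0.1 in decimal).
variants = [
    "http://127.0.0.1:8000/leak?d=x",
    "http://localhost:8000/leak?d=x",
    "http://127.1:8000/leak?d=x",
    "http://2130706433:8000/leak?d=x",
]
deny_results = [deny_list_permits(u) for u in variants]
allow_results = [allow_list_permits(u) for u in variants]
```

Here the deny-list stops only the first spelling, while the allow-list rejects all four without ever having to enumerate them, and still lets an image from the approved host through.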

Do I need a machine-learning background?

No. You read and edit Python and reason about whether an attack actually fired. Everything model-specific is explained inline, and the target ships a deterministic offline mode so your control logic is testable without a live model. The skill on test here is building durable controls and a regression gate, not model internals.