Question 1

What is an LLM guardrail layer?

Accepted Answer

A guardrail layer is code that wraps a model call and inspects what goes in
and what comes out. An input guardrail screens the prompt and any retrieved
context for injected instructions before the model sees them; an output
guardrail screens the response for policy violations (here, exfiltration
images and leaked record fields) before any downstream action like rendering
runs. In this lab you build both around an existing vulnerable assistant
rather than rewriting the model.

Question 2

What is an attack-success-rate (ASR) gate?

Accepted Answer

ASR is the fraction of attack attempts that achieve the attacker's goal. An
ASR gate runs a fixed battery of attack probes (which must fail) and benign
probes (which must still work) on every change, computes ASR, and exits
non-zero when ASR rises above a set threshold. Wiring it into CI means a
future edit that re-opens the leak breaks the build instead of shipping
silently. It is the automation that turns "we fixed it once" into "it stays
fixed."

Question 3

Why does a deny-list of the attacker's host not work?

Accepted Answer

A deny-list blocks only the bad values you already know. The attacker's
listener on 127.0.0.1 is the same loopback host as localhost, as 127.1, and
as a decimal-encoded address, so blocking one spelling leaves the others
open. An allow-list of the small set of hosts you actually trust to render
flips the default to deny and closes the variants you never enumerated. You
prove this in the lab by bypassing your own naive fix.

Question 4

Do I need a machine-learning background?

Accepted Answer

No. You read and edit Python and reason about whether an attack actually
fired. Everything model-specific is explained inline, and the target ships a
deterministic offline mode so your control logic is testable without a live
model. The skill on test here is building durable controls and a regression
gate, not model internals.

Defend a RAG Assistant: Build a Guardrail Layer and an Attack-Success-Rate CI Gate

What you'll learn

Prerequisites

Exam domains covered

Skills & technologies you'll practice

What you'll do in this lab

Frequently asked questions

What is an LLM guardrail layer?

What is an attack-success-rate (ASR) gate?

Why does a deny-list of the attacker's host not work?

Do I need a machine-learning background?