Build & submit taskBetaadvanced

Build a Guardrail Layer and an Attack-Success-Rate CI Gate

Build the defensive control: an input/output guardrail layer that sits in front of a vulnerable LLM app. The starter kit provides the target, a probe battery, and a CI-gate harness that computes attack-success-rate (ASR) over the battery and fails the build when ASR crosses a threshold, used only as the pass/fail oracle. With your guardrail wired in, battery ASR must drop below the threshold and the gate must exit zero; the naive starter stub and a planted regression leave ASR over the line so the gate exits non-zero. A benign control suite must keep passing so you prove no over-blocking. Write a short remediation rationale and submit the project as a .zip for instant, rubric-based feedback.

3 hrs

Est. time

Outcomes

Rubric criteria

65%

Pass score

What you'll learn

Skills you'll have real reps in after shipping this.

The control is the deliverable, the exploit is the oracle

A guardrail is only credible if a repeatable attack battery says the attack-success-rate fell and a benign suite says you did not break the product. The exploit exists to measure the control, not the other way around.

Guard the input and the output

Screen or taint what reaches the model (retrieved chunks, tool results, external content) and mediate what leaves it (the sink). A single layer is brittle. Two layers plus least privilege is defense in depth.

Sink-side beats model-side

An egress allow-list, a safe-markdown subset, and a parameterized tenant boundary hold even when a prompt refactor reopens a hole. EchoLeak showed input classifiers, link redaction, and CSP each failed independently, so the durable control lives at the sink.

A gate turns a fix into a regression test

Once ASR is below threshold and the gate is green, a deterministic oracle plus a non-zero exit means any future regression re-trips the build with no new work. The planted-regression run is how you prove the gate actually fails when it should.

The scenario

You own defense for a small Retrieval-Augmented Generation (RAG) assistant or tool-using agent that ships changes weekly. The offensive side already proved it: a planted document or tool result can drive the app into leaking a tenant record or firing an out-of-scope request, and a one-off manual fix does not survive the next prompt refactor. Your lead does not want another patch that quietly regresses. She wants a durable control and a gate that proves the control is still holding on every commit.

Your job is the defensive half of the red-team practice. Build a guardrail layer that mediates what reaches the model and what leaves it, then wire a CI gate that runs an attack battery, computes attack-success-rate, and fails the build when ASR crosses an agreed threshold. The exploit battery is your oracle, not your goal. The deliverable is the guardrail layer plus the gate logic: ASR below threshold with the guardrails on, the gate going red on a planted regression, and a benign control suite that keeps passing so you have not traded security for a broken product.

Your role

You are a defensive security engineer hardening an LLM application and standing up its regression gate. Your goal is a single, self-contained file that builds an input/output guardrail layer in front of a vulnerable target, runs a provided exploit battery as a deterministic oracle to measure attack-success-rate, and acts as a CI gate that fails the build when ASR crosses a threshold. You must show ASR below threshold with the guardrails on, a benign control suite still passing, and the gate firing red on a planted regression then green once it is reverted.

Start the task to unlock the full brief

You'll get the step-by-step requirements, setup commands, the 7-criterion grading rubric, tips, and the ability to submit your solution for instant AI grading.

Free to start · submit when you're ready

Learning resources

OWASP Top 10 for LLM Applications (2025)

The taxonomy your guardrails defend against, primarily LLM01 prompt injection and LLM05 improper output handling.

genai.owasp.org

OWASP LLM05:2025 Improper Output Handling

Why the output guard belongs at the sink and what improper handling looks like.

genai.owasp.org

OWASP Agentic AI Top 10 (ASI)

Agentic threats and mitigations to map your input/output mediation and least-privilege controls against.

genai.owasp.org

MITRE ATLAS

Map the probes and your mitigations to the relevant techniques (verify the current technique IDs, for example AML.T0051 LLM Prompt Injection).

atlas.mitre.org

NIST AI Risk Management Framework (AI RMF 1.0)

The Measure and Manage functions that an ASR gate and a guardrail layer operationalize for continuous defense.

nist.gov

What this task is

This is a build-and-submit defensive-security task, not a quiz about guardrails. You build the control: an input/output guardrail layer that sits in front of a vulnerable LLM app. The starter kit provides the target, the probe battery, and a CI-gate harness that computes attack-success-rate (ASR) over the battery and fails the build when ASR crosses a threshold. The provided battery and harness are the pass/fail oracle, so the center of gravity is the control you ship, not the attack. This is the inverse of the offensive harness task: there you build the harness, here it is given and you build the defense it measures.

The skill here is durable, measurable defense. You screen or taint what reaches the model and mediate what leaves it with a sink-side control such as an egress allow-list (ipaddress and urlparse), a safe-markdown subset that drops auto-loaded images, or a parameterized tenant boundary. Then you prove it: ASR falls below the threshold with the guardrails on, a benign control suite keeps passing so you have not over-blocked, and a planted regression pushes ASR back over the line so the gate goes red, then green once reverted. This maps to OWASP LLM01 and LLM05, the OWASP Agentic Top 10 (ASI), and MITRE ATLAS, with the NIST AI RMF Measure and Manage functions as the framing.

Grading is rubric-based and explainable. Your submission is scored against weighted criteria (the guardrail layer being present and correct, ASR dropping below threshold, the gate going red on a planted regression and green once fixed, the benign suite still passing, a deterministic oracle, and the remediation rationale) with per-criterion feedback. The pass threshold is 65 percent and you can resubmit.

Frequently asked questions

Is this an attack task or a defense task?

It is a defense task. You build a hardened component, an input/output guardrail layer plus an ASR CI gate. A provided exploit battery is used only as the deterministic oracle that proves attack-success-rate dropped. The deliverable is the control, not the exploit.

Do I need a paid API key?

No. The starter kit ships a deterministic stub model and runs offline on the standard library alone, so the ASR and the gate are repeatable with no key. You can also point the target at any OpenAI-compatible endpoint. The rubric rewards the guardrail layer, the ASR dropping below threshold, the gate going green, and a benign suite that still passes, not which model you used.

What makes a good guardrail here?

A two-sided, sink-side control. An input guard that taints and screens retrieved/tool/external content, and an output guard that is a durable primitive at the sink: an egress allow-list (ipaddress + urlparse), a safe-markdown subset that drops auto-loaded images, a parameterized tenant boundary, or escaping. A model-side classifier counts only as added depth, never the primary control.

Why must the gate go red on a planted regression?

Because a gate you have only seen pass has not been proven to fail. You weaken one guard so ASR climbs back over the threshold, watch the gate exit non-zero, then revert and watch it pass again. Red, then green, is the evidence that the regression test actually guards the fix on every commit.