Build & submit taskBetaadvanced

Build a CI Red-Team Harness That Gates on Attack-Success-Rate

Build a tested CI red-team harness: it runs a battery of attacks across at least five classes against a target, scores each attempt with a deterministic oracle, triages true vs false positives, computes per-class and overall attack-success-rate, emits a machine-readable redteam_report.json, and exits non-zero when ASR crosses a threshold so the build fails. You prove it with tests and by showing the gate fire RED on a vulnerable build and GREEN on a hardened one. Submit the project as a .zip for instant, rubric-based feedback.

3 hrs

Est. time

Outcomes

Rubric criteria

65%

Pass score

What you'll learn

Skills you'll have real reps in after shipping this.

ASR is only as good as its oracle

Anchor success to a deterministic application effect (sentinel or listener callback). An LLM judge makes the number flake and the gate untrustworthy.

Triage is the measurement

A scanner over-reports. Until you drop the false positives (refusals, verbal play-along), the ASR is inflated and the gate is meaningless.

Aligned models resist blatant jailbreaks

Direct-injection and jailbreak classes mostly read clean against an aligned model. The reliable finding is the application flaw: poison the corpus, ask a benign question.

A fix is a regression test

Once the gate is red on the vulnerable build and green on the hardened one, any future regression re-trips it with no new work.

The scenario

Manual testing does not scale and does not guard against regression. Your team has a small Retrieval-Augmented Generation (RAG) assistant in production and a habit of shipping prompt and retrieval changes weekly. Every release risks reopening a prompt-injection or data-leak hole someone already fixed, and nobody can prove a fix held until it breaks again in the wild.

Your lead wants the automated half of the red-team practice: a harness that runs a battery of attacks on every commit, computes attack-success-rate, triages the scanner's false positives, writes a report a dashboard can read, and fails CI when the rate crosses an agreed threshold. That harness, red on the vulnerable build and green on the hardened one, is this task.

Your role

You are an offensive security engineer standing up red-team automation for an LLM application. Your goal is a small, tested project: a harness that runs a multi-class attack battery, scores it against a deterministic oracle, triages true vs false positives, measures attack-success-rate, emits a machine-readable report, and acts as a CI gate that fails the build when the application regresses.

Start the task to unlock the full brief

You'll get the step-by-step requirements, setup commands, the 6-criterion grading rubric, tips, and the ability to submit your solution for instant AI grading.

Free to start · submit when you're ready

Learning resources

garak: LLM vulnerability scanner

NVIDIA's batteries-included LLM fuzzer and scanner, the discovery tool this harness complements.

github.com

PyRIT: Python Risk Identification Toolkit

Microsoft's programmable red-team framework for composing targets, attacks, and scorers.

github.com

OWASP Top 10 for LLM Applications (2025)

LLM01 injection, LLM05 output handling, LLM07 system-prompt leakage: the classes the battery probes.

genai.owasp.org

MITRE ATLAS: AML.T0051 LLM Prompt Injection

The technique and its direct and indirect sub-techniques.

atlas.mitre.org

What this task is

This is a build-and-submit task, not a quiz about red-team tooling. You produce a tested harness project that runs a battery of attacks across at least five classes against an LLM application, scores each attempt with a deterministic oracle, triages true vs false positives, computes attack-success-rate, emits a machine-readable redteam_report.json, and acts as a CI gate that fails the build when attack-success-rate crosses a threshold. You prove it with tests and by showing the gate red on a vulnerable build and green on a hardened one.

The skill here is measurement and regression discipline. A scan that prints FAIL is not actionable until you can say how often the attack succeeded and how you know it succeeded. You anchor success to an application-level effect, an exfil-listener callback or a sentinel substring, and you drop the scanner's false positives (refusals, verbal play-along), so the number is the same on every run and your CI gate does not flake. This maps to OWASP LLM01 and LLM05, the OWASP Agentic Top 10 entries ASI01 and ASI06, and MITRE ATLAS AML.T0051.

Grading is rubric-based and explainable. Your submission is scored against weighted criteria (runnable harness, five-class battery, a tested harness whose ASR is correct and whose gate actually fires on regression, the machine-readable report, a deterministic oracle with triage, and the design rationale) with per-criterion feedback. The pass threshold is 65 percent and you can resubmit.

Frequently asked questions

Do I need a paid API key?

No. The starter kit ships a deterministic stub model and runs offline on the standard library alone, so the ASR and the gate are repeatable with no key. You can also point it at any OpenAI-compatible endpoint. The rubric rewards the tested harness, the ASR measurement, and the working gate, not which model you used.

Why must the oracle be deterministic?

Because the harness gates CI. An LLM judge returns slightly different verdicts run to run, so attack-success-rate drifts and the gate flakes red and green on identical code. Anchor success to a sentinel substring or a listener callback so the number is repeatable, and triage out the false positives.

Which attack classes actually fire?

Against an aligned model, direct-injection and jailbreak classes mostly read clean, and your report should say so. The reliable finding is the application flaw: poison your own RAG corpus and send a benign question, the indirect-injection class, and watch the account record exfiltrate through the render sink.

What does the CI gate need to do?

Compute overall attack-success-rate from true positives, compare it to a threshold, and exit non-zero when it is exceeded so the build fails. You must prove it with tests and show it red on a vulnerable build and green on a hardened one, which is the regression test a real practice relies on.