Safety & Guardrails for AI Agents
Hosted
Beta

Safety & Guardrails for AI Agents

Build a guarded IT support agent that blocks jailbreaks, refuses off-topic questions, and safely handles IT queries — using keyword checks, LLM-based validation, and NeMo Guardrails.

35 min·6 steps·3 domains·Intermediate·ncp-aaincp-genl

What you'll learn

  1. 1
    Why Agents Need Guardrails
    LLMs are helpful by default — they try to answer whatever you ask. This is great for productivity, but dangerous when an agent has access to real tools:
  2. 2
    LLM-Based Safety Checks
    Keyword-based checks (step 1) are fast but brittle. A clever attacker can rephrase to bypass them:
  3. 3
    NeMo Guardrails: Hello World
    NeMo Guardrails is NVIDIA's production framework for AI safety. Instead of writing custom Python safety checks, you define guardrails in a declarative configuration — YAML for the model/rails setup, and Colang for conversation flows.
  4. 4
    Input Rails: Jailbreak Prevention
    NeMo Guardrails' self check input pattern automates LLM-based safety checking. Instead of writing Python code to call the LLM and parse the response (like step 2), you declare the policy in YAML and NeMo Guardrails handles everything.
  5. 5
    Topical Rails: Staying On-Topic
    Input rails (step 4) check for harmful content — jailbreaks, attacks, dangerous requests. But what about messages that are harmless but off-topic?
  6. 6
    Guarded Agent: Putting It Together
    Steps 1-5 taught individual guardrail techniques. Now let's combine everything into a complete guarded agent — the production architecture the NCP-AAI exam expects you to understand.

Prerequisites

  • Completed Lab 1 (ReAct Agent) or equivalent
  • Basic understanding of LLM prompt injection risks
  • Familiarity with YAML configuration

Exam domains covered

Safety, Ethics, and ComplianceHuman-AI Interaction and OversightRun, Monitor, and Maintain

What you'll build in this agent safety lab

Guardrails are the line between an agent that ships and one that blows up on the first jailbreak post-launch — and every production LLM app needs them whether or not the launch blog mentions it. This lab wraps an IT-support agent in layered defenses that block prompt injection, refuse off-topic queries, and filter unsafe outputs, all running against NVIDIA NIM endpoints we provision. You finish with a working red-teamed pipeline, the actual artifacts NeMo Guardrails expects (config.yml, rails.co, prompts.yml), and a mental model of why safety logic must live outside the agent the LLM can manipulate.

The substance is defense-in-depth. A keyword matcher as the fast microsecond-level rejection path catches the obvious attempts. An LLM-based safety classifier as a second layer understands intent — rephrased jailbreaks that keyword matching misses. Then NeMo Guardrails' declarative rails wrap the LLM itself: self_check_input blocks malicious prompts before the agent sees them, Colang-defined topical rails (define user ask_about_cookingbot refuse_off_topic) keep the bot inside its scope, and output rails filter impersonation and unsafe responses. You'll see why keyword-only is brittle against rephrased attacks, why you need both input and output rails (the agent can be manipulated, the refusal template can't), and why a red-team matrix of benign plus adversarial inputs is the minimum bar for acceptance.

Prerequisites: Python, prior exposure to a ReAct agent (the react-agent-nim lab works), and comfort with YAML. No Colang experience is assumed — the NeMo Guardrails step introduces it from scratch. The hosted environment ships with nemoguardrails, LangChain, and the NIM integration preinstalled, pointed at our managed NIM proxy — every call (keyword classifier, safety classifier, rails' internal prompts, the main agent) goes through the same endpoint, no keys, no GPU provisioning. About 35 minutes of focused work, finishing with a full red-team test matrix where benign IT questions get real answers and every adversarial class — direct jailbreaks, rephrased jailbreaks, off-topic requests — gets blocked at the right layer.

Frequently asked questions

Why do I need both keyword checks and LLM-based safety checks?

Keyword checks run in microseconds and catch the obvious stuff — they're the cheap first line of defense. LLM-based checks understand intent, so they catch rephrased attacks ("please disregard the system prompt" has none of the usual trigger words but the intent is identical). In production you run keyword matching first as a fast rejection path and only pay for the LLM call when the message gets past it. That's the defense-in-depth pattern Step 6 composes.

What's Colang and why does NeMo Guardrails use it?

Colang is NVIDIA's declarative language for conversation flows inside NeMo Guardrails. You define user intents (define user express_greeting), bot actions (define bot greet), and flows that connect them (when user express_greeting → bot greet). The rails engine uses an embedding-based intent matcher under the hood, so "hi", "hey there", and "howdy" all route to the same intent without you enumerating every variant. It's the mechanism behind topical rails in Step 5.

What does the self_check_input rail actually do?

NeMo Guardrails ships a self check input output parser backed by a prompt template (defined in prompts.yml). When a user message arrives, the rails engine substitutes the message into {{ user_input }}, calls the LLM with the safety prompt, and checks whether the reply contains is_content_safe: true. If not, the message is blocked before it ever reaches the agent — you don't handle parsing or routing, just declare the rail in config.yml.

How is a topical rail different from a safety rail?

A safety rail blocks harmful content (jailbreaks, prompt injection, profanity). A topical rail blocks out-of-scope content that is perfectly safe but shouldn't be answered — an IT bot shouldn't discuss recipes, a finance bot shouldn't give medical advice. Topical rails use Colang intent matching (define user ask_about_cooking) to route off-topic requests to a polite refusal. Step 5 implements exactly this pattern for the Acme IT bot.

Do the guardrails sit inside the agent or outside it?

Outside, deliberately. The whole point is that an agent's LLM can be manipulated by the user — that's what prompt injection is — so the safety logic has to live in code or config the LLM can't rewrite. In Step 6's architecture the request flows user → input guardrails → agent → output guardrails → user. The agent never sees a blocked message, and a blocked response never reaches the user. This separation is what the NCP-AAI exam domain on Safety, Ethics, and Compliance is testing your mental model of.

What proves the guarded agent actually works at the end?

Step 6 runs a red-team test matrix of benign IT questions plus a range of adversarial inputs — direct jailbreak attempts, rephrased versions, off-topic cooking questions, and clean password-reset requests. The check script asserts that benign inputs return a real answer and that every adversarial class is blocked either by the keyword rail, the LLM rail, or the topical rail. A passing Step 6 means your whole pipeline — not just one layer — holds up against the matrix.