Safety & Guardrails for AI Agents
Build a guarded IT support agent that blocks jailbreaks, refuses off-topic questions, and safely handles IT queries — using keyword checks, LLM-based validation, and NeMo Guardrails.
What you'll learn
- 1Why Agents Need GuardrailsLLMs are helpful by default — they try to answer whatever you ask. This is great for productivity, but dangerous when an agent has access to real tools:
- 2LLM-Based Safety ChecksKeyword-based checks (step 1) are fast but brittle. A clever attacker can rephrase to bypass them:
- 3NeMo Guardrails: Hello WorldNeMo Guardrails is NVIDIA's production framework for AI safety. Instead of writing custom Python safety checks, you define guardrails in a declarative configuration — YAML for the model/rails setup, and Colang for conversation flows.
- 4Input Rails: Jailbreak PreventionNeMo Guardrails' self check input pattern automates LLM-based safety checking. Instead of writing Python code to call the LLM and parse the response (like step 2), you declare the policy in YAML and NeMo Guardrails handles everything.
- 5Topical Rails: Staying On-TopicInput rails (step 4) check for harmful content — jailbreaks, attacks, dangerous requests. But what about messages that are harmless but off-topic?
- 6Guarded Agent: Putting It TogetherSteps 1-5 taught individual guardrail techniques. Now let's combine everything into a complete guarded agent — the production architecture the NCP-AAI exam expects you to understand.
Prerequisites
- Completed Lab 1 (ReAct Agent) or equivalent
- Basic understanding of LLM prompt injection risks
- Familiarity with YAML configuration
Exam domains covered
What you'll build in this agent safety lab
Guardrails are the line between an agent that ships and one that blows up on the first jailbreak post-launch — and every production LLM app needs them whether or not the launch blog mentions it. This lab wraps an IT-support agent in layered defenses that block prompt injection, refuse off-topic queries, and filter unsafe outputs, all running against NVIDIA NIM endpoints we provision. You finish with a working red-teamed pipeline, the actual artifacts NeMo Guardrails expects (config.yml, rails.co, prompts.yml), and a mental model of why safety logic must live outside the agent the LLM can manipulate.
The substance is defense-in-depth. A keyword matcher as the fast microsecond-level rejection path catches the obvious attempts. An LLM-based safety classifier as a second layer understands intent — rephrased jailbreaks that keyword matching misses. Then NeMo Guardrails' declarative rails wrap the LLM itself: self_check_input blocks malicious prompts before the agent sees them, Colang-defined topical rails (define user ask_about_cooking → bot refuse_off_topic) keep the bot inside its scope, and output rails filter impersonation and unsafe responses. You'll see why keyword-only is brittle against rephrased attacks, why you need both input and output rails (the agent can be manipulated, the refusal template can't), and why a red-team matrix of benign plus adversarial inputs is the minimum bar for acceptance.
Prerequisites: Python, prior exposure to a ReAct agent (the react-agent-nim lab works), and comfort with YAML. No Colang experience is assumed — the NeMo Guardrails step introduces it from scratch. The hosted environment ships with nemoguardrails, LangChain, and the NIM integration preinstalled, pointed at our managed NIM proxy — every call (keyword classifier, safety classifier, rails' internal prompts, the main agent) goes through the same endpoint, no keys, no GPU provisioning. About 35 minutes of focused work, finishing with a full red-team test matrix where benign IT questions get real answers and every adversarial class — direct jailbreaks, rephrased jailbreaks, off-topic requests — gets blocked at the right layer.
Frequently asked questions
Why do I need both keyword checks and LLM-based safety checks?
What's Colang and why does NeMo Guardrails use it?
define user express_greeting), bot actions (define bot greet), and flows that connect them (when user express_greeting → bot greet). The rails engine uses an embedding-based intent matcher under the hood, so "hi", "hey there", and "howdy" all route to the same intent without you enumerating every variant. It's the mechanism behind topical rails in Step 5.What does the self_check_input rail actually do?
self check input output parser backed by a prompt template (defined in prompts.yml). When a user message arrives, the rails engine substitutes the message into {{ user_input }}, calls the LLM with the safety prompt, and checks whether the reply contains is_content_safe: true. If not, the message is blocked before it ever reaches the agent — you don't handle parsing or routing, just declare the rail in config.yml.How is a topical rail different from a safety rail?
define user ask_about_cooking) to route off-topic requests to a polite refusal. Step 5 implements exactly this pattern for the Acme IT bot.Do the guardrails sit inside the agent or outside it?
user → input guardrails → agent → output guardrails → user. The agent never sees a blocked message, and a blocked response never reaches the user. This separation is what the NCP-AAI exam domain on Safety, Ethics, and Compliance is testing your mental model of.