Evaluate an Agent with LLM-as-Judge

Hosted · Beta

Build an eval harness that scores agent responses automatically — correctness via a reference-based judge, plus an accuracy metric and A/B comparison. Same pattern used by NeMo Evaluator for production agent evaluation.

30 min · 5 steps · 3 domains · Intermediate · ncp-aai

What you'll learn

  1. Load a test dataset
    A working agent is not a good agent. Before you ship one, you need numbers: *how often does it get the answer right? how robust is its tone? how often does it call the wrong tool?* Those numbers come from an eval harness.
  2. Generate agent responses
    Before scoring, you need outputs to score. For this lab we'll use a minimal agent: a NIM client with a short system prompt. In your production project this slot is filled by your real agent (a ReAct loop, a LangGraph supervisor, etc.), but the eval harness is identical. A minimal version is sketched just after this list.
  3. Build an LLM-as-judge
    The naive approach — prediction == reference — fails fast. "Paris" equals "Paris", but "the city of Paris" doesn't, and neither of those equals "It's Paris." String equality is too brittle.
  4. Score the dataset and aggregate
    A single {"score":"correct"} is a data point. You need an aggregate metric to make decisions. Two easy ones: accuracy and invalid_rate.
  5. Compare two agent variants
    One score on its own tells you very little. Evals are most useful *comparatively* — is version B better than version A? Swap a prompt, a model, or a tool, rerun the harness, compare numbers.
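
Here is a minimal sketch of steps 1 and 2: a hand-written DATASET and a run_agent() prediction loop. It assumes the OpenAI Python SDK pointed at an OpenAI-compatible NIM endpoint; the base URL, model id, and prompt wording below are placeholders, not the lab's exact values.

```python
from openai import OpenAI

# Placeholder endpoint and key: the hosted lab provisions the real proxy.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Step 1: a tiny ground-truth dataset of question/reference pairs.
DATASET = [
    {"question": "What is the capital of France?", "reference": "Paris"},
    {"question": "How many legs does a spider have?", "reference": "8"},
]

SYSTEM_PROMPT = "Answer in a single short phrase."  # placeholder wording

def run_agent(question: str, system_prompt: str = SYSTEM_PROMPT) -> str:
    """The system under test: one NIM chat call with a short system prompt."""
    resp = client.chat.completions.create(
        model="nvidia/llama-3.1-nemotron-70b-instruct",  # placeholder model id
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

# Step 2: one prediction per ground-truth row.
PREDICTIONS = [{**row, "prediction": run_agent(row["question"])} for row in DATASET]
```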

Prerequisites

  • Basic Python (functions, dicts, list comprehensions)
  • Completed `react-agent-nim` or equivalent agent exposure
  • Familiarity with JSON

Exam domains covered

Agent Evaluation and Observability · Agent Development · NVIDIA Platform Implementation

Skills & technologies you'll practice

This intermediate-level AI/ML lab gives you real-world reps across:

Evaluation · LLM-as-Judge · Ragas · NeMo Evaluator · NIM

What you'll build in this LLM-as-judge eval lab

A working agent is not a good agent, and every team that's shipped a production LLM app has learned the hard way that you can't eyeball 200 responses and call it tested. This lab builds the four-stage evaluation pipeline every serious agent team runs — dataset, predictions, judgments, aggregate metric — and finishes with an A/B comparison of two system prompts scored on the same questions through the same judge. You walk away with a working LLM-as-judge harness, accuracy and invalid_rate numbers you actually trust, and a mental model that ports directly to NVIDIA NeMo Evaluator (same shape, smaller footprint). Everything runs on NVIDIA NIM endpoints we provision.

The heart of the lab is LLM-as-judge with structured output. You prompt a second model with the question, reference answer, and candidate prediction and require JSON like {"score": "correct" | "incorrect", "reason": "..."} that your code validates with json.loads. You'll see why string equality fails on natural paraphrases ("Paris" vs "the city of Paris"), why free-text judgments can't be aggregated (you can't compute accuracy from prose), and why the invalid_rate metric is the underappreciated signal that catches a broken judge before you start drawing conclusions about your agent. The A/B step refactors run_agent to take a system_prompt argument so you can isolate prompt changes as a single variable — the minimum useful shape of any eval comparison.
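
To make that concrete, here is a sketch of such a judge. It reuses the client from the earlier snippet; the judge model id and rubric wording are placeholders rather than the lab's exact prompt.

```python
import json

JUDGE_PROMPT = """You are grading an answer against a reference.
Question: {question}
Reference answer: {reference}
Candidate answer: {prediction}

Reply with JSON only: {{"score": "correct" or "incorrect", "reason": "<one sentence>"}}"""

def judge(question: str, reference: str, prediction: str) -> dict:
    """Ask a second model for a structured verdict; flag anything off-schema as invalid."""
    resp = client.chat.completions.create(
        model="nvidia/llama-3.1-nemotron-70b-instruct",  # placeholder judge model id
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, prediction=prediction)}],
        temperature=0.0,
    )
    raw = resp.choices[0].message.content
    try:
        verdict = json.loads(raw)
        if verdict.get("score") not in ("correct", "incorrect"):
            raise ValueError("score outside the allowed labels")
        return verdict
    except (json.JSONDecodeError, ValueError):
        # Keep the raw reply so a broken judge is debuggable, not silently dropped.
        return {"score": "invalid", "reason": raw}
```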

Prerequisites: Python with list comprehensions and dicts, basic familiarity with JSON, and prior exposure to an agent (the react-agent-nim lab is the natural entry point). The hosted environment ships with the OpenAI Python SDK pointed at our managed NIM proxy — both the system-under-test calls and the judge calls run against Nemotron via the same endpoint, no keys, no GPU pod. About 30 minutes of focused work. You leave with a DATASET of ground-truth rows, a PREDICTIONS list, a judge() that returns parseable structured output, aggregate accuracy plus invalid-rate metrics, and A/B numbers on two prompt variants — the same artifacts NeMo Evaluator produces on managed runs.

Frequently asked questions

What does LLM-as-judge mean and why not just compare strings?

LLM-as-judge means using a second LLM to evaluate whether a prediction matches a reference on semantic grounds rather than character-by-character. prediction == reference fails on natural paraphrases: "Paris", "the city of Paris", and "It's Paris." are all correct answers to "What is the capital of France?" but only one string matches. A judge LLM reads the reference, reads the prediction, applies a rubric (e.g., "score correct if the prediction matches the reference in meaning"), and returns a structured verdict you can aggregate into metrics.
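
A quick illustration, reusing the judge() sketched above; exact model behavior varies, but a reasonable judge accepts all three paraphrases.

```python
reference = "Paris"
candidates = ["Paris", "the city of Paris", "It's Paris."]

# Exact string match: only the literal "Paris" passes.
print([c == reference for c in candidates])  # [True, False, False]

# Semantic judging: a capable judge typically marks all three "correct".
print([judge("What is the capital of France?", reference, c)["score"] for c in candidates])
```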

Why does the judge return JSON instead of free text?

Because you need to count verdicts at scale. Free text like "This answer is mostly correct but a bit wordy" is impossible to aggregate — you can't compute accuracy from prose. Requiring the judge to emit {"score": "correct" | "incorrect", "reason": "..."} and calling json.loads on the reply means every row reduces to a single categorical label you can sum. The invalid_rate metric exists specifically to catch the case where the judge drifted off the schema.
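
For example, once every reply has been parsed into a dict, counting verdicts is trivial. This assumes the PREDICTIONS list and judge() from the sketches above.

```python
# One judge call per prediction; each reply is either a valid verdict or flagged invalid.
JUDGMENTS = [
    judge(row["question"], row["reference"], row["prediction"])
    for row in PREDICTIONS
]

# Categorical labels sum directly into a metric; free-text prose never could.
correct = sum(1 for j in JUDGMENTS if j["score"] == "correct")
print(f"{correct}/{len(JUDGMENTS)} rows judged correct")
```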

What's the invalid_rate metric measuring and why does it matter?

invalid_rate = invalid / total is the fraction of judge calls that returned unparseable or schema-violating output. A high invalid_rate doesn't mean your agent is bad — it means your judge prompt, your model choice, or your parsing is bad. Tracking it separately from accuracy prevents you from drawing conclusions about the system under test when the measurement instrument itself is broken. It's a diagnostic on the eval pipeline.
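
A sketch of the aggregation step under those definitions; the field names follow the verdict schema used above.

```python
def aggregate(judgments: list[dict]) -> dict:
    """Report agent quality and judge health as separate numbers."""
    total = len(judgments)
    correct = sum(1 for j in judgments if j["score"] == "correct")
    invalid = sum(1 for j in judgments if j["score"] not in ("correct", "incorrect"))
    return {
        "accuracy": correct / total,      # quality of the system under test
        "invalid_rate": invalid / total,  # health of the eval pipeline itself
    }
```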

How is the A/B comparison in Step 5 actually meaningful?

Because same dataset, same judge, same harness. You refactor run_agent to take a system_prompt argument, define PROMPT_A (the terse original) and PROMPT_B (a tighter "one short token or word" version), then run the identical scoring pipeline twice. Accuracy_A vs Accuracy_B on the same questions with the same judge isolates the prompt change as the only variable. That's the minimum useful shape of an eval comparison — swap one thing, rerun, compare numbers.
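
In code, the comparison reduces to calling one pipeline twice. run_agent, judge, and aggregate are the sketches above; the prompt texts here are placeholders, not the lab's exact wording.

```python
PROMPT_A = "Answer in a single short phrase."            # terse original (placeholder)
PROMPT_B = "Answer with one short token or word only."   # tighter variant (placeholder)

def evaluate(system_prompt: str) -> dict:
    """Same dataset, same judge, same aggregation; only the prompt changes."""
    preds = [{**row, "prediction": run_agent(row["question"], system_prompt)} for row in DATASET]
    judgments = [judge(r["question"], r["reference"], r["prediction"]) for r in preds]
    return aggregate(judgments)

results = {"A": evaluate(PROMPT_A), "B": evaluate(PROMPT_B)}
print(results)  # compare accuracy (and invalid_rate) for A vs B on identical questions
```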

Is this the same thing as NeMo Evaluator?

Same shape, smaller footprint. NeMo Evaluator exposes this pipeline as a managed service: upload a dataset, configure evaluators (accuracy, RAGAS, custom judges), let it run your agent and score every row, get metrics back. This lab hand-builds the equivalent in ~100 lines of Python so you understand what Evaluator is actually doing under the hood. Once the concepts click here, reading NeMo Evaluator's docs becomes straightforward — you already know which fields map to which.

Can the same model act as both the agent and the judge?

Technically yes, and this lab uses Nemotron for both — but it's not ideal for production. A judge that's the same family as the system under test can share biases: it may rate answers as correct that it would have generated itself. In a real eval pipeline you want judges from a different model family, or at least a larger, more capable model than the one you're scoring. The harness you build here is model-agnostic — swap the model parameter on the judge call and you've got a different judge.