Evaluate an Agent with LLM-as-Judge

Hosted · Beta

Build an eval harness that scores agent responses automatically — correctness via a reference-based judge, plus an accuracy metric and A/B comparison. Same pattern used by NeMo Evaluator for production agent evaluation.

30 min · 5 steps · 3 domains · Intermediate · ncp-aai

What you'll learn

  1. Load a test dataset
    A working agent is not a good agent. Before you ship one, you need numbers: *how often does it get the answer right? how robust is its tone? how often does it call the wrong tool?* Those numbers come from an eval harness.
  2. Generate agent responses
    Before scoring, you need outputs to score. For this lab we'll use a minimal agent: a NIM client with a short system prompt. In your production project this slot is filled by your real agent (a ReAct loop, a LangGraph supervisor, etc.), but the eval harness is identical. A minimal version is sketched just after this list.
  3. Build an LLM-as-judge
    The naive approach — prediction == reference — fails fast. "Paris" equals "Paris", but "the city of Paris" doesn't, and neither of those equals "It's Paris." String equality is too brittle.
  4. Score the dataset and aggregate
    A single {"score":"correct"} is a data point. You need an aggregate metric to make decisions. Two easy ones: accuracy and invalid_rate.
  5. Compare two agent variants
    One score on its own tells you very little. Evals are most useful *comparatively* — is version B better than version A? Swap a prompt, a model, or a tool, rerun the harness, compare numbers.
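
Here is a minimal sketch of steps 1 and 2: a hand-written DATASET and a run_agent() prediction loop. It assumes the OpenAI Python SDK pointed at an OpenAI-compatible NIM endpoint; the base URL, model id, and prompt wording below are placeholders, not the lab's exact values.

```python
from openai import OpenAI

# Placeholder endpoint and key: the hosted lab provisions the real proxy.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Step 1: a tiny ground-truth dataset of question/reference pairs.
DATASET = [
    {"question": "What is the capital of France?", "reference": "Paris"},
    {"question": "How many legs does a spider have?", "reference": "8"},
]

SYSTEM_PROMPT = "Answer in a single short phrase."  # placeholder wording

def run_agent(question: str, system_prompt: str = SYSTEM_PROMPT) -> str:
    """The system under test: one NIM chat call with a short system prompt."""
    resp = client.chat.completions.create(
        model="nvidia/llama-3.1-nemotron-70b-instruct",  # placeholder model id
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

# Step 2: one prediction per ground-truth row.
PREDICTIONS = [{**row, "prediction": run_agent(row["question"])} for row in DATASET]
```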

Prerequisites

  • Basic Python (functions, dicts, list comprehensions)
  • Completed `react-agent-nim` or equivalent agent exposure
  • Familiarity with JSON

Exam domains covered

Agent Evaluation and Observability · Agent Development · NVIDIA Platform Implementation

Skills & technologies you'll practice

This intermediate-level AI/ML lab gives you real-world reps across:

Evaluation · LLM-as-Judge · Ragas · NeMo Evaluator · NIM

What you'll build in this LLM-as-judge eval lab

A working agent is not a good agent, and every team that's shipped a production LLM app has learned the hard way that you can't eyeball 200 responses and call it tested. This lab builds the four-stage evaluation pipeline every serious agent team runs — dataset, predictions, judgments, aggregate metric — and finishes with an A/B comparison of two system prompts scored on the same questions through the same judge. You walk away with a working LLM-as-judge harness, accuracy and invalid_rate numbers you actually trust, and a mental model that ports directly to NVIDIA NeMo Evaluator (same shape, smaller footprint). Everything runs on NVIDIA NIM endpoints we provision.

The heart of the lab is LLM-as-judge with structured output. You prompt a second model with the question, reference answer, and candidate prediction and require JSON like {"score": "correct" | "incorrect", "reason": "..."} that your code validates with json.loads. You'll see why string equality fails on natural paraphrases ("Paris" vs "the city of Paris"), why free-text judgments can't be aggregated (you can't compute accuracy from prose), and why the invalid_rate metric is the underappreciated signal that catches a broken judge before you start drawing conclusions about your agent. The A/B step refactors run_agent to take a system_prompt argument so you can isolate prompt changes as a single variable — the minimum useful shape of any eval comparison.
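
To make that concrete, here is a sketch of such a judge. It reuses the client from the earlier snippet; the judge model id and rubric wording are placeholders rather than the lab's exact prompt.

```python
import json

JUDGE_PROMPT = """You are grading an answer against a reference.
Question: {question}
Reference answer: {reference}
Candidate answer: {prediction}

Reply with JSON only: {{"score": "correct" or "incorrect", "reason": "<one sentence>"}}"""

def judge(question: str, reference: str, prediction: str) -> dict:
    """Ask a second model for a structured verdict; flag anything off-schema as invalid."""
    resp = client.chat.completions.create(
        model="nvidia/llama-3.1-nemotron-70b-instruct",  # placeholder judge model id
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, prediction=prediction)}],
        temperature=0.0,
    )
    raw = resp.choices[0].message.content
    try:
        verdict = json.loads(raw)
        if verdict.get("score") not in ("correct", "incorrect"):
            raise ValueError("score outside the allowed labels")
        return verdict
    except (json.JSONDecodeError, ValueError):
        # Keep the raw reply so a broken judge is debuggable, not silently dropped.
        return {"score": "invalid", "reason": raw}
```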

Prerequisites: Python with list comprehensions and dicts, basic familiarity with JSON, and prior exposure to an agent (the react-agent-nim lab is the natural entry point). The hosted environment ships with the OpenAI Python SDK pointed at our managed NIM proxy — both the system-under-test calls and the judge calls run against Nemotron via the same endpoint, no keys, no GPU pod. About 30 minutes of focused work. You leave with a DATASET of ground-truth rows, a PREDICTIONS list, a judge() that returns parseable structured output, aggregate accuracy plus invalid-rate metrics, and A/B numbers on two prompt variants — the same artifacts NeMo Evaluator produces on managed runs.

Frequently asked questions

What does LLM-as-judge mean and why not just compare strings?

LLM-as-judge means using a second LLM to evaluate whether a prediction matches a reference on semantic grounds rather than character-by-character. prediction == reference fails on natural paraphrases: "Paris", "the city of Paris", and "It's Paris." are all correct answers to "What is the capital of France?" but only one string matches. A judge LLM reads the reference, reads the prediction, applies a rubric (e.g., "score correct if the prediction matches the reference in meaning"), and returns a structured verdict you can aggregate into metrics.
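
A quick illustration, reusing the judge() sketched above; exact model behavior varies, but a reasonable judge accepts all three paraphrases.

```python
reference = "Paris"
candidates = ["Paris", "the city of Paris", "It's Paris."]

# Exact string match: only the literal "Paris" passes.
print([c == reference for c in candidates])  # [True, False, False]

# Semantic judging: a capable judge typically marks all three "correct".
print([judge("What is the capital of France?", reference, c)["score"] for c in candidates])
```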

Why does the judge return JSON instead of free text?

Because you need to count verdicts at scale. Free text like "This answer is mostly correct but a bit wordy" is impossible to aggregate — you can't compute accuracy from prose. Requiring the judge to emit {"score": "correct" | "incorrect", "reason": "..."} and calling json.loads on the reply means every row reduces to a single categorical label you can sum. The invalid_rate metric exists specifically to catch the case where the judge drifted off the schema.
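
For example, once every reply has been parsed into a dict, counting verdicts is trivial. This assumes the PREDICTIONS list and judge() from the sketches above.

```python
# One judge call per prediction; each reply is either a valid verdict or flagged invalid.
JUDGMENTS = [
    judge(row["question"], row["reference"], row["prediction"])
    for row in PREDICTIONS
]

# Categorical labels sum directly into a metric; free-text prose never could.
correct = sum(1 for j in JUDGMENTS if j["score"] == "correct")
print(f"{correct}/{len(JUDGMENTS)} rows judged correct")
```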

What's the invalid_rate metric measuring and why does it matter?

invalid_rate = invalid / total is the fraction of judge calls that returned unparseable or schema-violating output. A high invalid_rate doesn't mean your agent is bad — it means your judge prompt, your model choice, or your parsing is bad. Tracking it separately from accuracy prevents you from drawing conclusions about the system under test when the measurement instrument itself is broken. It's a diagnostic on the eval pipeline.
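
A sketch of the aggregation step under those definitions; the field names follow the verdict schema used above.

```python
def aggregate(judgments: list[dict]) -> dict:
    """Report agent quality and judge health as separate numbers."""
    total = len(judgments)
    correct = sum(1 for j in judgments if j["score"] == "correct")
    invalid = sum(1 for j in judgments if j["score"] not in ("correct", "incorrect"))
    return {
        "accuracy": correct / total,      # quality of the system under test
        "invalid_rate": invalid / total,  # health of the eval pipeline itself
    }
```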

How is the A/B comparison in Step 5 actually meaningful?

Because same dataset, same judge, same harness. You refactor run_agent to take a system_prompt argument, define PROMPT_A (the terse original) and PROMPT_B (a tighter "one short token or word" version), then run the identical scoring pipeline twice. Accuracy_A vs Accuracy_B on the same questions with the same judge isolates the prompt change as the only variable. That's the minimum useful shape of an eval comparison — swap one thing, rerun, compare numbers.
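
In code, the comparison reduces to calling one pipeline twice. run_agent, judge, and aggregate are the sketches above; the prompt texts here are placeholders, not the lab's exact wording.

```python
PROMPT_A = "Answer in a single short phrase."            # terse original (placeholder)
PROMPT_B = "Answer with one short token or word only."   # tighter variant (placeholder)

def evaluate(system_prompt: str) -> dict:
    """Same dataset, same judge, same aggregation; only the prompt changes."""
    preds = [{**row, "prediction": run_agent(row["question"], system_prompt)} for row in DATASET]
    judgments = [judge(r["question"], r["reference"], r["prediction"]) for r in preds]
    return aggregate(judgments)

results = {"A": evaluate(PROMPT_A), "B": evaluate(PROMPT_B)}
print(results)  # compare accuracy (and invalid_rate) for A vs B on identical questions
```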

Is this the same thing as NeMo Evaluator?

Same shape, smaller footprint. NeMo Evaluator exposes this pipeline as a managed service: upload a dataset, configure evaluators (accuracy, RAGAS, custom judges), let it run your agent and score every row, get metrics back. This lab hand-builds the equivalent in ~100 lines of Python so you understand what Evaluator is actually doing under the hood. Once the concepts click here, reading NeMo Evaluator's docs becomes straightforward — you already know which fields map to which.

Can the same model act as both the agent and the judge?

Technically yes, and this lab uses Nemotron for both — but it's not ideal for production. A judge that's the same family as the system under test can share biases: it may rate answers as correct that it would have generated itself. In a real eval pipeline you want judges from a different model family, or at least a larger, more capable model than the one you're scoring. The harness you build here is model-agnostic — swap the model parameter on the judge call and you've got a different judge.