Structured Output & Function Calling with NIM

Get reliable machine-parseable data out of an LLM. Compare prompt-only JSON extraction against the function-calling API, chain two tools, and measure the reliability gap on a real extraction task.

30 min · 4 steps · 3 domains · Intermediate · ncp-aai

What you'll learn

  1. Prompt-only JSON extraction (baseline)
    The easiest way to get structured data out of an LLM is to ask for JSON in the prompt. It mostly works — but *mostly* isn't good enough for production, and you'll quantify why in step 4. (A baseline sketch follows this list.)
  2. Function calling with a JSON Schema
    Instead of asking the model to write JSON in its reply, you tell it: *"here is a function, here is its schema, call it with the right arguments."* The model then returns a tool_call with arguments already validated against your schema — no regex, no markdown-fence stripping.
  3. Chain two tools with validation
    Real agents use more than one tool per turn. A classifier decides *what kind of message this is*, then a specialized extractor does the extraction that fits. This lets you handle multiple schemas without cramming them into one.
  4. Measure prompt-only vs tools reliability
    Prompt-only JSON extraction works until the message is messy. Tools mode keeps working because the schema is enforced at the API boundary.
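
For orientation, here is a minimal sketch of that step-1 baseline, assuming the OpenAI Python SDK pointed at an OpenAI-compatible NIM endpoint. The base URL is a placeholder (the hosted environment pre-configures the real proxy), and the prompt wording is illustrative:

```python
# Prompt-only baseline: ask for JSON in the prompt, then hope it parses.
# Base URL and prompt wording are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def extract_contact_prompt_only(message: str) -> dict | None:
    resp = client.chat.completions.create(
        model="meta/llama-3.3-70b-instruct",
        messages=[
            {"role": "system", "content": 'Reply with ONLY a JSON object with '
             'keys "name", "email", "phone". No prose, no markdown.'},
            {"role": "user", "content": message},
        ],
    )
    try:
        return json.loads(resp.choices[0].message.content)
    except json.JSONDecodeError:
        return None  # fences, trailing commas, and stray prose all land here
```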

Prerequisites

  • Completed `react-agent-nim` or comparable NIM exposure
  • Basic Python (functions, dataclasses or dicts)
  • Familiarity with JSON Schema

Exam domains covered

Agent Development · Tool Calling · NVIDIA Platform Implementation

Skills & technologies you'll practice

This intermediate-level AI/ML lab gives you real-world reps across:

Function Calling · Tool Use · JSON Schema · NIM · Structured Output

What you'll build in this function-calling lab

Function calling is what separates LLM demoware from production — the moment you need machine-readable output that validates on the first try, prompting the model for JSON stops being good enough. This lab builds the same contact-and-invoice extraction pipeline two ways and measures the reliability gap: prompt-only JSON extraction as the naive baseline, versus schema-enforced function calling via the tools parameter on NVIDIA NIM endpoints we provision. You walk away with a structured-output pattern you can drop into any LangChain agent, intuition for when prompt mode is fine and when it actively fails, and a harness that gives you a concrete valid-vs-broken count on your own data.

The technical substance is the distinction between asking nicely for JSON and enforcing a schema at the API boundary. You define a save_contact tool with a formal JSON Schema, pass it via the tools=[...] parameter, set tool_choice to force the call, and read message.tool_calls[0].function.arguments — already valid against your schema, no regex, no markdown-fence stripping. You then compose two tools — a classifier that returns "contact" | "invoice" and a specialized save_invoice with its own schema — and see why a router-plus-specialist pipeline is cleaner than one fat union schema. The final step runs both strategies over intentionally awkward inputs (trailing commas, apostrophes in names, emoji, multi-line bodies) and surfaces the reliability gap quantitatively.
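
A sketch of that tools-mode flow, under the same client assumptions as the baseline sketch above (the contact schema fields are illustrative):

```python
# Schema-enforced path: define save_contact as a tool, force the call,
# and read the arguments straight off the tool_calls list.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

save_contact = {
    "type": "function",
    "function": {
        "name": "save_contact",
        "description": "Persist a contact extracted from a message.",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "email": {"type": "string"},
                "phone": {"type": "string"},
            },
            "required": ["name", "email"],
        },
    },
}

resp = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Add Ada Lovelace, ada@example.com"}],
    tools=[save_contact],
    # Force this specific tool so the model can't answer in prose.
    tool_choice={"type": "function", "function": {"name": "save_contact"}},
)

call = resp.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)  # arrives already schema-shaped
print(call.function.name, args)
```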

Prerequisites: Python with dicts or dataclasses, prior exposure to a NIM-backed agent (the react-agent-nim lab is the natural entry point), and a rough mental model of JSON Schema. The hosted environment ships with the OpenAI Python SDK pointed at our managed NIM proxy serving meta/llama-3.3-70b-instruct — same OpenAI-compatible tool_calls surface you'd use against the real API, no keys, no GPU provisioning. About 30 minutes of focused work, ending with a message-by-message report of which extractions parsed cleanly against the schema and which didn't — the same accounting real teams run before picking the default extraction mode for their pipelines.

Frequently asked questions

What's the difference between a tool call and a function call in the OpenAI schema?

They're the same concept; the schema is just layered. A chat completion that uses tools returns message.tool_calls, a list of objects each containing a type: "function" wrapper and a function object with name and arguments (a JSON string). "Function calling" is the older name — the API was originally one function per call — and "tool calling" is the current name that generalizes to multiple callable tools per turn. NIM's OpenAI-compatible endpoints expose both fields, and this lab uses the modern tool_calls shape throughout.
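
In Python terms, the returned assistant message looks roughly like this (the id and argument values are illustrative):

```python
# Approximate shape of a tool-calling response message. Note that
# function.arguments is a JSON *string*, not a parsed dict.
message = {
    "role": "assistant",
    "content": None,              # no prose when a tool call is emitted
    "tool_calls": [
        {
            "id": "call_abc123",  # illustrative id
            "type": "function",   # the wrapper type
            "function": {
                "name": "save_contact",
                "arguments": '{"name": "Ada Lovelace", "email": "ada@example.com"}',
            },
        }
    ],
}
```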

Why is function calling more reliable than prompting for JSON?

Because the schema is enforced on the server, not on the model's prose instincts. When you pass tools=[...] with a JSON Schema, the endpoint constrains generation to produce arguments that match the schema — required fields are present, types are correct, the output parses. Prompt-only extraction depends on the model having seen enough JSON in training to emit valid JSON for your specific shape, and it breaks on edge cases: trailing commas, unescaped quotes in names, markdown code fences, etc. Step 4 measures the gap concretely on a noisy test set.

What does tool_choice control and when should I set it?

tool_choice tells the endpoint how aggressively to call tools. "auto" (the default) lets the model decide whether to emit a tool call or a text reply. "required" forces a tool call. {"type": "function", "function": {"name": "save_contact"}} forces a specific tool. Use "auto" for agents that may or may not need the tool. Use "required" when your extraction pipeline must produce structured output, as in Step 2 — you don't want the model to chat back "sure, here's the info" in prose.
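
As request parameters, the three settings look like this (model, messages, and tool definitions omitted):

```python
# The three tool_choice settings, as passed to
# client.chat.completions.create(..., tools=[...], tool_choice=...).
tool_choice_auto = "auto"          # default: model may reply in prose instead
tool_choice_required = "required"  # must emit some tool call
tool_choice_forced = {             # must call this specific tool (Step 2)
    "type": "function",
    "function": {"name": "save_contact"},
}
```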

Why classify first, then pick a schema, instead of using one giant schema with all fields?

One fat schema with union types forces the model to reason about record type and field extraction in a single shot, and it makes your validation harder — you end up checking "if type == invoice, these fields should be present; if type == contact, these other fields." A two-step pipeline (classifier → specialized extractor) keeps each call focused on one job, lets you evolve each schema independently, and maps cleanly onto the router → specialist tool pattern real agents use. Step 3 walks you through both tools and the routing shim.
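
A sketch of that router-plus-specialist shape, with deliberately small illustrative schemas and the same placeholder client as the earlier sketches:

```python
# Step-3 pipeline sketch: a classifier tool routes each message to the
# specialist extractor that fits. Schemas and helper names are illustrative.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MODEL = "meta/llama-3.3-70b-instruct"

def tool(name, props, required):
    """Build a minimal function-tool definition."""
    return {"type": "function", "function": {"name": name, "parameters": {
        "type": "object", "properties": props, "required": required}}}

classify_message = tool("classify_message",
    {"kind": {"type": "string", "enum": ["contact", "invoice"]}}, ["kind"])
save_contact = tool("save_contact",
    {"name": {"type": "string"}, "email": {"type": "string"}}, ["name", "email"])
save_invoice = tool("save_invoice",
    {"number": {"type": "string"}, "total": {"type": "number"}}, ["number", "total"])

def call_tool(message, t):
    """Force one specific tool and return its parsed arguments."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": message}],
        tools=[t],
        tool_choice={"type": "function", "function": {"name": t["function"]["name"]}},
    )
    return json.loads(resp.choices[0].message.tool_calls[0].function.arguments)

def extract(message):
    kind = call_tool(message, classify_message)["kind"]      # router
    specialist = save_contact if kind == "contact" else save_invoice
    return kind, call_tool(message, specialist)              # specialist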

What counts as a "valid" extraction in Step 4's comparison?

One that parses cleanly into a dict and contains every required field for the record type. An extraction is broken if json.loads fails, if a required field is missing or null, or if a type doesn't match (e.g., phone came back as a number instead of a string). The Step 4 harness counts valid vs broken for both the prompt-only path and the tools path over the same messy input set. The expected outcome is that tools mode maintains a near-perfect valid rate while prompt-only degrades visibly on the awkward inputs.
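
A sketch of a validity check along those lines (the required-field tables are illustrative):

```python
# Step-4-style validity check: a record counts as valid only if it parsed
# and every required field is present with the right type.
import json

REQUIRED = {
    "contact": {"name": str, "email": str, "phone": str},
    "invoice": {"number": str, "total": (int, float)},
}

def is_valid(raw, kind: str) -> bool:
    if isinstance(raw, str):          # prompt-only path returns raw text
        try:
            raw = json.loads(raw)
        except json.JSONDecodeError:
            return False              # broken: didn't parse at all
    if not isinstance(raw, dict):
        return False
    # Broken if a required field is missing, null, or the wrong type.
    return all(
        field in raw and isinstance(raw[field], types)
        for field, types in REQUIRED[kind].items()
    )

# Tally valid vs broken per strategy over the same messy input set, e.g.:
#   valid_prompt = sum(is_valid(run_prompt_only(m), k) for m, k in messy_set)
#   valid_tools  = sum(is_valid(run_tools(m), k) for m, k in messy_set)
```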

Does every NIM model support function calling?

Most modern chat-completion NIMs do — the Llama 3.3 70B Instruct and Nemotron reasoning families used in these labs all accept tools and return tool_calls. But not every model in the NIM catalog exposes function calling: some older vision-language models (for example meta-llama/llama-3.2-11b-vision-instruct) will return "404 No endpoints found that support tool use" when you pass tools. The vlm-visual-qa lab explores that split explicitly; here, the contact-extraction models are all on the supported list.