Model Routing & Cost Cascade with NIM
Hosted
Beta

Model Routing & Cost Cascade with NIM

Save 60–80% on inference by cascading queries through cheap → mid → expensive NIM models. Measure real costs via NIM's usage.cost field and compare against an always-large baseline.

25 min · 4 steps · 3 domains · Intermediate · ncp-aai

What you'll learn

  1. Call three tiers and measure cost
    Most production agents spend money on queries the model didn't need to be big for. *"What's 2+2?"* runs on an 8B model as well as on a 253B one — but the 253B call costs ~50× more. A cascade flips that economics: try the cheap model first, and only pay for the expensive one when the cheap one isn't confident.
  2. Have the model self-rate its confidence
    To decide *when to escalate*, you need a cheap-to-compute signal that the small model is uncertain. The easiest one: ask the model itself for a confidence score alongside its answer, as structured JSON (see the sketch right after this list).
  3. Build the cascade
    Now glue the two pieces together. cascade(question) walks the tiers, stopping as soon as it gets a high-confidence answer.
  4. Measure savings vs always-large
    The cascade's whole point is cost. You'll now prove it works by running both strategies over a mixed-difficulty dataset.

Prerequisites

  • Completed `react-agent-nim` or comparable NIM exposure
  • Comfort reading/writing small JSON payloads
  • Basic idea of LLM tokens and pricing

Exam domains covered

Efficiency and Scalability · Agent Development · NVIDIA Platform Implementation

Skills & technologies you'll practice

This intermediate-level AI/ML lab gives you real-world reps across:

Cost Optimization · Nemotron · Cascade · NIM · Model Routing

What you'll build in this cost-cascade routing lab

Model routing and cost cascades are the single highest-leverage optimization for most production LLM apps — real teams cut inference spend 60–80% on balanced workloads by serving the easy queries from a small model and reserving the expensive tier for genuinely hard cases. This lab builds a three-tier confidence-aware cascade against NVIDIA NIM endpoints we provision, measures real token-based cost per tier, and benchmarks the cascade against an always-large baseline on a mixed-difficulty question set. You walk away with a working cascade(question) function, cost numbers you can quote, and the mental model for when confidence-based routing wins versus always-use-the-best-model.

The substance is confidence-driven routing. The small tier — meta/llama-3.1-8b-instruct — handles factual recall and classification. The mid tier — meta/llama-3.3-70b-instruct — absorbs moderate reasoning. The large tier — Llama-3.1 253B-scale — is the reserve for long-chain reasoning and edge cases. You prompt each tier to return structured JSON like {"answer": "<short>", "confidence": 1-5}, walk small → mid → large, stopping when confidence clears a threshold, and — critically — charge the cascade for every tier it invoked, not just the winning one. You'll see why self-reported confidence is surprisingly well-calibrated on factual tasks, why the main failure mode is overconfident wrong answers (and how LLM-as-judge on a sample catches them), and why this primitive generalizes to cross-provider cascades, capability-based routing, and NeMo Agent Toolkit function-group workflows.
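
A minimal sketch of that walk, reusing the ask_with_confidence helper from the preview above; the tier list, the escalate-below-4 threshold, and the large-tier model ID are assumptions to swap for whatever the lab environment actually exposes.

```python
# Tiers in cheap-to-expensive order, each paired with the confidence its answer
# must reach to be accepted. The final threshold is 0 so the large tier always wins.
TIERS = [
    ("meta/llama-3.1-8b-instruct", 4),
    ("meta/llama-3.3-70b-instruct", 4),
    ("llama-3.1-253b-large", 0),  # placeholder ID for the 253B-scale tier
]

def cascade(question: str) -> dict:
    total_cost = 0.0
    for model, threshold in TIERS:
        answer, confidence, cost = ask_with_confidence(model, question)
        total_cost += cost  # charge the cascade for every tier it invoked
        if confidence >= threshold:
            return {"answer": answer, "tier": model,
                    "confidence": confidence, "cost": total_cost}
```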

Prerequisites: Python, comfort with JSON payloads, a basic feel for token-based LLM pricing, and prior NIM exposure (the react-agent-nim lab works). The hosted environment ships with the OpenAI Python SDK pointed at our managed NIM proxy — all three tiers share the same OpenAI-compatible endpoint, so swapping between them is a model= string change. No GPU provisioning. About 25 minutes of focused work. You leave with per-tier dollar costs on real token counts, a clean structured-output confidence signal, a cascade that short-circuits on high confidence, and a side-by-side savings report against always-large — the kind of number you can walk into a finance review with.
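
Step 4's savings report then reduces to something like the sketch below; the three questions stand in for the lab's mixed-difficulty dataset.

```python
# Run both strategies over the same questions and compare spend.
QUESTIONS = [
    "What's 2+2?",
    "Name the capital of Australia.",
    "Walk through why the sum of two odd integers is always even.",
]

large_model = TIERS[-1][0]
cascade_cost = sum(cascade(q)["cost"] for q in QUESTIONS)
baseline_cost = sum(ask_with_confidence(large_model, q)[2] for q in QUESTIONS)

print(f"cascade ${cascade_cost:.4f} vs always-large ${baseline_cost:.4f} "
      f"-> {1 - cascade_cost / baseline_cost:.0%} saved")
```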

Frequently asked questions

Why use a cascade router instead of always calling the best model?

Because most production queries don't need the best model. On a balanced workload — factual lookups, simple classification, short explanations mixed with occasional hard reasoning — the largest model is overkill on the majority of requests. A 3-tier cascade typically spends 60–80% less on the same traffic because it serves the easy queries from the 8B and the reasoning-heavy queries from the 253B. The cascade never reduces quality on hard queries (those escalate to the large tier anyway); it just stops overpaying on easy ones.

How does confidence-based routing actually work?

You prompt the model to return structured output of the form {"answer": "<short>", "confidence": 1-5} where 5 means "I'm certain" and 1 means "I'd guess." The cascade reads the confidence field and decides whether to accept the tier's answer or escalate. Self-reported confidence isn't perfect — models can be overconfident — but it's well-calibrated enough on factual tasks to route 60–80% of queries correctly, and it costs only one extra field in the response.

Why does the cascade's cost count tiers that didn't answer?

Because they still ran. If the small tier answers with low confidence, the mid tier runs next; if that escalates too, the large tier runs and finally answers — and the cascade paid for all three calls. A fair comparison against always-large has to charge the cascade for every tier it invoked, not just the winning one. Step 3 tracks cumulative cost across the whole walk, and Step 4's savings percentage is only honest because of that accounting.

What if the small model is overconfident and returns a wrong answer with confidence 5?

That's the cascade's main failure mode. Two mitigations: (a) tune the confidence threshold — in this lab you escalate below 4, but you can make it stricter; (b) add a secondary check like an LLM-as-judge on a sample of small-tier decisions. In production you'd also keep a ground-truth eval set and periodically re-measure small-tier accuracy so overconfidence drift gets caught. The lab's Step 4 dataset deliberately mixes easy and hard questions so you see the overconfidence edge cases.
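
If you want the judge-on-a-sample check in code, one minimal way to wire it is sketched below, reusing the helpers above. It assumes you keep a log of accepted small-tier responses as dicts with "question" and "answer" keys; the judge model and sample size are arbitrary choices.

```python
import random

def spot_check(accepted_log, judge_model="meta/llama-3.3-70b-instruct", k=20):
    """Re-grade a random sample of small-tier accepts with a larger model as judge."""
    flagged = []
    for entry in random.sample(accepted_log, min(k, len(accepted_log))):
        verdict, _, _ = ask_with_confidence(
            judge_model,
            f'Question: {entry["question"]}\n'
            f'Proposed answer: {entry["answer"]}\n'
            'Is the proposed answer correct? Reply with "yes" or "no" as the answer.',
        )
        if verdict.strip().lower().startswith("no"):
            flagged.append(entry)  # candidates for threshold tuning or re-routing
    return flagged
```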

Which NIM models does this lab cascade across?

Three tiers served through http://nim-proxy.labs.svc:8080/v1: small is meta/llama-3.1-8b-instruct, mid is meta/llama-3.3-70b-instruct, and large is a Llama-3.1 253B-scale tier. They all share the OpenAI-compatible chat completions surface, so switching tiers is a model= string change and nothing else. The per-token pricing table in Step 1 is simplified for the lab, but it's structured so you can drop in real NIM catalog numbers without changing any code.

Does this pattern generalize beyond just two or three model sizes?

Yes. The same logic works for N tiers, for cross-provider cascades (a cheap self-hosted model before a premium API), and for capability-based routing rather than size-based (a code-specialized model before a general one). NeMo Agent Toolkit supports this natively via function groups and workflow-level routing, and many production agent stacks layer cascades on top of an LLM router so cost optimization happens before the agent sees the request. What you build in this lab is the core pattern everything else specializes.