Model Routing & Cost Cascade with NIM
Save 60–80% on inference by cascading queries through cheap → mid → expensive NIM models. Measure real costs from per-call token usage and compare against an always-large baseline.
What you'll learn
1. **Call three tiers and measure cost.** Most production agents spend money on queries the model didn't need to be big for. *"What's 2+2?"* runs on an 8B model as well as on a 253B one — but the 253B call costs ~50× more. A cascade flips that economics: try the cheap model first, and only pay for the expensive one when the cheap one isn't confident.
2. **Have the model self-rate its confidence.** To decide *when to escalate*, you need a cheap-to-compute signal that the small model is uncertain. The easiest one: ask the model itself for a confidence score alongside its answer, as structured JSON.
3. **Build the cascade.** Now glue the two pieces together: cascade(question) walks the tiers, stopping as soon as it gets a high-confidence answer.
4. **Measure savings vs. always-large.** The cascade's whole point is cost. You'll prove it works by running both strategies over a mixed-difficulty dataset.
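The cost measurement in step 1 can be sketched as a small helper. The per-million-token prices below are hypothetical placeholders (the lab's Step 1 table is simplified, and the `large-253b` key is an illustrative name, not a real model ID); the token counts are the ones a chat completion returns in its `usage` field.

```python
# Simplified per-million-token prices. These numbers and the "large-253b"
# key are illustrative placeholders, not real NIM catalog rates.
PRICE_PER_M_TOKENS = {
    "meta/llama-3.1-8b-instruct":  {"in": 0.05, "out": 0.08},
    "meta/llama-3.3-70b-instruct": {"in": 0.30, "out": 0.50},
    "large-253b":                  {"in": 2.50, "out": 4.00},
}

def call_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Dollar cost of one call, from the token counts in response.usage."""
    p = PRICE_PER_M_TOKENS[model]
    return (prompt_tokens * p["in"] + completion_tokens * p["out"]) / 1_000_000

# A 200-token-in / 50-token-out call on each tier:
cheap = call_cost("meta/llama-3.1-8b-instruct", 200, 50)
pricey = call_cost("large-253b", 200, 50)
```

With these placeholder prices the same call is roughly 45× more expensive on the large tier, which is the gap the cascade exploits.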
Prerequisites
- Completed `react-agent-nim` or comparable NIM exposure
- Comfort reading/writing small JSON payloads
- Basic idea of LLM tokens and pricing
Skills & technologies you'll practice
This intermediate-level AI/ML lab gives you real-world reps across model routing, token-based cost measurement, structured-output prompting, and confidence-based escalation.
What you'll build in this cost-cascade routing lab
Model routing and cost cascades are the single highest-leverage optimisation on most production LLM apps — real teams cut inference spend 60–80% on balanced workloads by serving the easy queries from a small model and reserving the expensive tier for genuinely hard cases. This lab builds a three-tier confidence-aware cascade against NVIDIA NIM endpoints we provision, measures real token-based cost per tier, and benchmarks the cascade against an always-large baseline on a mixed-difficulty question set. You walk away with a working cascade(question) function, cost numbers you can quote, and the mental model for when confidence-based routing wins versus always-use-the-best-model.
The substance is confidence-driven routing. The small tier — meta/llama-3.1-8b-instruct — handles factual recall and classification. The mid tier — meta/llama-3.3-70b-instruct — absorbs moderate reasoning. The large tier — Llama-3.1 253B-scale — is the reserve for long-chain reasoning and edge cases. You prompt each tier to return structured JSON like {"answer": "<short>", "confidence": 1-5}, walk small → mid → large stopping when confidence clears a threshold, and — critically — charge the cascade for every tier it invoked, not just the winning one. You'll see why self-reported confidence is surprisingly well-calibrated on factual tasks, why the main failure mode is overconfident wrong answers (and how LLM-as-judge on a sample catches them), and why this primitive generalises to cross-provider cascades, capability-based routing, and NeMo Agent Toolkit function-group workflows.
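The walk described above can be sketched as a short loop. This is a minimal sketch, not the lab's exact implementation: `ask_tier` is a hypothetical stand-in for the real NIM chat call (it returns the model's JSON reply plus that call's cost), and the threshold of 4 is an assumed setting. Note that the total charges every tier invoked, not just the winning one.

```python
import json

TIERS = ["meta/llama-3.1-8b-instruct", "meta/llama-3.3-70b-instruct", "large-253b"]
CONFIDENCE_THRESHOLD = 4  # assumed: accept answers self-rated 4 or 5 out of 5

def cascade(question, ask_tier, threshold=CONFIDENCE_THRESHOLD):
    """Walk small -> mid -> large, stopping at the first confident answer.

    ask_tier(model, question) is a stand-in for the real NIM call; it must
    return (json_text, cost) where json_text looks like
    {"answer": "...", "confidence": 1-5}.  Total cost counts EVERY tier
    invoked, not just the one whose answer is accepted.
    """
    total_cost = 0.0
    for model in TIERS:
        raw, cost = ask_tier(model, question)
        total_cost += cost
        reply = json.loads(raw)
        # The last tier always answers: there is nowhere left to escalate.
        if reply["confidence"] >= threshold or model == TIERS[-1]:
            return reply["answer"], total_cost

# Stubbed run: the small tier is unsure, the mid tier is confident.
def fake_ask(model, question):
    if model == TIERS[0]:
        return '{"answer": "maybe 42", "confidence": 2}', 0.00001
    return '{"answer": "42", "confidence": 5}', 0.00020

answer, spent = cascade("What is 6 x 7?", fake_ask)
# answer == "42"; spent includes BOTH the small and mid calls
```

Taking the real call as a function argument keeps the routing logic testable without network access, which is also how you would benchmark cascade-vs-always-large offline.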
Prerequisites: Python, comfort with JSON payloads, a basic feel for token-based LLM pricing, and prior NIM exposure (the react-agent-nim lab works). The hosted environment ships with the OpenAI Python SDK pointed at our managed NIM proxy — all three tiers share the same OpenAI-compatible endpoint, so swapping between them is a model= string change. No GPU provisioning. About 25 minutes of focused work. You leave with per-tier dollar costs on real token counts, a clean structured-output confidence signal, a cascade that short-circuits on high confidence, and a side-by-side savings report against always-large — the kind of number you can walk into a finance review with.
Frequently asked questions
Why use a cascade router instead of always calling the best model?
How does confidence-based routing actually work?
{"answer": "<short>", "confidence": 1-5} where 5 means "I'm certain" and 1 means "I'd guess." The cascade reads the confidence field and decides whether to accept the tier's answer or escalate. Self-reported confidence isn't perfect — models can be overconfident — but it's well-calibrated enough on factual tasks to route 60–80% of queries correctly, and it costs only one extra field in the response.Why does the cascade's cost count tiers that didn't answer?
What if the small model is overconfident and returns a wrong answer with confidence 5?
Which NIM models does this lab cascade across?
All three tiers are served behind a single managed proxy at http://nim-proxy.labs.svc:8080/v1: small is meta/llama-3.1-8b-instruct, mid is meta/llama-3.3-70b-instruct, and large is a Llama-3.1 253B-scale tier. They all share the OpenAI-compatible chat completions surface, so switching tiers is a model= string change and nothing else. The per-token pricing table in Step 1 is simplified for the lab but structured so you can replace it with real NIM catalog numbers unchanged.
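Because the tiers share one endpoint, the only per-tier difference in a request is the model string. A minimal sketch (the kwargs below would be passed to `client.chat.completions.create()` on an OpenAI-SDK client whose `base_url` is the lab proxy above; client construction and auth are omitted here):

```python
# Requests to different tiers are identical except for the "model" field.
# These kwargs are what you would pass to client.chat.completions.create()
# on a client pointed at http://nim-proxy.labs.svc:8080/v1.
def tier_request(model: str, question: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
    }

small = tier_request("meta/llama-3.1-8b-instruct", "What's 2+2?")
mid = tier_request("meta/llama-3.3-70b-instruct", "What's 2+2?")
# small and mid differ only in their "model" value
```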