Question 1

Why use a cascade router instead of always calling the best model?

Accepted Answer

Because most production queries don't need the best model. On a balanced workload — factual lookups, simple classification, short explanations mixed with occasional hard reasoning — the largest model is overkill on the majority of requests. A 3-tier cascade typically spends 60–80% less on the same traffic because it serves the easy queries from the 8B and the reasoning-heavy queries from the 253B. The cascade never *reduces* quality on hard queries (those escalate to the large tier anyway); it just stops overpaying on easy ones.

Question 2

How does confidence-based routing actually work?

Accepted Answer

You prompt the model to return structured output of the form `{"answer": "<short>", "confidence": 1-5}` where 5 means "I'm certain" and 1 means "I'd guess." The cascade reads the confidence field and decides whether to accept the tier's answer or escalate. Self-reported confidence isn't perfect — models can be overconfident — but it's well-calibrated enough on factual tasks to route 60–80% of queries correctly, and it costs only one extra field in the response.

Question 3

Why does the cascade's cost count tiers that didn't answer?

Accepted Answer

Because they still ran. If the small tier fires, then the mid-tier fires because confidence was low, then the large tier fires and finally answers — the cascade paid for all three calls. A fair comparison against always-large has to charge the cascade for every tier it invoked, not just the winning one. Step 3 tracks cumulative cost across the whole walk, and Step 4's savings percentage is only honest because of that accounting.

Question 4

What if the small model is overconfident and returns a wrong answer with confidence 5?

Accepted Answer

That's the cascade's main failure mode. Two mitigations: (a) tune the confidence threshold — in this lab you escalate below 4, but you can make it stricter; (b) add a secondary check like an LLM-as-judge on a sample of small-tier decisions. In production you'd also keep a ground-truth eval set and periodically re-measure small-tier accuracy so overconfidence drift gets caught. The lab's Step 4 dataset deliberately mixes easy and hard questions so you see the overconfidence edge cases.

Question 5

Which NIM models does this lab cascade across?

Accepted Answer

Three tiers served through `http://nim-proxy.labs.svc:8080/v1`: small is `meta/llama-3.1-8b-instruct`, mid is `meta/llama-3.3-70b-instruct`, and large is a Llama-3.1 253B-scale tier. They all share the OpenAI-compatible chat completions surface, so switching tiers is a `model=` string change and nothing else. The per-token pricing table in Step 1 is simplified for the lab but structured so you can replace it with real NIM catalog numbers unchanged.

Question 6

Does this pattern generalize beyond just two or three model sizes?

Accepted Answer

Yes. The same logic works for N tiers, for cross-provider cascades (a cheap self-hosted model before a premium API), and for capability-based routing rather than size-based (a code-specialized model before a general one). NeMo Agent Toolkit supports this natively via function groups and workflow-level routing, and many production agent stacks layer cascades on top of an LLM router so cost optimization happens before the agent sees the request. What you build in this lab is the core pattern everything else specializes.

Model Routing & Cost Cascade with NIM

What you'll learn

Prerequisites

Exam domains covered

Skills & technologies you'll practice

What you'll build in this cost-cascade routing lab

Frequently asked questions