Agent Patterns: ReAct vs Tool Calling vs Plan-and-Execute
Build the same SaaS customer support agent three different ways — ReAct, direct tool calling, and plan-and-execute — then compare them on speed, reasoning quality, and reliability to learn when to use each pattern in production.
What you'll learn
- 1. The Support Tools. All three agent patterns use the same tools, which is critical for a fair comparison. The tools represent real SaaS support operations.
- 2. Pattern 1: ReAct Agent. ReAct is the most widely used agent pattern: the LLM follows a Thought/Action/Observation loop, reasoning after every tool call.
- 3. Pattern 2: Direct Tool Calling. Simpler than ReAct: instead of an explicit Thought/Action/Observation loop, the LLM emits tool calls until it returns plain text as its final answer.
- 4. Pattern 3: Plan-and-Execute (ReWOO). ReWOO (Reasoning WithOut Observation) flips the ReAct pattern on its head: instead of reasoning after each tool call, the LLM creates a complete plan upfront, executes all steps in batch, and finally synthesizes the results.
- 5. Benchmark: Comparing the Patterns. Run the three agents on identical queries and measure what matters in production: speed, cost (LLM calls), and answer quality.
- 6. Decision Framework: When to Use Which Pattern. Formalize the benchmarked comparison into a reusable framework — the kind of analysis the NCP-AAI exam expects.
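To give a sense of the shared tool layer's shape, here is a stdlib-only sketch. These are hypothetical stubs: in the lab, search_kb runs a Milvus similarity search and check_account hits a real account store, while here everything is in-memory so the sketch runs anywhere.

```python
# Hypothetical stub versions of the lab's three support tools. The real
# search_kb is backed by Milvus; here it is a keyword match over an
# in-memory list so the sketch is self-contained.

ACCOUNTS = {"dana@example.com": {"plan": "pro", "status": "active", "seats": 12}}
KB = [
    {"id": "KB-101", "title": "Resetting your password", "text": "Use the reset link on the login page."},
    {"id": "KB-204", "title": "Exporting billing invoices", "text": "Invoices live under Settings > Billing."},
]
_ticket_counter = 0

def check_account(email: str) -> dict:
    """Look up a customer's account record by email."""
    return ACCOUNTS.get(email, {"error": "account not found"})

def search_kb(query: str) -> list:
    """Return KB articles whose title or body mentions any query term."""
    terms = query.lower().split()
    return [a for a in KB if any(t in (a["title"] + a["text"]).lower() for t in terms)]

def create_ticket(subject: str, body: str) -> dict:
    """Open a support ticket and return its id."""
    global _ticket_counter
    _ticket_counter += 1
    return {"ticket_id": f"T-{_ticket_counter:04d}", "subject": subject}

# One registry imported by every agent, so the benchmark variable is the
# reasoning loop, not the tool environment.
TOOLS = {"check_account": check_account, "search_kb": search_kb, "create_ticket": create_ticket}
```

Because every agent imports the same registry, any latency or quality difference in the benchmark is attributable to the reasoning pattern, not the tools.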
Prerequisites
- Completed Lab 1 (ReAct Agent with NIM) or equivalent
- Completed Lab 2 (RAG Pipeline with NIM) — needed for Milvus vector search
- Understanding of tool calling (bind_tools, ToolMessage)
- Basic LangGraph knowledge (StateGraph, nodes, edges)
Exam domains covered
What you'll build in this agent patterns lab
Picking the right agent pattern is the difference between a 400ms support bot and a 12-second one that burns three times the tokens per query — and it's one of the most consequential architecture decisions an LLM app team makes. This lab hands you the same SaaS support agent built three ways — ReAct, direct tool calling, and ReWOO plan-and-execute — plus a live benchmark that surfaces the trade-offs in wall-clock latency, LLM call count, and answer quality. You walk away with working implementations of all three reasoning strategies, a clear mental model of when each wins, and a decision framework you can point at in a design review. The whole thing runs on NVIDIA NIM endpoints we provision, so there's no key management, no GPU pod, just code.
Pattern 1 is a canonical ReAct agent driven by a LangGraph StateGraph — Thought, Action, Observation, loop — the right default when steps depend on each other and the plan has to adapt mid-execution. Pattern 2 is direct tool calling via llm.bind_tools(...): no reasoning trace, the LLM emits tool calls until it returns plain text, and it dominates on single-hop queries. Pattern 3 is ReWOO — a planner commits to the full tool sequence upfront, an executor runs them in batch (parallel where possible), and a synthesizer merges observations into the answer; the win is on independent lookups that can fan out. All three share one tool layer — check_account(email), search_kb(query) backed by Milvus similarity search, create_ticket(subject, body) — so the benchmark variable is the reasoning loop, not the environment.
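The ReAct loop described above can be sketched without any framework. The `scripted_llm` below is a stand-in for the model (in the lab this loop is a LangGraph StateGraph calling meta/llama-3.3-70b-instruct); the stub tools and transcript format are illustrative, but the control flow is the Thought/Action/Observation loop itself.

```python
# Framework-free sketch of the ReAct loop with a scripted stand-in for
# the LLM. Tools and transcript format are hypothetical simplifications.

def check_account(email):                      # stub tool
    return {"email": email, "plan": "pro", "status": "active"}

def search_kb(query):                          # stub tool
    return [{"id": "KB-204", "title": "Exporting billing invoices"}]

TOOLS = {"check_account": check_account, "search_kb": search_kb}

def scripted_llm(transcript):
    """Stand-in for the model: picks the next step from the transcript."""
    if "check_account" not in transcript:
        return {"thought": "I need the account first.",
                "action": ("check_account", "dana@example.com")}
    if "search_kb" not in transcript:
        return {"thought": "Account is active; find the billing article.",
                "action": ("search_kb", "billing invoices")}
    return {"thought": "I have everything I need.",
            "final": "Your account is active; see KB-204 for invoice export."}

def react_agent(question, max_steps=5):
    transcript, llm_calls = f"Question: {question}", 0
    for _ in range(max_steps):
        step = scripted_llm(transcript)
        llm_calls += 1
        if "final" in step:                    # model answered in plain text
            return step["final"], llm_calls
        name, arg = step["action"]             # otherwise: run the tool,
        obs = TOOLS[name](arg)                 # append the observation,
        transcript += (f"\nThought: {step['thought']}"
                       f"\nAction: {name}({arg!r})\nObservation: {obs}")
    return "Step limit reached.", llm_calls

answer, calls = react_agent("Why can't I export my invoices?")
```

Direct tool calling deletes the Thought field from this loop (the model just emits tool calls until it returns text), and ReWOO replaces the per-step calls with one planning call up front — which is exactly where the LLM-call counts in the benchmark come from.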
Prerequisites: Python, prior exposure to a ReAct agent, and a rough feel for LangGraph — @tool, bind_tools, ToolMessage, StateGraph. The hosted environment ships with LangGraph, langchain-nvidia-ai-endpoints, pymilvus, and the benchmark harness preinstalled; every call hits our managed NIM proxy running meta/llama-3.3-70b-instruct. About 35 minutes of focused work, finishing with a benchmark run that prints ReAct's LLM-call count, tool-calling's single-hop latency win, and ReWOO's parallel-step speedup side by side — the kind of number you quote in an architecture doc.
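The benchmark harness has a simple shape: time each agent on the same queries and count its LLM calls. A minimal sketch, with hypothetical stand-in agents that each return an `(answer, llm_calls)` pair:

```python
# Sketch of the benchmark harness shape: wall-clock latency plus LLM-call
# count per (agent, query) pair. Agents and queries are stand-ins.
import time

def bench(agents, queries):
    rows = []
    for name, agent in agents.items():
        for q in queries:
            t0 = time.perf_counter()
            answer, llm_calls = agent(q)       # each agent returns (answer, call count)
            rows.append({"agent": name, "query": q,
                         "latency_s": time.perf_counter() - t0,
                         "llm_calls": llm_calls})
    return rows

# Usage with trivial stand-in agents:
agents = {
    "tool_calling": lambda q: ("answer", 2),   # single-hop: one tool call + final
    "react": lambda q: ("answer", 4),          # thought/observe loop, more calls
}
rows = bench(agents, ["Why can't I export invoices?"])
```

Answer quality is the one column this sketch omits; the lab layers a scoring rubric on top of the same rows.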
Frequently asked questions
Isn't ReAct strictly better than direct tool calling?
No. Direct tool calling wins on single-hop queries: fewer LLM calls, lower latency, fewer tokens. ReAct earns its extra calls only when steps depend on each other and the plan has to adapt mid-execution.
What's the difference between ReWOO and plan-and-execute?
ReWOO is a specific plan-and-execute variant: the planner commits to the full tool sequence without waiting for intermediate observations, which lets the executor run independent steps in parallel before a synthesizer merges the results.
Why benchmark all three on the same tools?
check_account, search_kb, and create_ticket are a single module imported by every agent: identical behaviour, identical latency, identical Milvus index. The differences you see in the benchmark trace back to the agent's decision-making loop, which is the whole point of the comparison.
How do I score 'answer quality' fairly across patterns?
A mix of programmatic checks (did the agent call create_ticket for ticket-creation queries? did it cite the right KB article?) and human review on a sample. The lab shows the pattern so you can swap in your own rubric.
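A minimal sketch of such a rubric, assuming each benchmark run records which tools were called and the final answer text (the field names here are illustrative, not the lab's actual schema):

```python
# Hypothetical programmatic quality checks over a recorded benchmark run.
# Each check is a boolean; the score is the fraction that pass.

def score_run(run, expectations):
    checks = {
        # Did the agent invoke the tool this query requires?
        "called_required_tool": expectations["required_tool"] in run["tools_called"],
        # Did the final answer cite the expected KB article id?
        "cited_right_article": expectations["kb_id"] in run["answer"],
    }
    return sum(checks.values()) / len(checks), checks

run = {"tools_called": ["search_kb", "create_ticket"],
       "answer": "Opened ticket T-0042; see KB-204 for invoice export."}
score, checks = score_run(run, {"required_tool": "create_ticket", "kb_id": "KB-204"})
```

Because the rubric is just a dict of named booleans, swapping in your own criteria (tone checks, length limits, human labels) means adding entries, not rewriting the harness.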