Agent Patterns: ReAct vs Tool Calling vs Plan-and-Execute

Hosted · Beta

Build the same SaaS customer support agent three different ways — ReAct, direct tool calling, and plan-and-execute — then compare them on speed, reasoning quality, and reliability to learn when to use each pattern in production.

35 min · 6 steps · 3 domains · Intermediate · NCP-AAI

What you'll learn

  1. The Support Tools
    All three agent patterns will use the same tools — this is critical for a fair comparison. The tools represent real SaaS support operations:
  2. Pattern 1: ReAct Agent
    ReAct is the most widely used agent pattern. The LLM follows a loop:
  3. Pattern 2: Direct Tool Calling
    Direct tool calling is simpler than ReAct. Instead of an explicit Thought/Action/Observation loop, you:
  4. Pattern 3: Plan-and-Execute (ReWOO)
    ReWOO (Reasoning WithOut Observation) flips the ReAct pattern on its head. Instead of reasoning after each tool call, the LLM creates a complete plan upfront, then executes all steps in batch, and finally synthesizes the results.
  5. Benchmark: Comparing the Patterns
    You now have three agents that solve the same problem differently. Let's run them on identical queries and measure what matters in production: speed, cost (LLM calls), and answer quality.
  6. Decision Framework: When to Use Which Pattern
    You've built and benchmarked all three patterns. Now let's formalize the decision into a reusable framework — the kind of analysis the NCP-AAI exam expects.

Prerequisites

  • Completed Lab 1 (ReAct Agent with NIM) or equivalent
  • Completed Lab 2 (RAG Pipeline with NIM) — needed for Milvus vector search
  • Understanding of tool calling (bind_tools, ToolMessage)
  • Basic LangGraph knowledge (StateGraph, nodes, edges)

Exam domains covered

  • Agent Architecture and Design
  • Cognition, Planning, and Memory
  • Agent Development

What you'll build in this agent patterns lab

Picking the right agent pattern is the difference between a 400ms support bot and a 12-second one that burns three times the tokens per query — and it's one of the most consequential architecture decisions an LLM app team makes. This lab hands you the same SaaS support agent built three ways — ReAct, direct tool calling, and ReWOO plan-and-execute — plus a live benchmark that surfaces the trade-offs in wall-clock latency, LLM call count, and answer quality. You walk away with working implementations of all three reasoning strategies, a clear mental model of when each wins, and a decision framework you can point to in a design review. The whole thing runs on NVIDIA NIM endpoints we provision, so there's no key management and no GPU pod — just code.

Pattern 1 is a canonical ReAct agent driven by a LangGraph StateGraph — Thought, Action, Observation, loop — the right default when steps depend on each other and the plan has to adapt mid-execution. Pattern 2 is direct tool calling via llm.bind_tools(...): no reasoning trace, the LLM emits tool calls until it returns plain text, and it dominates on single-hop queries. Pattern 3 is ReWOO — a planner commits to the full tool sequence upfront, an executor runs them in batch (parallel where possible), and a synthesizer merges observations into the answer; the win is on independent lookups that can fan out. All three share one tool layer — check_account(email), search_kb(query) backed by Milvus similarity search, create_ticket(subject, body) — so the benchmark variable is the reasoning loop, not the environment.
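To make the shared tool layer and the direct tool-calling loop concrete, here is a dependency-free sketch. The tool names (check_account, search_kb, create_ticket) come from the lab; the return values, the scripted stand-in LLM, and the message shapes are illustrative assumptions, not the lab's actual code.

```python
# Illustrative sketch, NOT the lab's implementation: the shared tool layer as
# plain functions, plus a minimal direct tool-calling loop driven by a
# scripted stand-in for the LLM.

def check_account(email: str) -> dict:
    # Stand-in for the lab's account-lookup tool.
    return {"email": email, "plan": "Pro", "seats_used": 3}

def search_kb(query: str) -> list:
    # Stand-in for the Milvus-backed knowledge-base search.
    return [f"KB article matching: {query}"]

def create_ticket(subject: str, body: str) -> str:
    # Stand-in for ticket creation.
    return "TICKET-1001"

TOOLS = {"check_account": check_account, "search_kb": search_kb,
         "create_ticket": create_ticket}

def scripted_llm(messages: list) -> dict:
    """Stand-in for the model: request one lookup, then answer in plain text."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "check_account",
                              "args": {"email": "alice@startup.io"}}}
    account = messages[-1]["content"]
    return {"text": f"The account is on the {account['plan']} plan."}

def run_tool_calling_agent(query: str) -> str:
    # Direct tool calling: no reasoning trace — the LLM emits tool calls
    # until it returns plain text, and the loop dispatches each call.
    messages = [{"role": "user", "content": query}]
    while True:
        reply = scripted_llm(messages)
        if "text" in reply:
            return reply["text"]
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["args"])
        messages.append({"role": "tool", "content": result})
```

In the lab itself the dispatch loop is handled by LangGraph and `llm.bind_tools(...)`; the sketch only shows the control flow those pieces implement.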

Prerequisites: Python, prior exposure to a ReAct agent, and a rough feel for LangGraph — @tool, bind_tools, ToolMessage, StateGraph. The hosted environment ships with LangGraph, langchain-nvidia-ai-endpoints, pymilvus, and the benchmark harness preinstalled; every call hits our managed NIM proxy running meta/llama-3.3-70b-instruct. About 35 minutes of focused work, finishing with a benchmark run that prints ReAct's LLM-call count, tool-calling's single-hop latency win, and ReWOO's parallel-step speedup side by side — the kind of number you quote in an architecture doc.
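The benchmark harness's shape can be sketched as follows — this is an assumed, simplified version, with toy agents standing in for the real ones so the two measured quantities (wall-clock latency and LLM call count) are visible.

```python
# Illustrative benchmark harness (assumed shape, not the lab's exact code):
# each agent run records wall-clock latency and the number of LLM calls.
import time

def benchmark(agent_fn, queries):
    results = []
    for query in queries:
        calls = {"llm": 0}
        start = time.perf_counter()
        answer = agent_fn(query, calls)
        elapsed = time.perf_counter() - start
        results.append({"query": query, "answer": answer,
                        "latency_s": elapsed, "llm_calls": calls["llm"]})
    return results

# Toy agents that differ only in how many LLM round-trips they spend.
def tool_calling_agent(query, calls):
    calls["llm"] += 2          # one call to pick the tool, one to answer
    return "answer"

def react_agent(query, calls):
    calls["llm"] += 4          # thought, action, observation, final thought
    return "answer"

rows = benchmark(react_agent, ["What plan is alice@startup.io on?"])
```

The real harness runs all three agents over the same query set and prints the per-pattern numbers side by side.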

Frequently asked questions

Isn't ReAct strictly better than direct tool calling?

No — ReAct trades latency and tokens for reasoning transparency and adaptive planning, and that trade is only worth it when the extra reasoning pays off. A query like 'What plan is alice@startup.io on?' needs one tool call and no reasoning; ReAct will still emit a Thought explaining that it needs to look up the account, then an Action, then an Observation, then a Thought summarizing the answer. That's four LLM round-trips to answer a question tool-calling solves in two. Save ReAct for queries that actually branch, like 'if she's on Pro, tell her about the Enterprise upgrade path; if she's on Enterprise, check her seat usage.'

What's the difference between ReWOO and plan-and-execute?

ReWOO (Reasoning WithOut Observation) is the specific plan-and-execute variant the lab uses. Its signature is that the planner commits to every tool call upfront without seeing any observations, the executor runs them in batch, and the synthesizer produces the final answer in one more LLM call. Classical plan-and-execute lets the planner re-plan between execution phases based on observations, which is more robust but slower. ReWOO wins on independent tool calls that can run in parallel — the three-calls-in-2x-wall-clock speedup is real — and loses when a later step's input depends on an earlier step's output.
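The execute phase's fan-out can be shown with a small sketch — the thread pool, the fake tool, and the 0.1 s latency are all illustrative assumptions; only the "plan committed before any observation" structure comes from ReWOO itself.

```python
# Sketch of the ReWOO execute phase (illustrative, not the lab's code): the
# planner's step list is fixed upfront, so independent steps can run in
# parallel; a final synthesizer call would then merge the observations.
from concurrent.futures import ThreadPoolExecutor
import time

def slow_tool(name: str) -> str:
    time.sleep(0.1)            # simulate per-tool latency
    return f"observation from {name}"

# The plan is committed before any observation is seen — ReWOO's signature.
plan = ["check_account", "search_kb", "search_kb"]

def execute(plan):
    with ThreadPoolExecutor() as pool:
        return list(pool.map(slow_tool, plan))

start = time.perf_counter()
observations = execute(plan)
elapsed = time.perf_counter() - start
# Three 0.1 s tools complete in roughly one tool's latency, not three —
# this is the parallel-step speedup the benchmark surfaces.
```

Classical plan-and-execute would re-enter the planner between phases instead of running the whole list in one batch.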

Why benchmark all three on the same tools?

Because pattern comparisons only matter if the confounds are controlled. If the ReAct agent has a richer knowledge base than the tool-calling agent, you've measured the knowledge base, not the pattern. The lab's check_account, search_kb, and create_ticket are a single module imported by every agent — identical behavior, identical latency, identical Milvus index. The differences you see in the benchmark trace back to the agent's decision-making loop, which is the whole point of the comparison.

How do I score 'answer quality' fairly across patterns?

The lab uses LLM-as-judge: a separate grader LLM receives the query, the gold answer, and the candidate answer, and returns a 0–5 score plus reasoning. That's imperfect but reproducible, and it's standard practice for agent evaluation. For production you'd combine LLM-as-judge with deterministic checks (did the agent call create_ticket for ticket-creation queries? did it cite the right KB article?) and some human review on a sample. The lab shows the pattern so you can swap in your own rubric.
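The deterministic-check side of that rubric can be sketched as trace inspection — the trace format (a list of recorded tool calls) is an assumption for illustration, not the lab's actual data structure.

```python
# Illustrative deterministic checks to pair with LLM-as-judge (the trace
# format below is assumed): inspect the recorded tool-call trace and the
# answer text rather than trusting prose alone.

def called_create_ticket(trace: list) -> bool:
    # Did the agent actually create a ticket, for ticket-creation queries?
    return any(step["tool"] == "create_ticket" for step in trace)

def cited_article(answer: str, expected_article: str) -> bool:
    # Did the answer reference the expected KB article?
    return expected_article in answer

trace = [
    {"tool": "check_account", "args": {"email": "alice@startup.io"}},
    {"tool": "create_ticket", "args": {"subject": "Upgrade request"}},
]
```

Checks like these are cheap, reproducible, and catch failures an LLM judge can miss (an answer that reads well but never touched the ticketing tool).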

Which pattern should I pick for a production customer-support agent?

Usually direct tool calling, with ReAct as a fallback for complex branching queries. Real support traffic is dominated by single-hop intents — 'what's my plan', 'reset my API key', 'when does my trial end' — and tool calling answers those in one LLM round-trip. Route the ~5% of multi-hop, branching, conditional queries to ReAct; route the rare 'audit everything about this account' queries (many independent lookups) to ReWOO. The decision-framework step formalizes exactly this routing logic.

Can I mix patterns inside one agent?

Yes, and it is often the right answer: use Pattern 2 (direct tool calling) as the outer loop, fall back to Pattern 1 (ReAct) when the query classifier flags a request as 'needs multi-step reasoning', and invoke Pattern 3 (ReWOO) as a subroutine when you detect a burst of independent sub-queries. That's essentially what 'agentic' systems in production look like under the hood. The lab's final decision framework is deliberately structured so you could implement exactly that routing on top of the three implementations you just built.