Agent Memory & Persistence
Build a sales intelligence assistant that remembers — short-term conversation state with LangGraph checkpointer, long-term facts in Milvus, and reflection loops that auto-extract knowledge. Learn the memory architecture every production agent needs.
What you'll learn
1. The Problem: A Stateless Agent. The default LangGraph agent is stateless: each agent.invoke() call starts fresh, with no knowledge of what was said before. For a sales assistant, that means Monday's client details are gone by the next call.
2. Short-Term Memory with Checkpointer. The fix for step 1's problem is a checkpointer, a component that saves the graph state after each step so the next invoke can resume where you left off. Combined with a thread_id, you get per-conversation memory.
3. Long-Term Memory with Milvus. The checkpointer solves *within*-conversation memory. But what about across conversations? Milvus stores durable, user-scoped facts that any future thread can retrieve.
4. Memory-Aware Agent. Now we wire the memory tools from step 3 into an agent, so it can call save_memory and search_memory mid-conversation.
5. Reflection: Auto-Extracting Facts. Relying on the agent to call save_memory at the right moment is unreliable: sometimes it remembers to save, sometimes not. In production, the best pattern is reflection, a separate step that reads recent conversation turns and extracts durable facts automatically.
6. Putting It Together: The Sales Assistant. You now have every piece of a production-grade memory architecture; the final step assembles them into one system.
Prerequisites
- Completed Lab 1 (ReAct Agent with NIM) or equivalent
- Completed Lab 2 (RAG Pipeline with NIM) — we reuse Milvus for long-term memory
- Basic LangGraph knowledge (create_agent, StateGraph, thread_id)
Exam domains covered
Skills & technologies you'll practice
This intermediate-level AI/ML lab gives you real-world reps across LangGraph state management, vector-store memory with Milvus, and reflection-based fact extraction.
What you'll build in this agent memory lab
Agent memory is the feature that turns a chatbot into an assistant — and every production LLM app hits the wall the moment users complain 'I told it about Acme on Monday; by Friday it has no idea who Acme is.' This lab assembles the two-tier memory architecture every serious agent team ships: short-term state via a LangGraph MemorySaver checkpointer keyed by thread_id, long-term semantic memory backed by Milvus, and a reflection loop that auto-extracts durable facts so the agent doesn't have to remember to save them. You finish with a sales-intelligence assistant that passes the Monday-call / Friday-recall test across fresh threads, running on NVIDIA NIM endpoints we provision.
The substance is the split between short-term (thread-scoped, exact message history, cheap) and long-term (user-scoped, semantic, permanent, cross-session). You start by proving the stateless baseline fails, then add a checkpointer keyed by thread_id so conversation state survives multiple invoke() calls. Long-term memory is save_memory(fact, category) writing embedded records into Milvus and search_memory(query, top_k) doing cosine similarity over NVIDIA NIM text-embedding vectors. The reflection loop — a separate LLM pass that reads recent turns and extracts anchor facts — is what makes it robust: relying on the agent to call save_memory mid-conversation is brittle because the model under user-facing pressure forgets side-channel tool calls about as often as not. This is the exact pattern real products ship.
Prerequisites: Python, the react-agent-nim and rag-pipeline-nim labs (you reuse Milvus), and comfort with create_agent, StateGraph, and thread_id. The hosted environment ships with LangGraph, the LangChain NIM integration, and pymilvus (Milvus Lite) preinstalled, running against our managed NIM proxy serving meta/llama-3.3-70b-instruct and the NVIDIA text-embedding model — no keys, no GPU provisioning. About 35 minutes of focused work. You leave with a stateless baseline that demonstrates the failure mode, a thread-scoped agent that recalls within a conversation, a Milvus-backed long-term store, a reflection pass that extracts durable facts, and a composed assistant that holds the whole thing together.
Frequently asked questions
Why use two different memory stores instead of one?
Because they answer different questions. The checkpointer holds thread-scoped, exact message history — cheap, and scoped to one conversation. Milvus holds user-scoped, semantic, permanent facts that survive across sessions. Collapsing them into one store forces a bad trade: either replay entire conversation histories into every prompt, or lose the exact short-term context the agent needs mid-conversation.
What is thread_id in LangGraph and how does it relate to users?
thread_id is the key the checkpointer uses to load and save state for one conversation. Same thread_id = same conversation history; new thread_id = fresh context. It is not the same as a user ID: a single user typically has many threads (one per conversation, ticket, or session), and long-term memory is keyed by user across all of them. The lab wires both: the checkpointer scopes by thread_id, and save_memory stores facts in Milvus under the user's id, so Friday's brand-new thread can still retrieve Monday's notes.
Why is reflection better than asking the LLM to call save_memory itself?
Mid-conversation, save_memory is a non-load-bearing step that the model skips about as often as not. Reflection decouples the save from the chat: after each turn, a separate LLM pass reads the most recent few exchanges with a prompt like 'extract durable facts about clients, deals, or preferences mentioned above' and writes them to Milvus without the main agent ever having to notice. It costs more latency and tokens but is dramatically more reliable.
How do I prevent long-term memory from filling with garbage?
Constrain what gets written: the reflection prompt asks only for durable facts, and save_memory takes a category argument specifically so you can filter by type at query time.
Why Milvus instead of pgvector or Chroma?
Mostly continuity: this lab reuses the Milvus setup from the rag-pipeline-nim lab, and the hosted environment ships pymilvus with Milvus Lite preinstalled, so there is no extra service to run. The memory pattern itself is store-agnostic; pgvector or Chroma would work the same way.
What exactly does the reflection step write?
{fact, category} objects extracted from the last few turns by a structured-output LLM call. Typical categories the lab prompts for are client_context ('Sarah is VP at Acme'), deal_state ('Acme is evaluating the Pro tier'), objection ('Sarah is worried about integration time'), and preference ('Rep prefers SOC 2 one-pagers for security-sensitive buyers'). Each fact becomes a Milvus row with the category as metadata, which lets search_memory filter by type ('only pull objections for this account before the call') in addition to semantic match.
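A reflection pass along those lines can be sketched as follows. The prompt wording, the `reflect` helper, and the stubbed `fake_llm` are all illustrative assumptions, not the lab's exact code; the real version would send the prompt to the NIM chat endpoint, ideally with structured output enforced:

```python
import json

# Hypothetical reflection prompt; the lab's actual wording may differ.
REFLECTION_PROMPT = """Extract durable facts about clients, deals, or preferences
from the conversation below. Reply with a JSON list of
{{"fact": "...", "category": "..."}} objects, where category is one of:
client_context, deal_state, objection, preference. Reply with [] if nothing
durable was mentioned.

Conversation:
{transcript}"""

ALLOWED = {"client_context", "deal_state", "objection", "preference"}

def reflect(recent_turns, llm_call):
    """One reflection pass: format the last few turns, ask the model for
    {fact, category} records, and keep only well-formed results."""
    transcript = "\n".join(f"{role}: {text}" for role, text in recent_turns)
    raw = llm_call(REFLECTION_PROMPT.format(transcript=transcript))
    return [f for f in json.loads(raw)
            if f.get("fact") and f.get("category") in ALLOWED]

# Stubbed model so the sketch runs offline; real code calls the NIM endpoint.
def fake_llm(prompt: str) -> str:
    return '[{"fact": "Sarah is worried about integration time", "category": "objection"}]'

facts = reflect([("user", "Sarah keeps pushing back on integration time"),
                 ("ai", "Noted; I'll surface the migration docs.")], fake_llm)
print(facts)  # [{'fact': 'Sarah is worried about integration time', 'category': 'objection'}]
```

Each surviving record then goes through save_memory, so the category lands in Milvus as filterable metadata alongside the embedded fact.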