Memory & Context Management

Why Memory Matters

The Problem: LLMs are stateless -- each API call starts fresh with no memory of previous interactions. Without memory, agents cannot maintain context across conversations.

The Solution: Memory systems give agents the ability to retain conversation history, store important facts, and recall relevant past experiences -- making them more effective over time.

Real Impact: Memory-enabled agents can handle multi-session workflows, personalize responses, and avoid repeating mistakes.

Real-World Analogy

Think of agent memory like human memory systems:

  • Short-Term = Your working memory during a conversation
  • Long-Term = Facts and knowledge stored for later recall
  • Episodic = Memories of specific past experiences and outcomes
  • Context Window = How much you can hold in mind at once
  • Summarization = Creating mental summaries of long events

Memory Types at a Glance

Conversation Buffer

Store the full conversation history. Simple but grows linearly and can exceed context limits.
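A conversation buffer is just an append-only message list; a minimal sketch (class name illustrative):

```python
class BufferMemory:
    """Store the full conversation history, unbounded."""

    def __init__(self):
        self.messages = []

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})

    def get_context(self):
        # Return a copy so callers can't mutate the stored history
        return list(self.messages)
```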

Sliding Window

Keep only the last N messages. Prevents context overflow but loses early context.

Summary Memory

Periodically summarize older messages. Preserves key information while saving tokens.
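A sketch of this pattern, assuming a caller-supplied `summarize` callable (in practice a wrapper around an LLM call; the class and parameter names here are illustrative):

```python
class SummaryMemory:
    """Keep recent messages verbatim; fold older ones into a running summary."""

    def __init__(self, summarize, keep_recent=6):
        self.summarize = summarize  # callable: (summary, old_messages) -> new summary
        self.keep_recent = keep_recent
        self.summary = ""
        self.messages = []

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > self.keep_recent:
            # Fold overflow into the summary, keep the recent window verbatim
            overflow = self.messages[:-self.keep_recent]
            self.messages = self.messages[-self.keep_recent:]
            self.summary = self.summarize(self.summary, overflow)

    def get_context(self):
        context = []
        if self.summary:
            context.append({"role": "system",
                            "content": f"Summary so far: {self.summary}"})
        return context + self.messages
```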

Vector Store Memory

Embed and store all interactions. Retrieve relevant memories via semantic search.

Key Takeaway: Agent memory has three tiers: short-term (conversation history in the context window), long-term (persisted in vector databases or key-value stores), and episodic (records of past task executions for learning). Each tier serves different recall needs and has different cost/latency tradeoffs.

Short-Term Memory

sliding_window.py
class SlidingWindowMemory:
    def __init__(self, max_messages=20, system_prompt="You are a helpful assistant."):
        self.messages = []
        self.max_messages = max_messages
        self.system_prompt = system_prompt

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        # Trim to the most recent max_messages entries
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]

    def get_context(self):
        # The system prompt always stays at the front of the context
        context = [{"role": "system", "content": self.system_prompt}]
        context.extend(self.messages)
        return context

Long-Term Memory

Memory Architecture
[Diagram: the agent's short-term memory (conversation buffer) and working memory (current task context) sync bidirectionally with a long-term store (vector DB + summaries).]
Output (RAG-based memory)
User: "What did we discuss about the Q3 budget?"
Memory search: query="Q3 budget discussion"
  Found 3 relevant memories (cosine similarity > 0.82):
  1. [2024-03-15] "Agreed to increase marketing budget by 15%"
  2. [2024-03-15] "Engineering headcount frozen at current levels"
  3. [2024-03-20] "Q3 budget approved at $2.1M total"
Agent: "In our previous conversations, we discussed..."

Common Mistake

Wrong: Storing entire conversation histories as long-term memory

Why it fails: Raw conversations contain noise (greetings, clarifications, corrections) that pollute retrieval results. Searching through verbose history returns low-quality matches and wastes context window space.

Instead: Summarize conversations into structured facts before storing: extract key decisions, facts, preferences, and action items. Store these as discrete, searchable memory entries with metadata (date, topic, confidence).
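One way to sketch this pipeline (the `extract_facts` callable and metadata fields are illustrative; in practice the extraction is an LLM call prompted to pull out decisions, facts, preferences, and action items):

```python
from datetime import date

def store_conversation_facts(conversation, extract_facts, memory_store):
    """Distill a raw conversation into discrete memory entries before storing.

    extract_facts: callable returning a list of {"type", "text"} dicts.
    memory_store: any object with a .store(text, metadata=...) method.
    """
    facts = extract_facts(conversation)
    for fact in facts:
        memory_store.store(
            fact["text"],
            metadata={
                "type": fact["type"],  # decision / fact / preference / action_item
                "date": date.today().isoformat(),
            },
        )
    return len(facts)
```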

Episodic Memory

vector_memory.py
from chromadb import Client

class VectorMemory:
    def __init__(self):
        self.client = Client()
        self.collection = self.client.get_or_create_collection("agent_memory")

    def store(self, text, metadata=None):
        # Chroma embeds the document automatically with its default embedder
        self.collection.add(
            documents=[text],
            metadatas=[metadata or {}],
            ids=[f"mem_{self.collection.count()}"]  # count() as a simple unique id
        )

    def recall(self, query, n_results=5):
        # Semantic search: return the n_results memories closest to the query
        results = self.collection.query(
            query_texts=[query], n_results=n_results
        )
        return results["documents"][0]
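Episodic entries work best as structured records of a task attempt and its outcome, flattened to text before storage in a vector store like the one above. A minimal sketch with illustrative field names:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Episode:
    """A record of one past task execution, kept for later learning."""
    task: str
    actions: list
    outcome: str   # e.g. "success" or "failure"
    lesson: str = ""  # what to do differently next time
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def episode_to_memory_text(ep):
    """Flatten an episode into a searchable text blob for vector storage."""
    return (f"Task: {ep.task}. Outcome: {ep.outcome}. "
            f"Actions: {'; '.join(ep.actions)}. Lesson: {ep.lesson}")
```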

Context Window Management

Strategy | How It Works | Trade-off
Full History | Keep all messages | Hits token limits quickly
Sliding Window | Keep last N messages | Loses early context
Summarization | Summarize old messages | Lossy, extra LLM cost
RAG Retrieval | Embed and search past messages | Setup complexity
Hybrid | Window + summary + RAG | Most effective but complex

Deep Dive: Context Window Management

When conversations exceed the context window, you must choose what to keep and what to summarize or drop. Common strategies:

  • Sliding window -- keep the last N messages. Simple but loses important early context.
  • Summarization -- periodically summarize older messages into a condensed form. Preserves key info but loses nuance.
  • Selective retention -- keep messages that contain tool results, decisions, or user preferences; drop pleasantries and failed attempts.
  • Hybrid -- summarize old context, keep recent messages verbatim, and inject retrieved memories from long-term storage.
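The hybrid strategy can be sketched as a single context-assembly function (function and parameter names are illustrative):

```python
def build_context(system_prompt, summary, recent_messages, retrieved_memories,
                  max_memories=3):
    """Assemble a hybrid context: summary of old turns, retrieved long-term
    memories, then the recent message window verbatim."""
    context = [{"role": "system", "content": system_prompt}]
    if summary:
        context.append({"role": "system",
                        "content": f"Conversation summary: {summary}"})
    if retrieved_memories:
        bullets = "\n".join(f"- {m}" for m in retrieved_memories[:max_memories])
        context.append({"role": "system",
                        "content": f"Relevant memories:\n{bullets}"})
    context.extend(recent_messages)
    return context
```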

Quick Reference

Memory Type | Storage | Best For
Buffer | In-memory list | Short conversations
Window | Fixed-size list | Long conversations
Summary | Compressed text | Multi-session agents
Vector | Vector database | Knowledge-heavy agents
Entity | Structured store | Relationship tracking
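Entity memory, listed above but not shown earlier, can be sketched as a structured store keyed by entity name (names illustrative):

```python
class EntityMemory:
    """Track facts about named entities (people, projects, tools, etc.)."""

    def __init__(self):
        self.entities = {}  # name -> {attribute: value}

    def update(self, name, attribute, value):
        # New observations overwrite stale attribute values
        self.entities.setdefault(name, {})[attribute] = value

    def lookup(self, name):
        return self.entities.get(name, {})
```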