Memory & Context Management

Why Memory Matters

The Problem: LLMs are stateless -- each API call starts fresh with no memory of previous interactions. Without memory, agents cannot maintain context across conversations.

The Solution: Memory systems give agents the ability to retain conversation history, store important facts, and recall relevant past experiences -- making them more effective over time.

Real Impact: Memory-enabled agents can handle multi-session workflows, personalize responses, and avoid repeating mistakes.

Real-World Analogy

Think of agent memory like human memory systems:

  • Short-Term = Your working memory during a conversation
  • Long-Term = Facts and knowledge stored for later recall
  • Episodic = Memories of specific past experiences and outcomes
  • Context Window = How much you can hold in mind at once
  • Summarization = Creating mental summaries of long events

Memory Types at a Glance

Conversation Buffer

Store the full conversation history. Simple but grows linearly and can exceed context limits.
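A conversation buffer is just an append-only message list; a minimal sketch (class name illustrative):

```python
class BufferMemory:
    """Store the full conversation history, unbounded."""

    def __init__(self):
        self.messages = []

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})

    def get_context(self):
        # Return a copy so callers can't mutate the stored history
        return list(self.messages)
```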

Sliding Window

Keep only the last N messages. Prevents context overflow but loses early context.

Summary Memory

Periodically summarize older messages. Preserves key information while saving tokens.
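A sketch of this pattern, assuming a caller-supplied `summarize` callable (in practice a wrapper around an LLM call; the class and parameter names here are illustrative):

```python
class SummaryMemory:
    """Keep recent messages verbatim; fold older ones into a running summary."""

    def __init__(self, summarize, keep_recent=6):
        self.summarize = summarize  # callable: (summary, old_messages) -> new summary
        self.keep_recent = keep_recent
        self.summary = ""
        self.messages = []

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > self.keep_recent:
            # Fold overflow into the summary, keep the recent window verbatim
            overflow = self.messages[:-self.keep_recent]
            self.messages = self.messages[-self.keep_recent:]
            self.summary = self.summarize(self.summary, overflow)

    def get_context(self):
        context = []
        if self.summary:
            context.append({"role": "system",
                            "content": f"Summary so far: {self.summary}"})
        return context + self.messages
```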

Vector Store Memory

Embed and store all interactions. Retrieve relevant memories via semantic search.

Key Takeaway: Agent memory has three tiers: short-term (conversation history in the context window), long-term (persisted in vector databases or key-value stores), and episodic (records of past task executions for learning). Each tier serves different recall needs and has different cost/latency tradeoffs.

Short-Term Memory

sliding_window.py
class SlidingWindowMemory:
    def __init__(self, max_messages=20, system_prompt="You are a helpful assistant."):
        self.messages = []
        self.max_messages = max_messages
        self.system_prompt = system_prompt

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        # Trim to the most recent max_messages entries
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]

    def get_context(self):
        # The system prompt always stays at the front of the context
        context = [{"role": "system", "content": self.system_prompt}]
        context.extend(self.messages)
        return context

Long-Term Memory

Memory Architecture
[Diagram: the agent's short-term memory (conversation buffer) and working memory (current task context) sync bidirectionally with a long-term store (vector DB + summaries).]
Output (RAG-based memory)
User: "What did we discuss about the Q3 budget?"
Memory search: query="Q3 budget discussion"
  Found 3 relevant memories (cosine similarity > 0.82):
  1. [2024-03-15] "Agreed to increase marketing budget by 15%"
  2. [2024-03-15] "Engineering headcount frozen at current levels"
  3. [2024-03-20] "Q3 budget approved at $2.1M total"
Agent: "In our previous conversations, we discussed..."

Common Mistake

Wrong: Storing entire conversation histories as long-term memory

Why it fails: Raw conversations contain noise (greetings, clarifications, corrections) that pollute retrieval results. Searching through verbose history returns low-quality matches and wastes context window space.

Instead: Summarize conversations into structured facts before storing: extract key decisions, facts, preferences, and action items. Store these as discrete, searchable memory entries with metadata (date, topic, confidence).
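One way to sketch this pipeline (the `extract_facts` callable and metadata fields are illustrative; in practice the extraction is an LLM call prompted to pull out decisions, facts, preferences, and action items):

```python
from datetime import date

def store_conversation_facts(conversation, extract_facts, memory_store):
    """Distill a raw conversation into discrete memory entries before storing.

    extract_facts: callable returning a list of {"type", "text"} dicts.
    memory_store: any object with a .store(text, metadata=...) method.
    """
    facts = extract_facts(conversation)
    for fact in facts:
        memory_store.store(
            fact["text"],
            metadata={
                "type": fact["type"],  # decision / fact / preference / action_item
                "date": date.today().isoformat(),
            },
        )
    return len(facts)
```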

Episodic Memory

vector_memory.py
from chromadb import Client

class VectorMemory:
    def __init__(self):
        self.client = Client()
        self.collection = self.client.get_or_create_collection("agent_memory")

    def store(self, text, metadata=None):
        # Chroma embeds the document automatically with its default embedder
        self.collection.add(
            documents=[text],
            metadatas=[metadata or {}],
            ids=[f"mem_{self.collection.count()}"]  # count() as a simple unique id
        )

    def recall(self, query, n_results=5):
        # Semantic search: return the n_results memories closest to the query
        results = self.collection.query(
            query_texts=[query], n_results=n_results
        )
        return results["documents"][0]
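Episodic entries work best as structured records of a task attempt and its outcome, flattened to text before storage in a vector store like the one above. A minimal sketch with illustrative field names:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Episode:
    """A record of one past task execution, kept for later learning."""
    task: str
    actions: list
    outcome: str   # e.g. "success" or "failure"
    lesson: str = ""  # what to do differently next time
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def episode_to_memory_text(ep):
    """Flatten an episode into a searchable text blob for vector storage."""
    return (f"Task: {ep.task}. Outcome: {ep.outcome}. "
            f"Actions: {'; '.join(ep.actions)}. Lesson: {ep.lesson}")
```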

Context Window Management

Strategy | How It Works | Trade-off
Full History | Keep all messages | Hits token limits quickly
Sliding Window | Keep last N messages | Loses early context
Summarization | Summarize old messages | Lossy, extra LLM cost
RAG Retrieval | Embed and search past messages | Setup complexity
Hybrid | Window + summary + RAG | Most effective but complex

Deep Dive: Context Window Management

When conversations exceed the context window, you must choose what to keep and what to summarize or drop. Common strategies:

  • Sliding window -- keep the last N messages. Simple but loses important early context.
  • Summarization -- periodically summarize older messages into a condensed form. Preserves key info but loses nuance.
  • Selective retention -- keep messages that contain tool results, decisions, or user preferences; drop pleasantries and failed attempts.
  • Hybrid -- summarize old context, keep recent messages verbatim, and inject retrieved memories from long-term storage.
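The hybrid strategy can be sketched as a single context-assembly function (function and parameter names are illustrative):

```python
def build_context(system_prompt, summary, recent_messages, retrieved_memories,
                  max_memories=3):
    """Assemble a hybrid context: summary of old turns, retrieved long-term
    memories, then the recent message window verbatim."""
    context = [{"role": "system", "content": system_prompt}]
    if summary:
        context.append({"role": "system",
                        "content": f"Conversation summary: {summary}"})
    if retrieved_memories:
        bullets = "\n".join(f"- {m}" for m in retrieved_memories[:max_memories])
        context.append({"role": "system",
                        "content": f"Relevant memories:\n{bullets}"})
    context.extend(recent_messages)
    return context
```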

Quick Reference

Memory Type | Storage | Best For
Buffer | In-memory list | Short conversations
Window | Fixed-size list | Long conversations
Summary | Compressed text | Multi-session agents
Vector | Vector database | Knowledge-heavy agents
Entity | Structured store | Relationship tracking
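Entity memory, listed above but not shown earlier, can be sketched as a structured store keyed by entity name (names illustrative):

```python
class EntityMemory:
    """Track facts about named entities (people, projects, tools, etc.)."""

    def __init__(self):
        self.entities = {}  # name -> {attribute: value}

    def update(self, name, attribute, value):
        # New observations overwrite stale attribute values
        self.entities.setdefault(name, {})[attribute] = value

    def lookup(self, name):
        return self.entities.get(name, {})
```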