RAG for Agents

What is RAG?

Why RAG Matters for Agents

The Problem: LLMs have a knowledge cutoff and cannot access private data. They hallucinate when asked about information outside their training data.

The Solution: Retrieval-Augmented Generation (RAG) lets agents search a knowledge base, retrieve relevant documents, and use them as context for generating accurate, grounded answers.

Real Impact: RAG substantially reduces hallucination on questions the model cannot answer from its training data alone, and enables agents to work with proprietary, real-time, and domain-specific information.

Real-World Analogy

Think of RAG like a librarian helping with research:

  • Query = Your research question
  • Embedding = Understanding the meaning of your question
  • Search = Librarian finding relevant books and sections
  • Context = The relevant passages placed on your desk
  • Generate = Writing your answer using those passages

RAG Components

Document Chunking

Split documents into meaningful chunks -- by paragraph, semantic boundaries, or fixed token counts.
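
The fixed-token-count option can be sketched in a few lines. Here is a minimal sliding-window chunker that uses word counts as a rough stand-in for tokens; a production pipeline would use the model's actual tokenizer:

```python
def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into overlapping chunks of roughly chunk_size words.

    Word counts approximate tokens here; swap in a real tokenizer
    (e.g. tiktoken) for accurate sizing.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final chunk already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighboring chunks, so retrieval doesn't lose it.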

Embedding

Convert text chunks into vector representations that capture semantic meaning for similarity search.

Vector Search

Find the most relevant chunks by comparing query embedding against stored document embeddings.
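
Under the hood, most vector stores rank chunks by cosine similarity between embeddings. A dependency-free sketch of that ranking step (illustrative only; a real system would use a vector database index rather than a linear scan):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k document vectors most similar to the query."""
    scored = sorted(
        ((cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)),
        reverse=True,
    )
    return [i for _, i in scored[:k]]
```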

Context Injection

Insert retrieved chunks into the LLM prompt as context, enabling grounded and accurate generation.

Embedding & Indexing

RAG Pipeline
Query (user question) → Embed (vectorize it) → Search (find similar chunks) → Context (top chunks) → Generate (LLM answer)
rag_pipeline.py
from openai import OpenAI
import chromadb

client = OpenAI()
db = chromadb.Client()
collection = db.get_or_create_collection("docs")

# Index documents (Chroma embeds them automatically with its default embedding function)
def index_documents(documents):
    for i, doc in enumerate(documents):
        collection.add(
            documents=[doc["text"]],
            metadatas=[{"source": doc["source"]}],
            ids=[f"doc_{i}"]
        )

# Query with RAG
def rag_query(question, n_results=3):
    # Retrieve relevant docs
    results = collection.query(
        query_texts=[question], n_results=n_results
    )
    context = "\n".join(results["documents"][0])

    # Generate with context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content

Retrieval Strategies

  • Dense Retrieval: semantic search with embeddings. Best for meaning-based queries.
  • Sparse Retrieval: keyword matching (e.g. BM25). Best for exact term matching.
  • Hybrid Search: combine dense and sparse scores. Best of both approaches.
  • Reranking: rescore the top results with a cross-encoder. Best for precision-critical applications.
  • Multi-query: generate multiple query variants. Best for ambiguous user questions.
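
Hybrid search needs a rule for merging the dense and sparse result lists. Reciprocal Rank Fusion (RRF) is a common choice because it works on ranks alone, so the two scoring scales never have to be reconciled. A minimal sketch (k=60 is the constant conventionally used with RRF):

```python
def rrf_fuse(dense_ranking, sparse_ranking, k=60):
    """Merge two rankings (lists of doc ids, best first) with
    Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank).
    """
    scores = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well by both retrievers beats one ranked first by only one of them, which is exactly the behavior hybrid search is after.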

RAG as an Agent Tool

  • Search Tool: Agent decides when to search the knowledge base
  • Multi-source: Agent can search different collections for different topics
  • Iterative Retrieval: Agent can refine its search based on initial results
  • Verification: Agent can cross-reference retrieved info with other tools
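
Exposing retrieval as a tool is mostly a matter of describing it to the model. Below is a hypothetical OpenAI-style function-tool definition wrapping the rag_query function from earlier; the tool name, description, and parameter schema are illustrative, not a fixed API:

```python
# Hypothetical tool spec: an agent loop would pass this in the `tools`
# list and dispatch to rag_query() when the model calls it.
search_kb_tool = {
    "type": "function",
    "function": {
        "name": "search_knowledge_base",
        "description": (
            "Search the internal documentation collection and answer "
            "the question from the retrieved context."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "question": {
                    "type": "string",
                    "description": "Natural-language question to look up",
                },
                "n_results": {
                    "type": "integer",
                    "description": "Number of chunks to retrieve (default 3)",
                },
            },
            "required": ["question"],
        },
    },
}
```

Because the model only sees the description, writing it precisely (what the collection contains, when to use it) is what makes the agent call the tool at the right moments.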

Advanced RAG Patterns

  • Parent-Child: index small chunks but retrieve their parent documents. Improves context coherence.
  • HyDE: generate a hypothetical answer document, then search with its embedding. Better for abstract queries.
  • Contextual Compression: extract only the relevant parts of retrieved chunks. Reduces noise in context.
  • Agentic RAG: the agent controls the retrieval strategy. Adapts to query complexity.
  • Graph RAG: combine knowledge graphs with vector search. Better for relational queries.
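
Of these, HyDE is the easiest to prototype if the LLM, embedder, and vector store are passed in as plain functions. A minimal sketch, assuming nothing about the underlying stack (generate, embed, and search are placeholders you would bind to your own components):

```python
def hyde_retrieve(question, generate, embed, search, n_results=3):
    """HyDE: an answer-shaped passage embeds closer to real answer
    passages than the raw question does, so embed a hypothetical
    answer and search with that instead.

    generate(prompt) -> str, embed(text) -> vector, and
    search(vector, n) -> chunks are caller-supplied placeholders.
    """
    hypothetical_doc = generate(
        f"Write a short passage that plausibly answers: {question}"
    )
    return search(embed(hypothetical_doc), n_results)
```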

Quick Reference

  • Embedding Model: OpenAI, Cohere, BGE, E5. Recommendation: text-embedding-3-small for cost.
  • Vector DB: Pinecone, Chroma, Weaviate, Qdrant. Recommendation: Chroma for local development.
  • Chunk Size: 256-2048 tokens. Recommendation: 512 tokens as a starting point.
  • Overlap: 0-25% of chunk size. Recommendation: 10-20% to maintain context across boundaries.
  • Top-K: 1-20 results. Recommendation: 3-5 for most use cases.