What is RAG?
Why RAG Matters for Agents
The Problem: LLMs have a knowledge cutoff and cannot access private data. They hallucinate when asked about information outside their training data.
The Solution: Retrieval-Augmented Generation (RAG) lets agents search a knowledge base, retrieve relevant documents, and use them as context for generating accurate, grounded answers.
Real Impact: By grounding answers in retrieved documents, RAG can substantially reduce hallucination and enables agents to work with proprietary, real-time, and domain-specific information.
Real-World Analogy
Think of RAG like a librarian helping with research:
- Query = Your research question
- Embedding = Understanding the meaning of your question
- Search = Librarian finding relevant books and sections
- Context = The relevant passages placed on your desk
- Generate = Writing your answer using those passages
RAG Components
Document Chunking
Split documents into meaningful chunks: by paragraph, by semantic boundaries, or by fixed token counts.
Embedding
Convert text chunks into vector representations that capture semantic meaning for similarity search.
Vector Search
Find the most relevant chunks by comparing query embedding against stored document embeddings.
Context Injection
Insert retrieved chunks into the LLM prompt as context, enabling grounded and accurate generation.
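The chunking step above can be sketched as a simple fixed-size splitter with overlap. This is a minimal sketch: it splits on whitespace for brevity (production pipelines usually count tokens with a tokenizer such as tiktoken), and `chunk_text` is a hypothetical helper, not a library function.

```python
def chunk_text(text, chunk_size=512, overlap=64):
    """Split text into word-based chunks of ~chunk_size words,
    with `overlap` words shared between consecutive chunks."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # how far each chunk's start advances
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already covers the end of the document
    return chunks

# Toy demo: 8 words, chunks of 4 with an overlap of 2
demo = chunk_text("a b c d e f g h", chunk_size=4, overlap=2)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from both sides of the cut.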
Embedding & Indexing
```python
from openai import OpenAI
import chromadb

client = OpenAI()
db = chromadb.Client()
collection = db.get_or_create_collection("docs")

# Index documents
def index_documents(documents):
    for i, doc in enumerate(documents):
        collection.add(
            documents=[doc["text"]],
            metadatas=[{"source": doc["source"]}],
            ids=[f"doc_{i}"],
        )

# Query with RAG
def rag_query(question, n_results=3):
    # Retrieve relevant docs
    results = collection.query(
        query_texts=[question], n_results=n_results
    )
    context = "\n".join(results["documents"][0])
    # Generate with context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```
Retrieval Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| Dense Retrieval | Semantic search with embeddings | Meaning-based queries |
| Sparse Retrieval | Keyword matching (BM25) | Exact term matching |
| Hybrid Search | Combine dense + sparse scores | Best of both approaches |
| Reranking | Score top results with cross-encoder | Precision-critical applications |
| Multi-query | Generate multiple query variants | Ambiguous user questions |
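Hybrid search needs a way to merge the dense and sparse result lists into one ranking. A common choice is reciprocal rank fusion (RRF); the sketch below assumes each retriever returns an ordered list of document ids (the ids `d1`–`d4` are illustrative).

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one ranking.
    Each doc scores 1/(k + rank + 1) per list it appears in; k damps
    the influence of any single retriever's top position."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d2"]   # semantic (embedding) ranking
sparse = ["d1", "d4", "d3"]  # keyword (BM25) ranking
fused = reciprocal_rank_fusion([dense, sparse])
```

Documents that rank well in both lists (here `d1` and `d3`) rise to the top, which is exactly the "best of both" behavior the table describes.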
RAG as an Agent Tool
- Search Tool: Agent decides when to search the knowledge base
- Multi-source: Agent can search different collections for different topics
- Iterative Retrieval: Agent can refine its search based on initial results
- Verification: Agent can cross-reference retrieved info with other tools
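As a concrete sketch, the knowledge-base search can be exposed to the model through the OpenAI tools (function-calling) schema. The tool name `search_docs` and its parameters are illustrative choices, not a fixed API:

```python
# Illustrative tool definition: the agent decides when to call it,
# and the application runs the vector search when it does.
search_docs_tool = {
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Search the internal knowledge base for passages relevant to a query.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Natural-language search query",
                },
                "n_results": {
                    "type": "integer",
                    "description": "Number of passages to return",
                },
            },
            "required": ["query"],
        },
    },
}
```

The agent loop passes `tools=[search_docs_tool]` to the chat completion call; when the model emits a tool call, the application runs the retrieval and returns the passages to the model as a tool message.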
Advanced RAG Patterns
| Pattern | Description | Improvement |
|---|---|---|
| Parent-Child | Index small chunks, retrieve parent docs | Better context coherence |
| HyDE | Generate hypothetical doc, then search | Better for abstract queries |
| Contextual Compression | Extract only relevant parts of chunks | Reduce noise in context |
| Agentic RAG | Agent controls retrieval strategy | Adaptive to query complexity |
| Graph RAG | Use knowledge graphs + vectors | Better for relational queries |
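A minimal sketch of the Parent-Child pattern from the table: small chunks are indexed so matching stays precise, but a match resolves to its full parent document for coherent context. The word-based chunking and helper names are illustrative.

```python
def build_parent_child_index(documents, chunk_size=3):
    """Map each small chunk back to its parent document id.
    `documents` is {doc_id: full_text}; chunking is word-based for brevity."""
    chunk_to_parent = {}
    for doc_id, text in documents.items():
        words = text.split()
        for start in range(0, len(words), chunk_size):
            chunk = " ".join(words[start:start + chunk_size])
            chunk_to_parent[chunk] = doc_id
    return chunk_to_parent

def retrieve_parent(chunk_to_parent, documents, matched_chunk):
    """Given the small chunk that matched the query, return the whole parent doc."""
    return documents[chunk_to_parent[matched_chunk]]

docs = {"doc1": "alpha beta gamma delta epsilon zeta"}
index = build_parent_child_index(docs, chunk_size=3)
parent = retrieve_parent(index, docs, "delta epsilon zeta")
```

Even though only the second half of `doc1` matched, the generator receives the full document.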
Quick Reference
| Component | Options | Recommendation |
|---|---|---|
| Embedding Model | OpenAI, Cohere, BGE, E5 | text-embedding-3-small for cost |
| Vector DB | Pinecone, Chroma, Weaviate, Qdrant | Chroma for local dev |
| Chunk Size | 256-2048 tokens | 512 tokens as starting point |
| Overlap | 0-25% of chunk size | 10-20% to maintain context |
| Top-K | 1-20 results | 3-5 for most use cases |
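As a sanity check on the recommended settings, here is the arithmetic for 512-token chunks with 15% overlap over a hypothetical 10,000-token document:

```python
import math

chunk_size = 512
overlap = int(chunk_size * 0.15)   # 15% overlap -> 76 tokens
stride = chunk_size - overlap      # each chunk starts 436 tokens after the last

doc_tokens = 10_000  # hypothetical document length
# One chunk per start position; the final partial window is rounded up.
n_chunks = math.ceil((doc_tokens - chunk_size) / stride) + 1
```

The overlap costs roughly 17% extra storage and embedding calls (512/436), which is the usual trade for not losing context at chunk boundaries.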