Error Types in Agent Systems
Why Error Handling Matters
The Problem: Agents interact with external APIs, databases, and LLMs -- all of which can fail. Without proper error handling, a single failure can crash the entire agent workflow.
The Solution: Robust error handling with retry strategies, fallback chains, and graceful degradation keeps agents running reliably even when individual components fail.
Real Impact: Production agents with proper error handling can reach 99%+ uptime, versus roughly 80-90% without it.
Real-World Analogy
Think of error handling like a pilot handling in-flight problems:
- Retry = Toggle the switch again -- maybe it was a momentary glitch
- Fallback = Switch to the backup system when primary fails
- Self-Correction = Adjust altitude based on new instrument readings
- Graceful Degradation = Land at a closer airport instead of the destination
- Circuit Breaker = Stop trying a system that keeps failing
Common Error Categories
Tool Errors
API timeouts, rate limits, invalid parameters, network failures. Most common and most recoverable.
LLM Errors
Rate limits, context overflow, content filtering, malformed output. Requires retry or model switching.
Logic Errors
Agent enters infinite loops, makes contradictory decisions, or fails to make progress. Needs loop detection.
Data Errors
Invalid input data, schema mismatches, encoding issues. Requires validation and sanitization.
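One way to act on these categories is to map each exception to a category and a default recovery action. The sketch below is illustrative only: `ErrorCategory`, `classify_error`, the `RULES` table, and the three custom exception classes are hypothetical names introduced here, and escalating unrecognized errors is an assumption rather than a rule from this guide.

```python
from enum import Enum


class ErrorCategory(Enum):
    TOOL = "tool"
    LLM = "llm"
    LOGIC = "logic"
    DATA = "data"


# Hypothetical exception types, defined here only for illustration.
class ToolTimeoutError(Exception): ...
class ContextOverflowError(Exception): ...
class LoopDetectedError(Exception): ...


# Default recovery action per category, following the descriptions above.
RECOVERY = {
    ErrorCategory.TOOL: "retry",
    ErrorCategory.LLM: "retry_or_switch_model",
    ErrorCategory.LOGIC: "stop_loop",
    ErrorCategory.DATA: "validate_and_sanitize",
}

# Ordered rules: the first matching exception type decides the category.
RULES = [
    ((ToolTimeoutError, TimeoutError, ConnectionError), ErrorCategory.TOOL),
    ((ContextOverflowError,), ErrorCategory.LLM),
    ((LoopDetectedError,), ErrorCategory.LOGIC),
    ((ValueError,), ErrorCategory.DATA),  # Includes JSON and Unicode decode errors
]


def classify_error(exc):
    """Map an exception to a (category, recovery action) pair."""
    for exc_types, category in RULES:
        if isinstance(exc, exc_types):
            return category, RECOVERY[category]
    # Unrecognized errors are escalated instead of guessed at (an assumption).
    return None, "escalate"


print(classify_error(TimeoutError("tool call timed out")))
```

Keeping the rules in a table makes it straightforward to add a provider's real exception types later without touching the classification logic.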
Retry Strategies
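```python
import random
import time


# Placeholder exception types for illustration; in practice, import the
# equivalents from your LLM or tool client library.
class RateLimitError(Exception): ...
class ToolError(Exception): ...
class APIError(Exception): ...


def retry_with_backoff(func, max_retries=3, base_delay=1.0):
    """Call func(), retrying rate-limit errors with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # Retries exhausted: surface the error to the caller
            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay * 0.1)
            time.sleep(delay + jitter)
        except (ToolError, APIError) as e:
            # Return error to agent for self-correction
            return {"error": str(e), "retryable": True}
```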
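Fallback Chains
```python
class FallbackChain:
    """Try each provider in order until one returns a response."""

    def __init__(self, providers):
        self.providers = providers  # e.g. [gpt4, claude, local_model]

    def call(self, messages):
        last_error = None
        for provider in self.providers:
            try:
                return provider.chat(messages)
            except Exception as e:
                last_error = e  # Remember the failure and try the next provider
        # Chain the last failure so the root cause is not lost
        raise RuntimeError("All providers failed") from last_error
```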
| Fallback Strategy | When to Use | Example |
|---|---|---|
| Model Fallback | Primary model unavailable | GPT-4o -> Claude -> Llama |
| Tool Fallback | Primary tool fails | Google Search -> Bing -> DuckDuckGo |
| Strategy Fallback | Primary approach fails | RAG -> Web Search -> Cached Answer |
| Quality Fallback | Reduce quality for reliability | Detailed answer -> Summary -> "I cannot help" |
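The strategy and quality rows of the table can be sketched with plain callables. This is a minimal illustration, not a definitive implementation: `run_with_fallbacks`, the three stub strategies, and the `CACHE` dictionary are hypothetical names that stand in for a real RAG pipeline, search API, and cache.

```python
import logging

logger = logging.getLogger(__name__)


def run_with_fallbacks(strategies, query, default="I cannot help with that right now."):
    """Try (name, callable) strategies in order; return (name, answer)."""
    for name, strategy in strategies:
        try:
            return name, strategy(query)
        except Exception as exc:
            # Record the failure and move on to the next strategy.
            logger.warning("strategy %s failed: %s", name, exc)
    # Every strategy failed: return the lowest-quality, always-available answer.
    return "default", default


# Hypothetical strategies standing in for RAG, web search, and a cache.
def rag_answer(query):
    raise ConnectionError("vector store unavailable")


def web_search_answer(query):
    raise TimeoutError("search API timed out")


CACHE = {"capital of France": "Paris"}


def cached_answer(query):
    return CACHE[query]  # Raises KeyError on a cache miss


strategies = [
    ("rag", rag_answer),
    ("web_search", web_search_answer),
    ("cache", cached_answer),
]
print(run_with_fallbacks(strategies, "capital of France"))  # ('cache', 'Paris')
print(run_with_fallbacks(strategies, "unknown question"))
```

Returning the name of the strategy that answered lets the caller attach a disclaimer when the response came from a lower-quality source.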
Self-Correction
Self-Correction Patterns
- Output Validation: Check agent output against expected schema, retry if invalid
- Reflection: Ask the LLM to evaluate its own response and identify errors
- Test-Driven: Run generated code, feed errors back for fixing
- Critic Agent: A separate agent evaluates and requests corrections
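The output validation pattern can be sketched as a loop that feeds validation errors back into the prompt. The example below makes several assumptions: the agent is expected to emit a JSON object with `action` and `argument` keys (a made-up schema), `llm` is any callable that takes a prompt string and returns text, and the helper names are hypothetical.

```python
import json

REQUIRED_KEYS = {"action", "argument"}  # Hypothetical schema for an agent step


def validate_output(raw):
    """Return (parsed, None) if the output is valid, else (None, error message)."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"Output is not valid JSON: {exc.msg}"
    if not isinstance(parsed, dict):
        return None, "Output must be a JSON object"
    missing = REQUIRED_KEYS - set(parsed)
    if missing:
        return None, f"Missing keys: {sorted(missing)}"
    return parsed, None


def generate_with_correction(llm, prompt, max_attempts=3):
    """Call llm(prompt); on invalid output, feed the error back and retry."""
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = llm(current_prompt)
        parsed, error = validate_output(raw)
        if error is None:
            return parsed
        # Feed the validation error back so the model can correct itself.
        current_prompt = (
            f"{prompt}\n\nYour previous output was rejected: {error}\n"
            "Reply with corrected JSON only."
        )
    raise ValueError(f"No valid output after {max_attempts} attempts")


# Stub LLM for demonstration: invalid on the first call, valid on the second.
responses = iter([
    '{"action": "search"}',
    '{"action": "search", "argument": "weather"}',
])
result = generate_with_correction(lambda p: next(responses), "Plan the next step.")
print(result)  # {'action': 'search', 'argument': 'weather'}
```

Capping the attempts matters here: without `max_attempts`, a model that never produces valid output would turn self-correction into the kind of infinite loop described under Logic Errors.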
Graceful Degradation
| Level | Behavior | User Experience |
|---|---|---|
| Full Service | All systems operational | Complete, detailed response |
| Reduced | Some tools unavailable | Partial answer with disclaimer |
| Minimal | Only LLM available | Best-effort from training data |
| Cached | All APIs down | Return cached/pre-computed answers |
| Failure | Critical failure | Clear error message + escalation |
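The levels in the table can be chosen from simple health signals. The following sketch assumes three inputs that the table does not define precisely: whether the LLM API is reachable, a per-tool availability map, and whether a cached answer exists. `ServiceLevel` and `select_service_level` are hypothetical names.

```python
from enum import Enum


class ServiceLevel(Enum):
    FULL = "full"
    REDUCED = "reduced"
    MINIMAL = "minimal"
    CACHED = "cached"
    FAILURE = "failure"


def select_service_level(llm_up, tools_up, cache_available):
    """Pick a service level from component health.

    llm_up: whether the LLM API is reachable.
    tools_up: dict mapping tool name to availability.
    cache_available: whether a cached answer exists for the request.
    """
    if llm_up and all(tools_up.values()):
        return ServiceLevel.FULL      # All systems operational
    if llm_up and any(tools_up.values()):
        return ServiceLevel.REDUCED   # Some tools unavailable
    if llm_up:
        return ServiceLevel.MINIMAL   # Only the LLM is available
    if cache_available:
        return ServiceLevel.CACHED    # APIs down, serve cached answers
    return ServiceLevel.FAILURE       # Nothing left: report and escalate


level = select_service_level(True, {"search": True, "database": False}, False)
print(level)  # ServiceLevel.REDUCED
```

The agent can then branch on the returned level, for example by appending a disclaimer at `REDUCED` or returning a clear error message at `FAILURE`.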
Quick Reference
| Best Practice | Description | Implementation |
|---|---|---|
| Exponential Backoff | Increase delay between retries | delay = base * 2^attempt + jitter |
| Circuit Breaker | Stop calling failing services | Track failures, open after threshold |
| Timeout | Set max wait time per tool call | 30s for APIs, 60s for complex tasks |
| Max Steps | Limit agent loop iterations | Prevent infinite loops, usually 10-25 |
| Logging | Record all errors and decisions | Structured logs with trace IDs |
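The circuit breaker row ("track failures, open after threshold") can be sketched as a small wrapper class. This is a simplified illustration under stated assumptions: the threshold and timeout defaults are arbitrary, `CircuitBreaker` and `CircuitOpenError` are hypothetical names, and the `clock` parameter exists only so the behavior can be tested without real waiting.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; skipping call")
            # Timeout elapsed: allow one trial call (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # Open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky_tool():
    raise ConnectionError("service down")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=30.0)
for _ in range(3):
    try:
        breaker.call(flaky_tool)
    except ConnectionError:
        print("call failed")
    except CircuitOpenError:
        print("call skipped: circuit open")
```

After the second failure the breaker opens, so the third iteration is skipped without touching the failing service. Once `reset_timeout` has passed, a single trial call decides whether the circuit closes again.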