Error Handling & Recovery


Error Types in Agent Systems

Why Error Handling Matters

The Problem: Agents interact with external APIs, databases, and LLMs -- all of which can fail. Without proper error handling, a single failure can crash the entire agent workflow.

The Solution: Robust error handling with retry strategies, fallback chains, and graceful degradation keeps agents running reliably even when individual components fail.

Real Impact: Production agents with proper error handling achieve 99%+ uptime versus 80-90% without it.

Real-World Analogy

Think of error handling like a pilot handling in-flight problems:

  • Retry = Toggle the switch again -- maybe it was a momentary glitch
  • Fallback = Switch to the backup system when primary fails
  • Self-Correction = Adjust altitude based on new instrument readings
  • Graceful Degradation = Land at a closer airport instead of the destination
  • Circuit Breaker = Stop trying a system that keeps failing

Common Error Categories

Tool Errors

API timeouts, rate limits, invalid parameters, network failures. Most common and most recoverable.

LLM Errors

Rate limits, context overflow, content filtering, malformed output. Requires retry or model switching.

Logic Errors

Agent enters infinite loops, makes contradictory decisions, or fails to make progress. Needs loop detection.

Data Errors

Invalid input data, schema mismatches, encoding issues. Requires validation and sanitization.
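The Logic Errors category above calls for loop detection. As a rough sketch of what that could look like, the guard below counts steps and flags the same tool call repeating with identical arguments. The names LoopGuard and LoopDetected and the default thresholds are assumptions made for this example, not part of any framework.

```python
from collections import Counter


class LoopDetected(Exception):
    """Raised when the agent repeats itself or exceeds its step budget."""


class LoopGuard:
    """Hypothetical helper: call check() once per agent action."""

    def __init__(self, max_steps: int = 15, max_repeats: int = 3):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen = Counter()

    def check(self, tool_name: str, arguments: str) -> None:
        """Record one action and raise if it looks like a loop."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise LoopDetected(f"exceeded {self.max_steps} steps")
        key = (tool_name, arguments)
        self.seen[key] += 1
        if self.seen[key] >= self.max_repeats:
            raise LoopDetected(
                f"{tool_name} called {self.seen[key]} times with identical arguments"
            )
```

Exact-match counting is the simplest possible signal. It will miss loops where the arguments change slightly on each pass, so the step budget acts as the backstop.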

Retry Strategies

[Figure: Error Handling Decision Tree. Nodes: Error Occurs; Retryable? (Yes / No); Retry w/ backoff; Try fallback; Degrade gracefully; Fatal Error; Report & stop]
error_handling.py
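import random
import time


# Placeholder exception types so the example runs on its own.
# In a real project these would come from your LLM or tool client library.
class RateLimitError(Exception):
    pass


class ToolError(Exception):
    pass


class APIError(Exception):
    pass


def retry_with_backoff(func, max_retries=3, base_delay=1.0):
    """Call func, retrying rate-limit errors with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # Out of retries: let the caller decide what to do
            # Exponential backoff with jitter: 1s, 2s, 4s, ... plus up to 10%
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay * 0.1)
            time.sleep(delay + jitter)
        except (ToolError, APIError) as e:
            # Return the error to the agent so it can self-correct
            return {"error": str(e), "retryable": True}


class FallbackChain:
    def __init__(self, providers):
        self.providers = providers  # e.g. [gpt4, claude, local_model]

    def call(self, messages):
        last_error = None
        for provider in self.providers:
            try:
                return provider.chat(messages)
            except Exception as e:
                last_error = e  # Remember the failure, then try the next provider
        # Chain the last failure so the root cause is not lost
        raise RuntimeError("All providers failed") from last_error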
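The decision tree figure above can be read as one control flow: retry retryable errors with backoff, then try a fallback, then degrade, and stop immediately on fatal errors. The sketch below is one plausible reading of that tree, not a definitive implementation. The function name run_with_recovery and all of its parameters are assumptions made for this example.

```python
import time
from typing import Callable


def run_with_recovery(
    primary: Callable[[], str],
    fallback: Callable[[], str],
    degraded: str,
    is_retryable: Callable[[Exception], bool],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Walk the decision tree: retry, then fall back, then degrade."""
    for attempt in range(max_retries):
        try:
            return primary()
        except Exception as exc:
            if not is_retryable(exc):
                raise  # Fatal error: report and stop
            if attempt < max_retries - 1:
                sleep(base_delay * (2 ** attempt))  # Back off before the next try
    # Retries exhausted: try the fallback
    try:
        return fallback()
    except Exception:
        return degraded  # Last resort: degrade gracefully
```

Passing sleep as a parameter is a small design choice that lets tests skip the real delays.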

Fallback Chains

Fallback Strategy | When to Use | Example
Model Fallback | Primary model unavailable | GPT-4o -> Claude -> Llama
Tool Fallback | Primary tool fails | Google Search -> Bing -> DuckDuckGo
Strategy Fallback | Primary approach fails | RAG -> Web Search -> Cached Answer
Quality Fallback | Reduce quality for reliability | Detailed answer -> Summary -> "I cannot help"
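The tool, strategy, and quality rows in the table share one shape: an ordered list of options tried until one succeeds. A minimal sketch of that shape follows. The helper first_success and the two search functions are hypothetical stand-ins, not real client APIs.

```python
from typing import Callable, List, Sequence, Tuple

Tool = Tuple[str, Callable[[str], str]]


def first_success(query: str, tools: Sequence[Tool]) -> Tuple[str, str]:
    """Try each (name, tool) pair in order; return the name and first result."""
    errors: List[str] = []
    for name, tool in tools:
        try:
            return name, tool(query)
        except Exception as exc:
            errors.append(f"{name}: {exc}")  # Keep the reason for later logging
    raise RuntimeError("All fallbacks failed: " + "; ".join(errors))


# Hypothetical stand-ins for real search clients.
def primary_search(query: str) -> str:
    raise TimeoutError("primary search timed out")


def backup_search(query: str) -> str:
    return f"results for {query}"


source, answer = first_success("agent retries", [
    ("primary", primary_search),
    ("backup", backup_search),
])
print(source, answer)  # prints: backup results for agent retries
```

For a quality fallback, the last entry can be a function that returns a fixed message such as "I cannot help", so the chain always yields something. Returning the name of the option that answered makes it easy to log how often the fallbacks are used.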

Self-Correction

Self-Correction Patterns

  • Output Validation: Check agent output against expected schema, retry if invalid
  • Reflection: Ask the LLM to evaluate its own response and identify errors
  • Test-Driven: Run generated code, feed errors back for fixing
  • Critic Agent: A separate agent evaluates and requests corrections
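The Output Validation pattern above can be sketched as a loop that validates the reply and feeds the validation error back into the next prompt. This is an illustrative sketch only: the llm argument is assumed to be any callable that takes a prompt string and returns text, and generate_valid_json is a made-up name.

```python
import json
from typing import Callable, Set


def generate_valid_json(
    llm: Callable[[str], str],
    prompt: str,
    required_keys: Set[str],
    max_attempts: int = 3,
) -> dict:
    """Ask for JSON, validate it, and feed validation errors back on failure."""
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = llm(current_prompt)
        try:
            data = json.loads(raw)
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object")
            missing = required_keys - set(data)
            if missing:
                raise ValueError(f"missing keys: {sorted(missing)}")
            return data
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            # Tell the model what was wrong so it can correct itself
            current_prompt = (
                f"{prompt}\n\nYour previous reply was invalid ({exc}). "
                "Reply with valid JSON only."
            )
    raise ValueError(f"no valid output after {max_attempts} attempts")
```

The same loop structure fits the Test-Driven pattern: replace the JSON check with running the generated code and put the captured error into the follow-up prompt.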

Graceful Degradation

Level | Behavior | User Experience
Full Service | All systems operational | Complete, detailed response
Reduced | Some tools unavailable | Partial answer with disclaimer
Minimal | Only LLM available | Best-effort from training data
Cached | All APIs down | Return cached/pre-computed answers
Failure | Critical failure | Clear error message + escalation
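The levels in the table can be chosen from a few health signals. The sketch below assumes the agent already knows whether the LLM is reachable, how many tools are healthy, and whether a cached answer exists. The names ServiceLevel and pick_service_level are invented for this example.

```python
from enum import Enum


class ServiceLevel(Enum):
    FULL = "full"
    REDUCED = "reduced"
    MINIMAL = "minimal"
    CACHED = "cached"
    FAILURE = "failure"


def pick_service_level(
    llm_up: bool, tools_up: int, tools_total: int, cache_hit: bool
) -> ServiceLevel:
    """Pick the best level the current system health allows."""
    if llm_up and tools_up == tools_total:
        return ServiceLevel.FULL      # All systems operational
    if llm_up and tools_up > 0:
        return ServiceLevel.REDUCED   # Some tools unavailable
    if llm_up:
        return ServiceLevel.MINIMAL   # Only the LLM is available
    if cache_hit:
        return ServiceLevel.CACHED    # All APIs down, but a cached answer exists
    return ServiceLevel.FAILURE       # Nothing usable: error message + escalation
```

Making the level explicit lets the response layer attach the right disclaimer or escalation message instead of failing silently.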

Quick Reference

Best Practice | Description | Implementation
Exponential Backoff | Increase delay between retries | delay = base * 2^attempt + jitter
Circuit Breaker | Stop calling failing services | Track failures, open after threshold
Timeout | Set max wait time per tool call | 30s for APIs, 60s for complex tasks
Max Steps | Limit agent loop iterations | Prevent infinite loops, usually 10-25
Logging | Record all errors and decisions | Structured logs with trace IDs
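The Circuit Breaker row is described only as "track failures, open after threshold". A minimal sketch of that idea follows, assuming a consecutive-failure counter and a fixed cool-down before one trial call is allowed. The class names, threshold, and timeout are illustrative choices, not values from a specific library.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], object]) -> object:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: let one trial call through (half-open)
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # Open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the counter
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast with CircuitOpenError, which gives a fallback chain a cheap signal to move on to the next option without waiting for another timeout.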