Fault tolerance, circuit breaker, retry, and bulkhead patterns
In a distributed microservices system, failures are inevitable. A service might be slow, unresponsive, or completely down. Without protection, one failing service can take down your entire system!
Imagine a chain of dominoes: when the first one falls, it knocks over the next, and the whole chain collapses. A failure in one service can cascade through its callers in exactly the same way.
Think of your home's electrical circuit breaker. When there's a problem, it "opens" (breaks the circuit) to prevent damage. Microservices use the same concept!
🟢 Closed (Normal): Requests flow normally
🔴 Open (Failing): Requests are rejected immediately, no calls to failing service
🟡 Half-Open (Testing): Trying limited requests to see if service recovered
import requests
from datetime import datetime, timedelta

class SimpleCircuitBreaker:
    def __init__(self, failure_threshold=5, open_seconds=60):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.failure_count = 0
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN
        self.open_until = None

    def call_service(self, url):
        # If the circuit is OPEN, fail fast instead of calling the service
        if self.state == "OPEN":
            if datetime.now() > self.open_until:
                # Cool-down has elapsed: allow a trial request through
                self.state = "HALF_OPEN"
            else:
                raise Exception("Circuit is OPEN - service unavailable")
        try:
            response = requests.get(url, timeout=3)
            response.raise_for_status()  # Treat HTTP error statuses as failures too
            # Success! Reset the failure count and close the circuit
            self.failure_count = 0
            self.state = "CLOSED"
            return response
        except Exception as e:
            self.failure_count += 1
            # Too many failures? Open the circuit. (A failure in HALF_OPEN
            # re-opens it immediately, since the count is still at the threshold.)
            if self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
                self.open_until = datetime.now() + timedelta(seconds=self.open_seconds)
                print("Circuit OPENED - too many failures!")
            raise e

# Usage
breaker = SimpleCircuitBreaker()
try:
    result = breaker.call_service('http://unreliable-service/api')
except Exception as e:
    print("Service call failed:", e)
The goal is to build robust microservices that handle failures gracefully. Let's implement a production-ready circuit breaker with all three states, this time in Java with Resilience4j:
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import java.time.Duration;

// Configure circuit breaker
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                        // Open at 50% failure rate
    .waitDurationInOpenState(Duration.ofSeconds(60)) // Stay open for 60s
    .slidingWindowSize(10)                           // Look at the last 10 calls
    .minimumNumberOfCalls(5)                         // Need 5 calls before evaluating
    .permittedNumberOfCallsInHalfOpenState(3)        // Allow 3 trial calls in half-open
    .build();

CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);
CircuitBreaker breaker = registry.circuitBreaker("paymentService");

// Use circuit breaker
public PaymentResult processPayment(PaymentRequest request) {
    return breaker.executeSupplier(() -> {
        // This call is protected by the circuit breaker
        return paymentClient.process(request);
    });
}
Don't retry immediately - use increasing delays to give the service time to recover. Adding a little random jitter also keeps many clients from retrying in lockstep:
import time
import random

def retry_with_backoff(func, max_retries=5):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise  # Last attempt - give up
            # Exponential backoff with jitter: ~1s, 2s, 4s, 8s, ...
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Attempt {attempt + 1} failed. Retrying in {wait_time:.1f}s...")
            time.sleep(wait_time)

# Usage
result = retry_with_backoff(lambda: call_external_api())
Isolate resources so that one failing component can't consume them all.
| Pattern | Purpose | Implementation |
|---|---|---|
| Thread Pool Bulkhead | Separate thread pools per service | Each external service gets own thread pool |
| Semaphore Bulkhead | Limit concurrent calls | Max N concurrent calls to service |
| Resource Bulkhead | Separate resources | Dedicated DB connections, memory |
A simple semaphore-style bulkhead in JavaScript limits how many calls run concurrently and queues the rest:

class Bulkhead {
  constructor(maxConcurrent = 10) {
    this.maxConcurrent = maxConcurrent;
    this.currentRunning = 0;
    this.queue = [];
  }

  async execute(task) {
    // No capacity? Wait in the queue until a slot frees up
    if (this.currentRunning >= this.maxConcurrent) {
      return new Promise((resolve, reject) => {
        this.queue.push({ task, resolve, reject });
      });
    }
    return this._run(task);
  }

  async _run(task) {
    this.currentRunning++;
    try {
      return await task();
    } finally {
      this.currentRunning--;
      this._processQueue();
    }
  }

  _processQueue() {
    if (this.queue.length > 0 && this.currentRunning < this.maxConcurrent) {
      const { task, resolve, reject } = this.queue.shift();
      this._run(task).then(resolve).catch(reject);
    }
  }
}

// Usage: separate bulkheads for different services
const paymentBulkhead = new Bulkhead(5);   // Max 5 concurrent payment calls
const shippingBulkhead = new Bulkhead(10); // Max 10 concurrent shipping calls

await paymentBulkhead.execute(() => processPayment(order));
await shippingBulkhead.execute(() => shipOrder(order));
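The table's first row, a thread-pool bulkhead, can be sketched in Python with a dedicated executor per dependency. This is a minimal illustration; the pool names and call_payment_api are hypothetical stand-ins:

from concurrent.futures import ThreadPoolExecutor

def call_payment_api(order):  # stand-in for the real payment client
    ...

# One dedicated pool per downstream service: if payments hang, they can
# exhaust only their own 5 threads - the shipping pool is unaffected
payment_pool = ThreadPoolExecutor(max_workers=5, thread_name_prefix='payment')
shipping_pool = ThreadPoolExecutor(max_workers=10, thread_name_prefix='shipping')

def call_payment(order):
    # submit() returns a Future; result(timeout=...) also bounds the wait
    return payment_pool.submit(call_payment_api, order).result(timeout=5)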
Netflix pioneered these resilience patterns with Hystrix (now in maintenance mode, though the concepts remain relevant) and turned them into production-grade practice, including chaos engineering.
Impact: Netflix serves 200M+ subscribers with high availability despite constant partial failures.
Combine multiple resilience patterns for robust systems:
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.ratelimiter.RateLimiter;
import io.github.resilience4j.retry.Retry;

public class ResilientServiceCall {
    private final CircuitBreaker circuitBreaker;
    private final Retry retry;
    private final Bulkhead bulkhead;
    private final RateLimiter rateLimiter;

    public Result callService(Request request) {
        return Decorators.ofSupplier(() -> makeActualCall(request))
            .withCircuitBreaker(circuitBreaker)   // Layer 1: Circuit breaker
            .withRetry(retry)                     // Layer 2: Retry with backoff
            .withBulkhead(bulkhead)               // Layer 3: Limit concurrency
            .withRateLimiter(rateLimiter)         // Layer 4: Rate limiting
            .withFallback(this::fallbackResponse) // Layer 5: Fallback
            .get();
        // Timeouts (TimeLimiter) apply to async calls: decorate a
        // CompletionStage via Decorators.ofCompletionStage(...) to add that layer.
    }

    private Result fallbackResponse(Throwable throwable) {
        // Return cached data or a default response
        return cacheService.getLastKnownGood()
            .orElse(Result.defaultResult());
    }
}
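The same layering can be done by hand. Here is a rough Python sketch that reuses SimpleCircuitBreaker and retry_with_backoff from earlier in this section:

# Retry wraps the breaker: while the circuit is open, each retry attempt
# fails fast instead of hitting the struggling service again
breaker = SimpleCircuitBreaker(failure_threshold=5)

def resilient_call(url):
    return retry_with_backoff(lambda: breaker.call_service(url), max_retries=3)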
Intentionally inject failures to test resilience:
| Chaos Experiment | What It Tests | Example |
|---|---|---|
| Latency Injection | Slow dependencies | Add 2s delay to 10% of requests |
| Error Injection | Service failures | Return 500 error for 5% of calls |
| Instance Termination | Server crashes | Kill random service instances |
| Network Partition | Network splits | Block communication between services |
| Resource Exhaustion | Memory/CPU limits | Consume 90% of available memory |
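As a toy illustration of the first two experiments, here is a Python sketch of a wrapper that injects latency or errors into a call with configured probabilities (all names and probabilities are illustrative):

import random
import time

def chaos_wrapper(func, latency_prob=0.10, latency_s=2.0, error_prob=0.05):
    """Wrap func so some calls are delayed and some fail, as in the table above."""
    def wrapped(*args, **kwargs):
        if random.random() < latency_prob:
            time.sleep(latency_s)  # Latency injection: delay a fraction of requests
        if random.random() < error_prob:
            raise RuntimeError("Injected failure (chaos experiment)")  # Error injection
        return func(*args, **kwargs)
    return wrapped

# Usage (illustrative): wrap an outbound call during a game day
# flaky_call = chaos_wrapper(call_external_api, latency_prob=0.10, error_prob=0.05)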
A Gremlin-style chaos experiment configuration that targets 50% of payment-service instances:

{
  "name": "Payment Service Latency Test",
  "hypothesis": "System remains functional with 95% success rate even when payment service has 2s latency",
  "experiment": {
    "target": {
      "service": "payment-service",
      "percentage": 50
    },
    "attack": {
      "type": "latency",
      "duration": "10m",
      "magnitude": "2000ms"
    }
  },
  "steady_state": {
    "metrics": [
      {"name": "success_rate", "min": 95},
      {"name": "p99_latency", "max": "5000ms"},
      {"name": "error_rate", "max": 5}
    ]
  },
  "rollback": {
    "on_violation": true,
    "alerts": ["slack", "pagerduty"]
  }
}
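A sketch of how the steady_state block might be enforced in code; get_metric and rollback_experiment are hypothetical stand-ins for your metrics and rollback tooling:

def steady_state_holds(get_metric):
    # Mirrors the thresholds in the steady_state block above
    return (get_metric('success_rate') >= 95
            and get_metric('p99_latency_ms') <= 5000
            and get_metric('error_rate') <= 5)

# During the experiment, check periodically and abort on the first violation:
# if not steady_state_holds(get_metric):
#     rollback_experiment()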
You can't fix what you can't see. Proper monitoring is crucial:
from prometheus_client import Counter, Gauge, Histogram

# Circuit breaker metrics
circuit_breaker_state = Gauge(
    'circuit_breaker_state',
    'Circuit breaker state (0=closed, 1=open, 2=half_open)',
    ['service']
)
circuit_breaker_failures = Counter(
    'circuit_breaker_failures_total',
    'Total number of circuit breaker failures',
    ['service']
)
circuit_breaker_success = Counter(
    'circuit_breaker_success_total',
    'Total number of successful calls',
    ['service']
)
service_call_duration = Histogram(
    'service_call_duration_seconds',
    'Service call duration',
    ['service'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)

# Update metrics
circuit_breaker_state.labels(service='payment').set(1)  # Open
circuit_breaker_failures.labels(service='payment').inc()
service_call_duration.labels(service='payment').observe(2.5)
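As a sketch, the SimpleCircuitBreaker from earlier can keep these metrics up to date as it runs; monitored_call is a hypothetical helper, and the .time() context manager is part of prometheus_client:

STATE_VALUES = {'CLOSED': 0, 'OPEN': 1, 'HALF_OPEN': 2}

def monitored_call(breaker, service, url):
    # time() records the call duration in the histogram automatically
    with service_call_duration.labels(service=service).time():
        try:
            response = breaker.call_service(url)
            circuit_breaker_success.labels(service=service).inc()
            return response
        except Exception:
            circuit_breaker_failures.labels(service=service).inc()
            raise
        finally:
            circuit_breaker_state.labels(service=service).set(STATE_VALUES[breaker.state])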
Amazon's approach to building resilient systems rests on one philosophy, famously stated by CTO Werner Vogels: "Everything fails, all the time." Design for failure, not perfection.