Deployment & Scaling

Difficulty: Hard · 30 min read

Deployment Options

Why Deployment Strategy Matters

The Problem: AI agents that work in notebooks often fail in production due to concurrency issues, cold starts, memory leaks, uncontrolled costs, and lack of observability.

The Solution: Production-grade deployment with containerization, auto-scaling, caching, and comprehensive monitoring ensures agents are reliable, cost-effective, and observable at scale.

Real Impact: Properly deployed agent systems handle 100x traffic spikes while keeping per-interaction costs under $0.05 and maintaining 99.9% uptime.

Real-World Analogy

Think of agent deployment like running a restaurant chain:

  • Docker Container = A standardized kitchen that works anywhere
  • Kubernetes = The franchise management system coordinating all locations
  • Auto-scaling = Opening more registers during the lunch rush
  • Caching = Pre-preparing popular menu items for faster service
  • Monitoring = Health inspectors and customer satisfaction surveys

Deployment Architecture Options

Serverless (AWS Lambda / Cloud Functions)

Pay-per-invocation, auto-scaling, zero infrastructure. Best for low-volume or bursty agent workloads with short execution times.
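As an illustration, a serverless agent endpoint can be as small as a single handler function. The sketch below shows a hypothetical AWS Lambda entry point; `run_agent` is a placeholder for the actual agent invocation, which would live elsewhere:

```python
import json

def run_agent(query: str) -> str:
    """Placeholder for the real agent call (LLM loop, tools, etc.)."""
    return f"echo: {query}"

def lambda_handler(event, context):
    """AWS Lambda entry point: parse the request body, invoke the agent, return JSON."""
    body = json.loads(event.get("body") or "{}")
    query = body.get("query", "")
    if not query:
        return {"statusCode": 400, "body": json.dumps({"error": "missing query"})}
    answer = run_agent(query)
    return {"statusCode": 200, "body": json.dumps({"answer": answer})}
```

Because Lambda bills per invocation and scales to zero, this shape works well for bursty traffic, but cold starts and the execution-time ceiling make it a poor fit for long-running agent loops.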

Container-Based (ECS / Cloud Run / K8s)

Full control over runtime, persistent connections, predictable performance. Best for high-volume agents needing consistent latency.

Managed Platforms (Modal, Replicate)

Specialized ML/AI hosting with GPU access, pre-built scaling, and simplified deployment. Best for rapid prototyping and small teams.

Self-Hosted (VMs / Bare Metal)

Maximum control and data sovereignty. Best for enterprises with strict compliance requirements or existing infrastructure.

Containerization

Production Deployment Architecture
[Architecture diagram] Users → Load Balancer (rate limiting, SSL termination, health checks) → Agent Pods 1-3 with auto-scaling → Redis Cache, PostgreSQL, and external LLM APIs; Monitoring spans all components.
Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Health check endpoint (python:3.11-slim ships without curl, so use Python's stdlib)
HEALTHCHECK --interval=30s --timeout=10s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1

# Run with gunicorn for production
CMD ["gunicorn", "main:app", "-w", "4", "-k", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:8000"]
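The gunicorn command above assumes `main.py` exposes an ASGI application named `app` with a `/health` route. A framework-free sketch of that contract (in practice you would likely use FastAPI or another ASGI framework):

```python
import json

async def app(scope, receive, send):
    """Minimal ASGI app exposing /health for the Dockerfile's HEALTHCHECK."""
    assert scope["type"] == "http"
    if scope["path"] == "/health":
        status, payload = 200, {"status": "ok"}
    else:
        status, payload = 404, {"error": "not found"}
    body = json.dumps(payload).encode()
    await send({
        "type": "http.response.start",
        "status": status,
        "headers": [(b"content-type", b"application/json")],
    })
    await send({"type": "http.response.body", "body": body})
```

The health route should stay cheap and dependency-free: load balancers and Kubernetes probes hit it frequently, and a health check that itself calls the LLM API would add cost and flakiness.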

Scaling Strategies

k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-agent
  template:
    metadata:
      labels:
        app: ai-agent
    spec:
      containers:
      - name: agent
        image: myregistry/ai-agent:latest
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
        env:
        - name: ANTHROPIC_API_KEY
          valueFrom:
            secretKeyRef:
              name: api-keys
              key: anthropic
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Cost Optimization

Cost Reduction Strategies

  • Response Caching: Cache identical queries with Redis (TTL-based) to avoid redundant LLM calls
  • Model Tiering: Use cheaper models for simple tasks, expensive ones only for complex reasoning
  • Prompt Optimization: Shorter, more efficient prompts reduce token costs by 30-50%
  • Batch Processing: Group similar requests to reduce per-request overhead
  • Token Budgets: Set per-user and per-session token limits to prevent runaway costs
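The first of these strategies, TTL-based response caching, can be sketched as follows. An in-memory class stands in for Redis here so the example is self-contained; redis-py's `get`/`setex` calls have the same shape, so swapping in a real client is a one-line change:

```python
import hashlib
import time

class TTLCache:
    """In-memory stand-in for Redis: get/setex with per-key expiry."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        value, expires_at = self._store.get(key, (None, 0.0))
        return value if time.monotonic() < expires_at else None

    def setex(self, key, ttl_seconds, value):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

def cached_agent_call(cache, prompt, llm_call, ttl_seconds=3600):
    """Serve identical prompts from cache; otherwise call the LLM and cache the answer."""
    key = "agent:" + hashlib.sha256(prompt.encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit
    answer = llm_call(prompt)
    cache.setex(key, ttl_seconds, answer)
    return answer
```

Hashing the prompt keeps keys fixed-length; note this only catches byte-identical queries, so normalizing whitespace and casing before hashing raises the hit rate.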

Monitoring

Common Pitfall

Problem: Agent costs spiral out of control because there is no per-user or per-session budget enforcement.

Solution: Implement token budgets at multiple levels: per request, per session, per user, and per day. Set up cost alerts that trigger at 50%, 80%, and 100% of budget thresholds.
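A minimal sketch of the per-user daily budget with threshold alerts described above (the limits, threshold values, and alert mechanism are illustrative; production systems would persist usage in Redis or a database rather than process memory):

```python
import time
from collections import defaultdict

class TokenBudget:
    """Track token usage per user per day and reject requests over budget."""
    def __init__(self, daily_limit, alert_thresholds=(0.5, 0.8, 1.0)):
        self.daily_limit = daily_limit
        self.alert_thresholds = alert_thresholds
        self.usage = defaultdict(int)  # (user_id, day_number) -> tokens used

    def _key(self, user_id):
        return (user_id, int(time.time() // 86400))  # resets at each UTC day

    def check_and_record(self, user_id, tokens):
        """Return True and record usage if within budget; False otherwise."""
        key = self._key(user_id)
        if self.usage[key] + tokens > self.daily_limit:
            return False
        before = self.usage[key] / self.daily_limit
        self.usage[key] += tokens
        after = self.usage[key] / self.daily_limit
        # Fire an alert each time usage crosses a threshold (50%, 80%, 100%)
        for t in self.alert_thresholds:
            if before < t <= after:
                print(f"ALERT: user {user_id} at {int(t * 100)}% of daily token budget")
        return True
```

The check happens before the LLM call, so a request that would exceed the budget is rejected instead of billed; the same pattern nests for per-session and per-request limits.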

Quick Reference

Strategy           When to Use              Key Benefit
Serverless         Low/bursty traffic       Zero idle cost
Kubernetes         High-volume production   Fine-grained scaling
Response Caching   Repeated queries         70% cost reduction
Model Tiering      Mixed complexity         40% cost reduction
HPA                Variable load            Automatic scaling
Circuit Breaker    External API deps        Graceful degradation
OpenTelemetry      All deployments          Full observability