Guardrails & Safety


Safety Overview

Why Guardrails Matter

The Problem: AI agents with tool access can cause real-world harm if not properly constrained -- deleting files, sending unwanted emails, executing dangerous code, or leaking sensitive data.

The Solution: Layered guardrails that filter inputs, validate outputs, sandbox tool execution, and enforce rate limits create defense-in-depth protection for agent systems.

Real Impact: Organizations with comprehensive guardrails report 95% fewer safety incidents and significantly faster regulatory approval for AI deployments.

Real-World Analogy

Think of guardrails like airport security:

  • Input Guardrails = Security screening before entering the terminal
  • Output Guardrails = Customs inspection before leaving
  • Tool Permissions = Restricted areas requiring special clearance
  • Rate Limiting = Boarding gates controlling passenger flow
  • Audit Logs = CCTV recordings of everything that happens

Defense-in-Depth Layers

Input Filtering

Detect and block prompt injection, jailbreak attempts, and malicious inputs before they reach the agent.

Output Validation

Check agent outputs for PII leakage, harmful content, hallucinated facts, and format compliance.

Tool Sandboxing

Restrict which tools agents can use, with what parameters, and with resource limits on execution.

Rate Limiting

Control how many actions, API calls, and tokens an agent can consume per session or time window.
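The rate-limiting layer is commonly built on a token bucket (the algorithm named in the quick reference below). A minimal sketch, with illustrative class and parameter names:

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter for agent actions."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity        # max tokens the bucket can hold
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self, cost: int = 1) -> bool:
        """Return True and consume `cost` tokens if the action may proceed."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `TokenBucket(capacity=5, refill_rate=1.0)`, an agent gets bursts of up to five actions, then one additional action per second.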

Input Guardrails

Guardrails Architecture
[Figure: pipeline flow -- Input → Input Filter (injection detection, content moderation, PII redaction) → Agent → Output Filter (PII detection, toxicity check, fact verification) → Output]
guardrails.py
import re

class InputGuardrails:
    """Filter and validate agent inputs."""

    INJECTION_PATTERNS = [
        r"ignore (?:all |any )?(?:previous |prior )?instructions",
        r"you are now (?:a |an )",
        r"system prompt",
        r"reveal your (?:instructions|prompt|rules)",
    ]

    @staticmethod
    def detect_injection(text: str) -> bool:
        """Check for prompt injection attempts."""
        for pattern in InputGuardrails.INJECTION_PATTERNS:
            if re.search(pattern, text, re.IGNORECASE):
                return True
        return False

    @staticmethod
    def redact_pii(text: str) -> str:
        """Redact PII from input text."""
        # Email addresses
        text = re.sub(r"[\w.-]+@[\w.-]+\.\w+", "[EMAIL_REDACTED]", text)
        # Phone numbers
        text = re.sub(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b", "[PHONE_REDACTED]", text)
        # SSN
        text = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN_REDACTED]", text)
        return text

    @staticmethod
    def validate(text: str) -> tuple[bool, str]:
        """Return (ok, redacted_text_or_rejection_reason) for an input."""
        if InputGuardrails.detect_injection(text):
            return False, "Potential prompt injection detected"
        if len(text) > 10000:
            return False, "Input exceeds maximum length"
        return True, InputGuardrails.redact_pii(text)

Output Guardrails

Output Validation Strategies

  • PII Detection: Scan outputs for emails, phone numbers, SSNs before sending to user
  • Content Moderation: Use classifier models to detect toxic, harmful, or inappropriate content
  • Schema Validation: Ensure structured outputs match expected JSON schemas
  • Citation Checking: Verify that claims are supported by provided sources
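An output-side counterpart to the InputGuardrails class above might cover the first and third strategies; a minimal sketch (the OutputGuardrails name and its specific checks are illustrative, not from this lesson's code):

```python
import json
import re

class OutputGuardrails:
    """Validate agent outputs before they reach the user."""

    PII_PATTERNS = [
        r"[\w.-]+@[\w.-]+\.\w+",   # email addresses
        r"\b\d{3}-\d{2}-\d{4}\b",  # SSNs
    ]

    @staticmethod
    def contains_pii(text: str) -> bool:
        """Scan an output for PII before sending it to the user."""
        return any(re.search(p, text) for p in OutputGuardrails.PII_PATTERNS)

    @staticmethod
    def validate_schema(text: str, required_keys: set[str]) -> bool:
        """Ensure a structured output parses as JSON with the expected keys."""
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            return False
        return isinstance(data, dict) and required_keys <= data.keys()
```

Content moderation and citation checking usually need a classifier model or retrieval step rather than regexes, so they are omitted here.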

Tool Permissions

tool_permissions.py
class ToolPermissions:
    """Control which tools agents can use."""

    PERMISSION_LEVELS = {
        "read_only": ["search", "read_file", "get_weather"],
        "standard": ["search", "read_file", "write_file", "send_email"],
        "admin": ["search", "read_file", "write_file", "delete_file", "execute_code"],
    }

    @staticmethod
    def check_permission(tool_name: str, level: str = "standard") -> bool:
        """Raise PermissionError if `tool_name` is not allowed at `level`."""
        allowed = ToolPermissions.PERMISSION_LEVELS.get(level, [])
        if tool_name not in allowed:
            raise PermissionError(
                f"Tool '{tool_name}' not allowed at '{level}' level"
            )
        return True
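The same idea can be enforced declaratively with a decorator so every tool function is gated at the call site. A self-contained sketch (the allowlist and tool functions here are hypothetical, not this lesson's):

```python
from functools import wraps

# Hypothetical allowlist; in practice this would mirror PERMISSION_LEVELS.
ALLOWED = {"read_only": {"search", "read_file"}}

def require_permission(level: str):
    """Decorator that blocks a tool function unless `level` permits it."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if fn.__name__ not in ALLOWED.get(level, set()):
                raise PermissionError(
                    f"Tool '{fn.__name__}' not allowed at '{level}' level"
                )
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@require_permission("read_only")
def search(query: str) -> str:
    return f"results for {query}"

@require_permission("read_only")
def delete_file(path: str) -> str:
    return f"deleted {path}"
```

Here `search("guardrails")` succeeds, while `delete_file("/tmp/x")` raises PermissionError because deletion is not in the read-only set.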

Responsible AI

Common Pitfall

Problem: Guardrails are added as an afterthought and can be bypassed through multi-turn conversations or indirect prompt injection via tool outputs.

Solution: Implement guardrails at every layer -- input, agent reasoning, tool execution, and output. Use a separate guardrails model that evaluates each step independently from the agent itself.
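One way to wire those layers together is a pipeline that applies each check in order and fails closed on the first violation. A sketch assuming simple callable checks in the shape of `InputGuardrails.validate` above (names here are illustrative):

```python
from typing import Callable

# Each check takes text and returns (ok, detail): on success, detail is the
# (possibly transformed) text; on failure, it is the rejection reason.
Check = Callable[[str], tuple[bool, str]]

def run_pipeline(text: str, checks: list[Check]) -> tuple[bool, str]:
    """Apply guardrail checks in order, stopping at the first violation."""
    for check in checks:
        ok, detail = check(text)
        if not ok:
            return False, detail
        text = detail  # a passing check may have redacted or normalized text
    return True, text

def max_length(text: str) -> tuple[bool, str]:
    """Hypothetical length check mirroring InputGuardrails.validate."""
    if len(text) > 10000:
        return False, "Input exceeds maximum length"
    return True, text
```

The same `run_pipeline` can run once on the input, once per tool result, and once on the final output, so each layer is evaluated independently of the agent's own reasoning.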

Quick Reference

Each guardrail type, its purpose, and a typical implementation:

  • Prompt Injection Detection: Block adversarial inputs (regex + classifier model)
  • PII Redaction: Remove sensitive data (pattern matching + NER)
  • Content Moderation: Block harmful content (classification API)
  • Tool Sandboxing: Limit tool capabilities (permission levels + Docker)
  • Rate Limiting: Prevent resource abuse (token bucket algorithm)
  • Audit Logging: Track all agent actions (structured logging + SIEM)
  • Human Escalation: Flag uncertain decisions (confidence thresholds)
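Audit logging from the quick reference is typically emitted as structured JSON so a SIEM can ingest it. A minimal sketch using only the standard library (the logger name and record fields are illustrative):

```python
import json
import logging

logger = logging.getLogger("agent.audit")

def audit_log(action: str, tool: str, allowed: bool, **extra) -> str:
    """Emit one structured JSON record per agent action and return it."""
    record = {"action": action, "tool": tool, "allowed": allowed, **extra}
    line = json.dumps(record, sort_keys=True)
    logger.info(line)  # route to a file/SIEM via standard logging handlers
    return line
```

Keeping one flat JSON object per action makes the trail easy to filter by tool, permission decision, or session after an incident.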