Security, IAM & Network Controls

Difficulty: Hard | 25 min read

Identity & Access Management

Why Security Matters in Databricks

The Problem: Data platforms handle sensitive business data, PII, and financial records. A misconfigured access control can lead to data breaches, compliance violations, and significant financial penalties.

The Solution: Databricks provides a layered security model with identity management, fine-grained access controls via Unity Catalog, network isolation, encryption, and comprehensive audit logging.

Real Impact: Organizations in regulated industries (healthcare, finance, government) use Databricks security features to achieve HIPAA, SOC 2, FedRAMP, and GDPR compliance.

Databricks identity management operates at two levels: account-level (across all workspaces) and workspace-level (within a single workspace). Understanding this hierarchy is essential for designing secure multi-team environments.

Databricks IAM Role Mapping Architecture
[Diagram: an identity provider (Azure AD / Okta / OneLogin) syncs users, groups, and service principals to the account level via SCIM, giving centralized identity management across all workspaces. Account-level identities are then assigned per workspace, e.g. Production (WS admin, data engineer, data analyst, read-only), Staging (WS admin, data engineer, CI/CD service principal), and Development (WS admin, all engineers, data scientists).]
+-----------------+------------------+-----------------------------------------------------------+
| Role            | Scope            | Capabilities                                              |
+-----------------+------------------+-----------------------------------------------------------+
| Account Admin   | All workspaces   | Manage users, groups, workspaces, Unity Catalog metastore |
| Workspace Admin | Single workspace | Manage workspace settings, clusters, permissions          |
| Metastore Admin | Unity Catalog    | Manage catalogs, schemas, grants, data lineage            |
| Users           | Workspace        | Access granted resources, run notebooks, submit jobs      |
+-----------------+------------------+-----------------------------------------------------------+
Key Takeaway: Follow least-privilege access: assign permissions at the Unity Catalog level (catalog > schema > table), not at the workspace level. Use groups (not individual users) for permission management to simplify access reviews and offboarding.
Output
GRANT USE CATALOG ON CATALOG prod_catalog TO `data-engineers`;
GRANT SELECT ON SCHEMA prod_catalog.gold TO `analysts`;
GRANT ALL PRIVILEGES ON SCHEMA prod_catalog.silver TO `data-engineers`;
-- Result: 3 permission grants applied successfully

SHOW GRANTS ON SCHEMA prod_catalog.gold;
+----------------+------------+--------------------------+
| principal      | action     | object                   |
+----------------+------------+--------------------------+
| analysts       | SELECT     | prod_catalog.gold        |
+----------------+------------+--------------------------+
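The group-based, least-privilege pattern above can be prototyped as plain Python that renders GRANT statements from a team-to-privilege matrix. This is an illustrative sketch: the group, catalog, and schema names are placeholders, not from a real workspace.

```python
# Map each team (a Databricks group) to the privileges it needs per securable.
# All names below are illustrative placeholders.
GRANT_MATRIX = {
    "data-engineers": {
        "CATALOG prod_catalog": ["USE CATALOG"],
        "SCHEMA prod_catalog.silver": ["ALL PRIVILEGES"],
    },
    "analysts": {
        "CATALOG prod_catalog": ["USE CATALOG"],
        "SCHEMA prod_catalog.gold": ["SELECT"],
    },
}

def render_grants(matrix):
    """Render Unity Catalog GRANT statements from a team/privilege matrix."""
    statements = []
    for group, objects in matrix.items():
        for securable, privileges in objects.items():
            for priv in privileges:
                statements.append(f"GRANT {priv} ON {securable} TO `{group}`;")
    return statements

for stmt in render_grants(GRANT_MATRIX):
    print(stmt)
```

Keeping the matrix in version control makes access reviews a diff, not an audit: a new team member joins a group, and no individual grants need to change.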

Service Principals

Service principals are non-human identities used for automation, CI/CD pipelines, and programmatic access. They are the recommended authentication method for production workloads -- never use personal access tokens from human users in automated systems.

Python - Service Principal Authentication
from databricks.sdk import WorkspaceClient

# Authenticate using service principal (OAuth M2M)
w = WorkspaceClient(
    host="https://adb-1234567890.12.azuredatabricks.net",
    client_id="your-service-principal-client-id",
    client_secret="your-service-principal-secret"
)

# List clusters using the service principal identity
for cluster in w.clusters.list():
    print(f"{cluster.cluster_name}: {cluster.state}")

# Create a service principal via the Account API
from databricks.sdk import AccountClient

account = AccountClient(
    host="https://accounts.azuredatabricks.net",
    account_id="your-account-id",
    client_id="admin-sp-client-id",
    client_secret="admin-sp-secret"
)

# Create new service principal
sp = account.service_principals.create(
    display_name="etl-pipeline-sp",
    active=True
)
print(f"Created SP: {sp.application_id}")

# Grant workspace access to the service principal
from databricks.sdk.service.iam import WorkspacePermission

account.workspace_assignment.update(
    workspace_id=1234567890,
    principal_id=sp.id,
    permissions=[WorkspacePermission.USER]
)

Common Mistake

Wrong: Using a personal user account for CI/CD pipeline authentication

Why it fails: Personal tokens expire, are tied to individual employees (who may leave), and grant that user's full permissions to the pipeline.

Instead: Create a dedicated service principal with only the permissions needed for the pipeline. Use OAuth M2M authentication with client_id and client_secret stored in CI/CD secrets.

Token Authentication

Databricks supports multiple authentication methods. Personal access tokens (PATs) are the simplest but least secure. OAuth tokens with service principals are recommended for production.

Personal Access Tokens (PATs)

Generated per user, scoped to a workspace. Good for development and testing. Set short TTLs and rotate regularly. Never commit to source control.

OAuth (M2M)

Service principal authentication using OAuth client credentials flow. Recommended for CI/CD and production automation. Tokens auto-rotate.

Azure AD Tokens

For Azure Databricks, use Azure AD tokens with managed identities. Integrates with Azure RBAC for unified access control.

AWS IAM Credentials

For AWS Databricks, use instance profiles and IAM roles. Provides temporary credentials that auto-rotate without manual management.
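As a rough sketch of how these options fit together in practice, the helper below picks authentication parameters from environment variables, mirroring (in simplified form) the precedence the Databricks SDK's unified authentication applies: an explicit PAT wins, then OAuth M2M client credentials. The variable names (DATABRICKS_TOKEN, DATABRICKS_CLIENT_ID, DATABRICKS_CLIENT_SECRET) follow the SDK's conventions, but the function itself is illustrative and not part of the SDK.

```python
import os

def auth_kwargs_from_env(env=None):
    """Choose WorkspaceClient auth kwargs from environment variables.

    Simplified sketch: the real SDK's unified authentication handles more
    methods (Azure AD, instance profiles) and does full validation.
    """
    env = os.environ if env is None else env
    if env.get("DATABRICKS_TOKEN"):
        # Personal access token takes precedence
        return {"token": env["DATABRICKS_TOKEN"]}
    if env.get("DATABRICKS_CLIENT_ID") and env.get("DATABRICKS_CLIENT_SECRET"):
        # OAuth M2M client credentials (service principal)
        return {
            "client_id": env["DATABRICKS_CLIENT_ID"],
            "client_secret": env["DATABRICKS_CLIENT_SECRET"],
        }
    raise RuntimeError("No Databricks credentials found in environment")

# Example: OAuth M2M credentials as injected by a CI/CD system
kwargs = auth_kwargs_from_env({"DATABRICKS_CLIENT_ID": "sp-id",
                               "DATABRICKS_CLIENT_SECRET": "sp-secret"})
print(kwargs["client_id"])  # sp-id
```

In a pipeline, the secrets live in the CI/CD system's secret store and are exported as environment variables at runtime, so nothing sensitive is committed to the repository.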

Output
$ databricks tokens create --lifetime-seconds 86400 --comment "ci-cd-daily"
{
  "token_value": "dapi_a1b2c3d4e5f6...",
  "token_info": {
    "token_id": "12345",
    "creation_time": 1710547200,
    "expiry_time": 1710633600,
    "comment": "ci-cd-daily"
  }
}
WARNING: Token shown only once. Store securely.
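The "rotate regularly" rule for PATs is easy to automate once you have token metadata like the `token_info` above. The sketch below flags tokens past a maximum age; it assumes `creation_time` is in epoch seconds (matching the sample output above), and the token records are illustrative.

```python
import time

MAX_AGE_SECONDS = 90 * 24 * 3600  # illustrative policy: rotate after 90 days

def tokens_to_rotate(token_infos, now=None):
    """Return token_ids whose age (from creation_time, epoch seconds) exceeds policy."""
    now = time.time() if now is None else now
    return [t["token_id"] for t in token_infos
            if now - t["creation_time"] > MAX_AGE_SECONDS]

# Example token list shaped like the CLI output above (values illustrative)
tokens = [
    {"token_id": "12345", "creation_time": 1710547200, "comment": "ci-cd-daily"},
    {"token_id": "99999", "creation_time": 1680000000, "comment": "stale-token"},
]

# Fix "now" to 2024-03-20 so the example is deterministic
NOW = 1710892800
print(tokens_to_rotate(tokens, now=NOW))  # ['99999']
```

A scheduled job can run this check against the token list API and revoke or alert on anything it returns.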

Network Security (VPC/Private Link)

Network security in Databricks involves isolating the data plane within your cloud VPC/VNet and controlling all network traffic paths. For enterprise deployments, Private Link eliminates data traversal over the public internet.

Databricks Network Security Architecture
[Diagram: inbound traffic from the public internet passes through IP access lists / WAF before reaching the Databricks control plane (web app / API, cluster manager, Unity Catalog, jobs scheduler). A private connection links the control plane to your VPC/VNet, where a private subnet hosts Spark workers, driver nodes, and SQL warehouses, and a storage subnet reaches S3 / ADLS Delta tables via service endpoints. Network security groups / NACLs, Key Vault / KMS, and a private DNS zone complete the perimeter.]

IP Access Lists

IP access lists restrict which IP addresses can access the Databricks workspace API and UI. This provides an additional security layer on top of network controls, ensuring only traffic from approved corporate networks can reach your workspace.

Python - Manage IP Access Lists
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.settings import ListType

w = WorkspaceClient()

# Create an IP access list (allow list)
ip_list = w.ip_access_lists.create(
    label="Corporate VPN",
    list_type=ListType.ALLOW,
    ip_addresses=[
        "10.0.0.0/8",         # Internal network
        "203.0.113.0/24",     # Office IP range
        "198.51.100.50/32",   # VPN gateway
    ]
)

# Enable IP access list enforcement
w.workspace_conf.set_status({
    "enableIpAccessLists": "true"
})

# List all configured IP access lists
for acl in w.ip_access_lists.list():
    print(f"{acl.label}: {acl.ip_addresses}")
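Conceptually, an allow list is evaluated as CIDR membership: a request is admitted if its source IP falls inside any allowed block. Python's stdlib ipaddress module demonstrates the matching rule (the ranges reuse the example above; this illustrates the semantics, not Databricks' internal implementation).

```python
import ipaddress

# Same example CIDR ranges as the allow list above
ALLOW_LIST = ["10.0.0.0/8", "203.0.113.0/24", "198.51.100.50/32"]

def is_allowed(client_ip, allow_list=ALLOW_LIST):
    """True if client_ip falls inside any allowed CIDR block."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in allow_list)

print(is_allowed("10.42.7.1"))      # True  (inside 10.0.0.0/8)
print(is_allowed("198.51.100.50"))  # True  (exact /32 match)
print(is_allowed("8.8.8.8"))        # False (not in any range)
```

Note the asymmetry this implies: a /8 admits ~16 million addresses, so keep corporate ranges as narrow as the VPN design allows.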
Deep Dive: Private Link vs VPC Peering

VPC Peering connects your VPC to the Databricks control plane VPC via AWS/Azure peering -- simple but exposes the entire VPC network. Private Link creates a private endpoint in your VPC that routes only Databricks traffic through the AWS/Azure backbone, never the public internet. Private Link is more secure (no public IP exposure), more compliant (required for HIPAA/PCI), and easier to manage (no complex routing tables). The tradeoff: Private Link requires additional setup per workspace and has a small monthly cost ($7-10/endpoint). For production workloads handling sensitive data, always use Private Link.

Secrets Management

Databricks secrets provide a secure way to store and access sensitive information like API keys, database passwords, and connection strings. Secrets are stored in scopes and can be backed by Databricks-managed storage or external vaults like Azure Key Vault or AWS Secrets Manager.

Python - Secrets API
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Create a secret scope
w.secrets.create_scope(scope="production-secrets")

# Store a secret
w.secrets.put_secret(
    scope="production-secrets",
    key="database-password",
    string_value="super-secret-password-123"
)

# In a notebook, retrieve the secret
# Secrets are REDACTED in notebook output
password = dbutils.secrets.get(
    scope="production-secrets",
    key="database-password"
)

# Use the secret in a JDBC connection
df = spark.read.format("jdbc").options(
    url="jdbc:postgresql://db-host:5432/mydb",
    dbtable="public.users",
    user="etl_user",
    password=password
).load()

# Grant access to a group
w.secrets.put_acl(
    scope="production-secrets",
    principal="data-engineers",
    permission="READ"
)

# List secrets (keys only, values are never exposed)
for secret in w.secrets.list_secrets(scope="production-secrets"):
    print(f"Key: {secret.key}, Last Updated: {secret.last_updated_timestamp}")
Key Takeaway: Never store secrets in notebooks, environment variables, or Git repos. Use Databricks secret scopes with ACLs. In notebooks, access secrets with dbutils.secrets.get(scope, key) -- the value is automatically redacted in notebook output and logs.
Output
$ databricks secrets list-secrets my-app-secrets
+------------------+-------------------+
| key              | last_updated      |
+------------------+-------------------+
| db-password      | 2024-03-15 10:23  |
| api-key          | 2024-03-10 14:45  |
| encryption-key   | 2024-02-28 09:12  |
+------------------+-------------------+
(Values are never shown -- only accessible via dbutils.secrets.get)
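The redaction behavior described above can be pictured with a toy sketch: any known secret value is replaced with a placeholder before output is rendered. Databricks does this automatically for values fetched via dbutils.secrets.get; the function and secret values here are purely illustrative.

```python
# Illustrative set of secret values the platform knows about
KNOWN_SECRETS = {"super-secret-password-123", "sk-live-abc"}

def redact(text, secrets=KNOWN_SECRETS):
    """Replace any known secret value appearing in the text with a placeholder."""
    for value in secrets:
        text = text.replace(value, "[REDACTED]")
    return text

msg = "connecting with password=super-secret-password-123"
print(redact(msg))  # connecting with password=[REDACTED]
```

This is also why secrets should never be manipulated character-by-character in notebooks (e.g. printed one letter at a time): redaction matches whole values, so transformed output can leak them.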

Audit Logging

Databricks audit logs capture detailed records of all actions performed in your workspace, including who accessed what data, when clusters were created, and which notebooks were executed. These logs are essential for compliance, security investigations, and operational monitoring.

Audit Log Categories

  • Workspace events: Login, notebook access, cluster operations
  • Account events: User management, workspace provisioning
  • Unity Catalog events: Data access, grant changes, lineage queries
  • DBSQL events: SQL warehouse queries, dashboard access
  • Secrets events: Secret scope access, secret retrieval
SQL - Query Audit Logs
-- Audit logs delivered to your cloud storage (S3/ADLS/GCS)
-- Create an external table over the audit log files

CREATE TABLE IF NOT EXISTS audit.logs
USING JSON
LOCATION 's3://your-audit-bucket/databricks/audit-logs/';

-- Find all failed login attempts in the last 24 hours
SELECT
    timestamp,
    userIdentity.email AS user_email,
    sourceIPAddress,
    requestParams.user AS target_user,
    response.statusCode
FROM audit.logs
WHERE actionName = 'login'
    AND response.statusCode != 200
    AND timestamp >= current_timestamp() - INTERVAL 24 HOURS
ORDER BY timestamp DESC;

-- Track who accessed sensitive tables
SELECT
    timestamp,
    userIdentity.email,
    actionName,
    requestParams.full_name_arg AS table_name
FROM audit.logs
WHERE serviceName = 'unityCatalog'
    AND actionName IN ('getTable', 'readVolume')
    AND requestParams.full_name_arg LIKE '%pii%'
ORDER BY timestamp DESC;

Common Mistake

Wrong: Not enabling audit logging until after a security incident

Why it fails: Without audit logs, you cannot determine what data was accessed, by whom, or when. Compliance audits and incident investigations become impossible.

Instead: Enable audit logging on Day 1. Configure log delivery to a secure S3/ADLS bucket with retention policies. Set up alerts for suspicious patterns (bulk data downloads, permission changes, failed auth attempts).
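The "alert on suspicious patterns" advice can be prototyped against the same audit records: count failed logins per source IP over a batch of events and flag IPs above a threshold. The event shape mirrors the audit-log fields queried in the SQL above; the threshold and sample events are illustrative.

```python
from collections import Counter

FAILED_LOGIN_THRESHOLD = 3  # illustrative alerting threshold

def suspicious_ips(events, threshold=FAILED_LOGIN_THRESHOLD):
    """Flag source IPs with at least `threshold` failed logins in this batch."""
    failures = Counter(
        e["sourceIPAddress"]
        for e in events
        if e["actionName"] == "login" and e["response"]["statusCode"] != 200
    )
    return sorted(ip for ip, n in failures.items() if n >= threshold)

events = [
    {"actionName": "login", "sourceIPAddress": "203.0.113.9", "response": {"statusCode": 401}},
    {"actionName": "login", "sourceIPAddress": "203.0.113.9", "response": {"statusCode": 401}},
    {"actionName": "login", "sourceIPAddress": "203.0.113.9", "response": {"statusCode": 401}},
    {"actionName": "login", "sourceIPAddress": "10.0.0.5", "response": {"statusCode": 200}},
]
print(suspicious_ips(events))  # ['203.0.113.9']
```

In production this logic would run as a scheduled query or streaming job over the audit table, feeding an alerting channel rather than printing.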

Practice Problems

Problem 1: Secure Multi-Team Access Design

Hard

Your organization has three teams: Data Engineering, Data Science, and Business Analytics. Data Engineers need full access to all bronze/silver/gold tables. Data Scientists need read access to silver/gold plus write access to a sandbox catalog. Business Analysts need read-only access to gold tables. Design the Unity Catalog grants and workspace configuration.

Problem 2: Network Security Architecture

Hard

Your company requires that no data traffic traverses the public internet, all API access comes from the corporate VPN, and all encryption keys are customer-managed. Design the network and security architecture for a Databricks deployment on AWS.

Problem 3: Secret Rotation Strategy

Medium

Your team stores database credentials and API keys in Databricks secret scopes. Currently, secrets are rotated manually every 90 days, which sometimes gets forgotten. Design an automated secret rotation strategy.

Quick Reference

Security Cheat Sheet

+-----------------------+-----------------------------------+--------------------------------------------+
| Feature               | Purpose                           | Best Practice                              |
+-----------------------+-----------------------------------+--------------------------------------------+
| Service Principals    | Non-human authentication          | Use for all CI/CD and automation           |
| Unity Catalog         | Fine-grained data access control  | Grant at schema level, not table level     |
| Private Link          | Eliminate public internet traffic | Required for regulated industries          |
| IP Access Lists       | Restrict API/UI access by IP      | Allow only VPN/office CIDRs                |
| Secrets               | Secure credential storage         | Use vault-backed scopes with auto-rotation |
| Audit Logs            | Compliance and investigation      | Deliver to cloud storage, query with SQL   |
| Customer-Managed Keys | Encryption key control            | Use for HIPAA/PCI workloads                |
+-----------------------+-----------------------------------+--------------------------------------------+