Delta Lake: The Lakehouse Foundation


What is Delta Lake?

Why This Matters

The Problem: Traditional data lakes built on Parquet or CSV files lack reliability guarantees. Failed writes leave corrupted data, concurrent readers and writers conflict, and there is no way to undo mistakes or audit changes.

The Solution: Delta Lake adds a transaction log layer on top of Parquet files, providing ACID transactions, time travel, schema enforcement, and unified batch/streaming support -- turning your data lake into a reliable lakehouse.

Real Impact: Delta Lake is the default storage format on Databricks. Over 10,000 organizations rely on it to manage exabytes of data with the reliability of a data warehouse and the flexibility of a data lake.

Real-World Analogy

Think of Delta Lake as a version-controlled file system:

  • Parquet files = The actual documents (your data)
  • Transaction log (_delta_log/) = Git history -- every change is recorded
  • Time travel = git checkout to any past commit
  • ACID transactions = Every save is atomic -- no half-written files
  • Schema enforcement = A strict template that rejects malformed data
Delta Lake Architecture
[Architecture diagram] Delta Lake has three layers:

  • Cloud object storage (S3 / ADLS / GCS) -- the Parquet data files (part-00000.parquet, part-00001.parquet, ...), partitioned by year / month / day
  • Delta transaction log (_delta_log/) -- versioned commit files (000000.json for v0, 000001.json for v1, ...) plus periodic checkpoint files (e.g. 000010.checkpoint). Each JSON file records files added, files removed, metadata changes, and schema references
  • Query engines -- Spark SQL, PySpark, the Photon engine, and SQL warehouses; streaming reads follow the log
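Each commit file in _delta_log/ is newline-delimited JSON, one "action" per line (add, remove, metaData, commitInfo). A minimal sketch of how a reader tallies the files a commit touches -- the commit contents below are illustrative, not from a real table:

```python
import json

# A simplified two-line commit, modeled on the Delta transaction log
# protocol: one commitInfo action plus one add action. Paths and sizes
# here are made up for illustration.
commit_v1 = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE", "timestamp": 1704067200000}}),
    json.dumps({"add": {"path": "part-00000.parquet", "size": 1024, "dataChange": True}}),
])

def summarize_commit(commit_text):
    """Collect the files added and removed by one commit, as a log reader would."""
    added, removed = [], []
    for line in commit_text.splitlines():
        action = json.loads(line)
        if "add" in action:
            added.append(action["add"]["path"])
        elif "remove" in action:
            removed.append(action["remove"]["path"])
    return {"added": added, "removed": removed}

print(summarize_commit(commit_v1))
# {'added': ['part-00000.parquet'], 'removed': []}
```

Replaying these add/remove actions from v0 forward (or from the latest checkpoint) is how an engine reconstructs the set of live files for any version.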

ACID Transactions

Delta Lake provides full ACID (Atomicity, Consistency, Isolation, Durability) guarantees on your data lake. This means every write operation either fully succeeds or fully fails -- no partial or corrupted data.

Atomicity

Each write (INSERT, UPDATE, DELETE, MERGE) either fully completes or fully rolls back. No partial writes pollute your table.

Consistency

Schema enforcement ensures only valid data enters the table. Constraints and expectations catch data quality issues at write time.

Isolation

Concurrent readers never see partial writes. Writers use optimistic concurrency control to resolve conflicts automatically.

Durability

Once a write commits (appears in the transaction log), it is permanent. Data is stored on resilient cloud object storage.
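The optimistic concurrency mentioned under Isolation can be sketched in a few lines. This is a toy model, not the Delta implementation: each writer records the table version it read, and a commit succeeds only if no one else committed in the meantime; a loser re-reads and retries.

```python
# Toy model of optimistic concurrency control over a versioned log.
class ToyDeltaLog:
    def __init__(self):
        self.version = 0

    def try_commit(self, read_version):
        """Commit only if the table is still at the version this writer read."""
        if self.version != read_version:
            return False               # conflict: another writer got there first
        self.version += 1
        return True

log = ToyDeltaLog()

v = log.version                        # writer A reads version 0
assert log.try_commit(v)               # A commits; table moves to version 1

stale = v                              # writer B also read version 0
assert not log.try_commit(stale)       # B's commit conflicts...
assert log.try_commit(log.version)     # ...so B re-reads (v1) and retries
```

Real Delta writers are smarter: after a conflict they check whether the other commit actually touched the same files, and many concurrent operations (e.g. appends to different partitions) commit without any retry.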

Python - Creating and Writing Delta Tables
# Create a Delta table from a DataFrame
df.write.format("delta").mode("overwrite").save("/mnt/data/silver/employees")

# Create a managed Delta table (registered in metastore)
df.write.format("delta").saveAsTable("main.hr.employees")

# Append new data
new_employees.write.format("delta").mode("append").save("/mnt/data/silver/employees")

# Partitioned Delta table
events.write.format("delta") \
    .partitionBy("year", "month") \
    .mode("overwrite") \
    .save("/mnt/data/bronze/events")
SQL - Delta Table Operations
-- Create a Delta table with SQL
CREATE TABLE main.hr.employees (
    id INT,
    name STRING,
    department STRING,
    salary DOUBLE,
    hire_date DATE
) USING DELTA
PARTITIONED BY (department);

-- INSERT data
INSERT INTO main.hr.employees VALUES
    (1, 'Alice', 'Engineering', 95000, '2023-01-15'),
    (2, 'Bob', 'Data Science', 105000, '2022-06-01');

-- UPDATE specific rows
UPDATE main.hr.employees
SET salary = salary * 1.10
WHERE department = 'Engineering';

-- DELETE rows
DELETE FROM main.hr.employees
WHERE hire_date < '2020-01-01';

Time Travel

Every change to a Delta table is versioned. You can query any previous version of your data by version number or timestamp -- perfect for auditing, debugging, and recovering from mistakes.

Delta Time Travel: Versioned Data
[Timeline diagram] Version history of a table:

  • v0 -- CREATE TABLE (Jan 1)
  • v1 -- INSERT 1000 rows (Jan 5)
  • v2 -- UPDATE salaries (Jan 10)
  • v3 -- MERGE new data (Jan 15)
  • v4 -- DELETE bad rows (current version)

SELECT * FROM table VERSION AS OF 1 reads the table as it existed at v1.
SQL - Time Travel Queries
-- Query a specific version
SELECT * FROM main.hr.employees VERSION AS OF 2;

-- Query as of a timestamp
SELECT * FROM main.hr.employees TIMESTAMP AS OF '2024-01-10T00:00:00';

-- View table history
DESCRIBE HISTORY main.hr.employees;

-- Compare two versions (data diff)
SELECT * FROM main.hr.employees VERSION AS OF 4
EXCEPT
SELECT * FROM main.hr.employees VERSION AS OF 3;

-- Restore to a previous version (undo mistakes)
RESTORE TABLE main.hr.employees TO VERSION AS OF 3;
Python - Time Travel with PySpark
# Read a specific version
df_v2 = spark.read.format("delta") \
    .option("versionAsOf", 2) \
    .load("/mnt/data/silver/employees")

# Read as of a timestamp
df_past = spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-10") \
    .load("/mnt/data/silver/employees")

# View table history programmatically
from delta.tables import DeltaTable
dt = DeltaTable.forPath(spark, "/mnt/data/silver/employees")
history = dt.history()
display(history.select("version", "timestamp", "operation", "operationMetrics"))

Schema Evolution

Delta Lake enforces schema by default -- writes that do not match the table schema are rejected. You can opt into schema evolution when you need to add new columns (mergeSchema) or replace the schema outright (overwriteSchema).

Python - Schema Evolution
# Schema enforcement: this will FAIL if new_df has extra columns
new_df.write.format("delta").mode("append").save("/mnt/data/silver/employees")

# Enable schema evolution: automatically add new columns
new_df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", True) \
    .save("/mnt/data/silver/employees")

# Overwrite schema entirely (replace all columns)
new_schema_df.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", True) \
    .save("/mnt/data/silver/employees")
SQL - Schema Changes
-- Add a new column
ALTER TABLE main.hr.employees ADD COLUMN email STRING;

-- Change column type (only widening changes, e.g. INT to BIGINT;
-- requires the type widening table feature on recent Delta versions)
ALTER TABLE main.hr.employees ALTER COLUMN id TYPE BIGINT;

-- Rename a column (requires column mapping: 'delta.columnMapping.mode' = 'name')
ALTER TABLE main.hr.employees RENAME COLUMN name TO full_name;

-- Add a table constraint
ALTER TABLE main.hr.employees ADD CONSTRAINT salary_positive CHECK (salary > 0);

MERGE Operations (Upserts)

MERGE is the most powerful Delta Lake operation -- it lets you insert, update, and delete rows in a single atomic operation based on matching conditions. This is essential for CDC (change data capture) and slowly changing dimensions.

SQL - MERGE (Upsert)
-- Classic upsert: insert new, update existing
MERGE INTO main.hr.employees AS target
USING staging.employee_updates AS source
ON target.id = source.id
WHEN MATCHED THEN
    UPDATE SET
        target.name = source.name,
        target.department = source.department,
        target.salary = source.salary
WHEN NOT MATCHED THEN
    INSERT (id, name, department, salary)
    VALUES (source.id, source.name, source.department, source.salary);

-- Full CDC merge with deletes
MERGE INTO main.hr.employees AS target
USING cdc_changes AS source
ON target.id = source.id
WHEN MATCHED AND source.operation = 'DELETE' THEN DELETE
WHEN MATCHED AND source.operation = 'UPDATE' THEN
    UPDATE SET *
WHEN NOT MATCHED AND source.operation = 'INSERT' THEN
    INSERT *;
Python - MERGE with DeltaTable API
from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "/mnt/data/silver/employees")

dt.alias("target").merge(
    updates_df.alias("source"),
    "target.id = source.id"
).whenMatchedUpdate(
    set={
        "name": "source.name",
        "salary": "source.salary",
        "updated_at": "current_timestamp()"
    }
).whenNotMatchedInsert(
    values={
        "id": "source.id",
        "name": "source.name",
        "salary": "source.salary",
        "created_at": "current_timestamp()"
    }
).execute()

OPTIMIZE & Z-ORDER

Over time, Delta tables accumulate many small files from streaming writes and incremental appends. OPTIMIZE compacts these into larger files for better read performance. Z-ORDER co-locates related data for faster filtered queries.

SQL - OPTIMIZE and Z-ORDER
-- Compact small files into larger ones (target: 1GB per file)
OPTIMIZE main.hr.employees;

-- OPTIMIZE with Z-ORDER on frequently filtered columns
OPTIMIZE main.events.clickstream
ZORDER BY (user_id, event_date);

-- OPTIMIZE only specific partitions
OPTIMIZE main.events.clickstream
WHERE event_date >= '2024-01-01'
ZORDER BY (user_id);

-- View file statistics
DESCRIBE DETAIL main.hr.employees;

-- The numFiles and sizeInBytes columns in its output show the file count
-- and total size, useful for comparing before and after OPTIMIZE

When to Z-ORDER

Z-ORDER is most effective for columns that appear frequently in WHERE clauses. Choose 1-4 columns with high cardinality. Common choices include: user_id, customer_id, event_date, region. Z-ORDER reshuffles data across files so that rows with similar Z-ORDER column values are stored together, enabling data skipping to eliminate entire files from scans.
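The idea behind Z-ordering can be illustrated with Morton codes: interleave the bits of two column values so rows that are close in both columns get nearby sort keys and land in the same files. This is a conceptual sketch only -- Delta's actual implementation works differently (range-based clustering over the curve), but the clustering effect is the same:

```python
# Illustrative Morton-code sketch of Z-ordering (not Delta's implementation).
def z_order_key(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one sort key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x fills even bit positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y fills odd bit positions
    return key

# Rows with similar (user_id, day) values sort next to each other, so
# per-file min/max statistics let the engine skip whole files on filters.
rows = [(5, 1), (5, 2), (6, 1), (1000, 90)]
rows.sort(key=lambda r: z_order_key(*r))
print(rows)
# [(5, 1), (6, 1), (5, 2), (1000, 90)]
```

Note that the three small pairs cluster together while the distant (1000, 90) sorts last -- a linear sort on either column alone could not keep both dimensions clustered at once.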

VACUUM

VACUUM removes files that are no longer referenced by the current version of the Delta table. This reclaims storage but also removes the ability to time-travel to versions that depended on those files.

SQL - VACUUM Operations
-- Remove files older than 7 days (default retention)
VACUUM main.hr.employees;

-- Remove files older than 30 days
VACUUM main.hr.employees RETAIN 720 HOURS;

-- Dry run: see what would be deleted without deleting
VACUUM main.hr.employees RETAIN 168 HOURS DRY RUN;
Python - VACUUM with DeltaTable
from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "/mnt/data/silver/employees")

# Vacuum with 7-day retention (168 hours)
dt.vacuum(retentionHours=168)

# WARNING: Setting retention to 0 hours loses ALL time travel
# Only do this if you are certain you do not need history
# spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
# dt.vacuum(retentionHours=0)

VACUUM Best Practices

  • Never set retention below 7 days in production (to protect long-running queries)
  • Schedule VACUUM as a daily or weekly maintenance job
  • Use DRY RUN first to see how much storage will be reclaimed
  • After VACUUM, time travel only works for versions within the retention period
  • Combine with OPTIMIZE: run OPTIMIZE first, then VACUUM
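The retention rule above can be modeled in a few lines. This is a toy simulation with a hypothetical file list, not a real Delta API: a file is a VACUUM candidate only if the current version no longer references it and it is older than the retention cutoff.

```python
from datetime import datetime, timedelta

# Toy model of the VACUUM retention check (hypothetical file inventory).
def vacuum_candidates(files, referenced, retention_hours, now):
    """Return unreferenced files older than now minus the retention window."""
    cutoff = now - timedelta(hours=retention_hours)
    return [
        path for path, modified in files.items()
        if path not in referenced and modified < cutoff
    ]

now = datetime(2024, 2, 1)
files = {
    "part-old.parquet":    datetime(2024, 1, 10),  # unreferenced, 22 days old
    "part-recent.parquet": datetime(2024, 1, 31),  # unreferenced, 1 day old
    "part-live.parquet":   datetime(2024, 1, 1),   # still in current version
}
referenced = {"part-live.parquet"}

# With the 168-hour (7-day) default, only the old unreferenced file qualifies;
# the recent one survives so time travel and in-flight readers still work.
print(vacuum_candidates(files, referenced, 168, now))
# ['part-old.parquet']
```

Shrinking the retention window widens the candidate set, which is exactly why retention below 7 days is risky: recently removed files may still back older versions or long-running queries.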

Practice Problems

Problem 1: Implement an SCD Type 2 with MERGE

Medium

You have a customers Delta table with columns (id, name, email, is_current, valid_from, valid_to). Write a MERGE statement that: (1) Updates changed records by setting is_current=false and valid_to=today, (2) Inserts the new version with is_current=true.

Problem 2: Data Recovery with Time Travel

Easy

A colleague accidentally deleted all rows where department='Engineering' from the employees table. The table is currently at version 15, and the DELETE happened at version 14. Write the commands to: (1) Verify what was deleted, (2) Restore the deleted rows without affecting other changes.

Problem 3: Optimize a Slow Query

Medium

Your clickstream table has 500 million rows across 50,000 small files. Queries filtering on user_id and event_date are slow. Design an optimization strategy using OPTIMIZE, Z-ORDER, partitioning, and VACUUM.

Quick Reference

Command           | Purpose                 | Key Detail
DESCRIBE HISTORY  | View version history    | Shows all operations, timestamps, metrics
VERSION AS OF     | Time travel by version  | Read any past version of the table
TIMESTAMP AS OF   | Time travel by time     | Read data as of a specific timestamp
RESTORE           | Revert to past version  | Atomic rollback to any version
MERGE             | Upsert / CDC            | Insert + update + delete in one operation
OPTIMIZE          | Compact small files     | Reduces file count, improves read speed
ZORDER BY         | Co-locate data          | Enables data skipping on filtered columns
VACUUM            | Remove old files        | Reclaims storage, limits time travel window

Useful Resources