Unity Catalog & Data Governance

Medium30 min read

Unity Catalog Overview

Why This Matters

The Problem: Without centralized governance, organizations face data silos, inconsistent permissions, no audit trail, and no visibility into how data flows through their platform.

The Solution: Unity Catalog provides a single governance layer across all Databricks workspaces -- unified permissions, data lineage, auditing, and data discovery in one place.

Real Impact: Unity Catalog enables organizations to comply with GDPR, HIPAA, and SOC 2 while giving data teams self-service access to the data they need.

Real-World Analogy

Think of Unity Catalog as a library catalog system:

  • Metastore = The entire library system (spans multiple branches/workspaces)
  • Catalog = A section of the library (e.g., "Science", "History")
  • Schema = A shelf within a section (e.g., "Physics", "Biology")
  • Table/View = An individual book on the shelf
  • Permissions = Library cards that control who can access what
Unity Catalog object model diagram showing the hierarchy from metastore to catalog to schema to tables, views, volumes, and functions
The Unity Catalog object model: a metastore contains catalogs, which contain schemas, which contain tables, views, volumes, and functions

Three-Level Namespace

Unity Catalog organizes all data assets into a three-level hierarchy: catalog.schema.table. This replaces the old Hive metastore's two-level database.table pattern and adds a top-level catalog for multi-environment or multi-team organization.

Unity Catalog catalogs view showing multiple catalogs organized by environment and team
Browsing catalogs in Unity Catalog -- catalogs are typically organized by environment (production, staging, sandbox) or by team/domain
Unity Catalog storage hierarchy showing how managed and external storage is organized
The Unity Catalog storage hierarchy showing the relationship between metastore storage, catalog storage, and schema storage locations
Unity Catalog Namespace Hierarchy
Metastore production staging sandbox sales analytics orders customers daily_report user_metrics Reference Path production.sales.orders catalog . schema . table SELECT * FROM production.sales.orders USE CATALOG production; USE sales; Metastore is per-region. One metastore can serve multiple workspaces. Catalogs organize by environment, team, or domain.
SQL - Creating the Namespace
Key Takeaway: Unity Catalog uses a three-level namespace (catalog.schema.table) for all data objects. Organize catalogs by environment (production, staging, sandbox) or by team/domain. One metastore spans multiple workspaces for unified governance.
-- Create a catalog
CREATE CATALOG IF NOT EXISTS production;

-- Create schemas within the catalog
CREATE SCHEMA IF NOT EXISTS production.sales;
CREATE SCHEMA IF NOT EXISTS production.analytics;
CREATE SCHEMA IF NOT EXISTS production.ml_features;

-- Create a managed table
CREATE TABLE production.sales.orders (
    order_id BIGINT,
    customer_id BIGINT,
    product STRING,
    amount DECIMAL(12,2),
    order_date DATE
) USING DELTA
COMMENT 'Production order data from e-commerce platform';

-- Create a view
CREATE VIEW production.analytics.daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM production.sales.orders
GROUP BY order_date;

-- Set default catalog and schema
USE CATALOG production;
USE SCHEMA sales;

-- Now you can reference just by table name
SELECT * FROM orders LIMIT 10;
Output
OK: Catalog 'production' created.
OK: Schema 'production.sales' created.
OK: Schema 'production.analytics' created.

+--------+-----------+--------+-------+----------+
|order_id|customer_id| product| amount|order_date|
+--------+-----------+--------+-------+----------+
|       1|       1001|Widget A| 125.50|2024-01-15|
|       2|       1002|Widget B|  89.99|2024-01-15|
+--------+-----------+--------+-------+----------+

GRANT & REVOKE Permissions

Unity Catalog uses a hierarchical permission model. Permissions granted at the catalog level cascade to all schemas and tables within it. You can grant privileges to users, groups, or service principals.

Granting a privilege to a user using the Catalog Explorer UI with the Grant dialog open
Granting privileges using Catalog Explorer -- the UI lets you select the principal, privilege type, and securable object
SQL - Permission Management
-- Grant catalog-level permissions
GRANT USE CATALOG ON CATALOG production TO `[email protected]`;
GRANT CREATE SCHEMA ON CATALOG production TO `[email protected]`;

-- Grant schema-level permissions
GRANT USE SCHEMA ON SCHEMA production.sales TO `[email protected]`;
GRANT SELECT ON SCHEMA production.sales TO `[email protected]`;

-- Grant table-level permissions
GRANT SELECT ON TABLE production.sales.orders TO `[email protected]`;
GRANT MODIFY ON TABLE production.sales.orders TO `etl-service-principal`;
GRANT ALL PRIVILEGES ON TABLE production.sales.orders TO `[email protected]`;

-- Revoke permissions
REVOKE SELECT ON TABLE production.sales.orders FROM `[email protected]`;

-- View grants on a table
SHOW GRANTS ON TABLE production.sales.orders;

-- View grants for a specific principal
SHOW GRANTS TO `[email protected]`;

-- Row-level security with row filters
ALTER TABLE production.sales.orders
SET ROW FILTER region_filter ON (region);

-- Column masking for sensitive data
ALTER TABLE production.sales.customers
ALTER COLUMN email SET MASK mask_email;
Output
+-----------------------------+---------------+---------------------------+
|                    principal|      privilege|              object        |
+-----------------------------+---------------+---------------------------+
|[email protected]   |ALL PRIVILEGES  |TABLE production.sales.orders|
|[email protected]   |SELECT          |TABLE production.sales.orders|
|[email protected]            |SELECT          |TABLE production.sales.orders|
+-----------------------------+---------------+---------------------------+
PrivilegeApplies ToDescription
USE CATALOGCatalogAccess objects within the catalog
USE SCHEMASchemaAccess objects within the schema
SELECTTable, ViewRead data from the object
MODIFYTableINSERT, UPDATE, DELETE, MERGE
CREATE TABLESchemaCreate new tables in the schema
CREATE SCHEMACatalogCreate new schemas in the catalog
ALL PRIVILEGESAnyGrant all applicable permissions
OWNERSHIPAnyFull control including GRANT/REVOKE

Data Lineage

Key Takeaway: Grant permissions at the highest logical level (catalog or schema) and they cascade down. Use groups instead of individual users for maintainability. Row filters and column masks add fine-grained security without duplicating tables.

Unity Catalog automatically tracks column-level lineage -- how data flows from source tables through transformations to downstream tables and dashboards. This is critical for impact analysis, debugging, and compliance.

Python - Exploring Lineage via API
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Get lineage for a specific table
lineage = w.data_lineage.get_table_lineage(
    table_name="production.analytics.daily_revenue"
)

# Show upstream dependencies
print("Upstream tables:")
for dep in lineage.upstream_tables:
    print(f"  - {dep.catalog_name}.{dep.schema_name}.{dep.name}")

# Show downstream consumers
print("Downstream tables:")
for dep in lineage.downstream_tables:
    print(f"  - {dep.catalog_name}.{dep.schema_name}.{dep.name}")

# Column-level lineage
col_lineage = w.data_lineage.get_column_lineage(
    table_name="production.analytics.daily_revenue",
    column_name="revenue"
)
for col in col_lineage.upstream_columns:
    print(f"  revenue derived from: {col.table_name}.{col.name}")

External Locations

Common Mistake

Wrong: GRANT ALL PRIVILEGES ON CATALOG production TO `all-users`

Why it fails: Granting ALL PRIVILEGES at the catalog level gives every user the ability to create, modify, and delete any table in any schema. This violates the principle of least privilege and creates compliance risks.

Instead: Grant USE CATALOG broadly, then grant SELECT/MODIFY at the schema or table level based on role. Use groups (data-engineers, analysts, data-scientists) and grant permissions to groups, not individuals.

External locations connect Unity Catalog to cloud storage paths outside the managed metastore storage. This lets you govern data that already exists in S3, ADLS, or GCS without moving it.

External locations overview showing cloud storage paths mapped to Unity Catalog
External locations map cloud storage paths (S3, ADLS, GCS) to Unity Catalog, enabling governance over data without moving it
SQL - External Locations and Tables
-- Create a storage credential (links to cloud IAM)
CREATE STORAGE CREDENTIAL my_adls_credential
WITH MANAGED IDENTITY '<managed-identity-id>';

-- Create an external location
CREATE EXTERNAL LOCATION raw_data_lake
URL 'abfss://[email protected]/'
WITH (STORAGE CREDENTIAL my_adls_credential)
COMMENT 'Raw data lake in ADLS Gen2';

-- Grant access to the external location
GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION raw_data_lake
TO `[email protected]`;

-- Create an external table pointing to existing data
CREATE TABLE production.sales.legacy_orders
USING DELTA
LOCATION 'abfss://[email protected]/orders/';

-- List external locations
SHOW EXTERNAL LOCATIONS;

Delta Sharing

Delta Sharing is an open protocol for secure, real-time data sharing across organizations, clouds, and platforms -- without copying data. Recipients can be Databricks users or use any client that supports the protocol (pandas, Spark, Power BI, Tableau).

SQL - Delta Sharing Setup
-- Create a share (a named collection of tables)
CREATE SHARE customer_analytics;

-- Add tables to the share
ALTER SHARE customer_analytics
ADD TABLE production.analytics.daily_revenue;

ALTER SHARE customer_analytics
ADD TABLE production.sales.orders
PARTITION (order_date >= '2024-01-01');  -- Share only recent data

-- Create a recipient (external organization)
CREATE RECIPIENT partner_company
USING ID 'partner-sharing-id';

-- Grant the share to the recipient
GRANT SELECT ON SHARE customer_analytics TO RECIPIENT partner_company;

-- View share details
SHOW ALL IN SHARE customer_analytics;

-- As a recipient: create a catalog from a share
CREATE CATALOG partner_data
USING SHARE provider_org.customer_analytics;

Practice Problems

Deep Dive: Managed vs External Tables in Unity Catalog

Unity Catalog supports two table types: managed and external. Managed tables store data in the metastore's managed storage location. When you drop a managed table, both the metadata AND the underlying data files are deleted. External tables point to data in a customer-specified location (via external locations). Dropping an external table only removes the metadata -- the data files remain. Use managed tables for new data that Unity Catalog fully controls. Use external tables for data that already exists in cloud storage or that needs to be accessed by systems outside Databricks. In both cases, Unity Catalog governs access through the same GRANT/REVOKE permission model.

Problem 1: Design a Multi-Team Governance Model

Medium

Your company has three teams: data engineering (builds pipelines), data science (trains models), and analytics (builds dashboards). Design a Unity Catalog structure with appropriate catalogs, schemas, and permissions so that: (1) Data engineers can write to all schemas, (2) Data scientists can read all data and write to ML feature tables, (3) Analysts can only read from curated gold tables.

Problem 2: Set Up Cross-Organization Data Sharing

Medium

Your company needs to share daily revenue data with a partner organization. Set up Delta Sharing so that the partner can only see revenue data from the last 90 days, with no access to customer PII.

Problem 3: Audit and Lineage Investigation

Hard

Your compliance team reports that customer PII might be exposed in a downstream analytics dashboard. Use Unity Catalog lineage and audit logs to: (1) Trace which tables feed into the dashboard, (2) Identify if any PII columns are included, (3) Add column masking to protect the data.

Quick Reference

ConceptCommandKey Detail
Create CatalogCREATE CATALOG nameTop-level container for schemas
Create SchemaCREATE SCHEMA catalog.nameContainer for tables and views
Grant PermissionsGRANT priv ON obj TO principalCascading permission model
Revoke PermissionsREVOKE priv ON obj FROM principalRemove specific access
External LocationCREATE EXTERNAL LOCATIONConnect cloud storage to UC
Delta SharingCREATE SHARE + GRANT TO RECIPIENTCross-org data sharing
Column MaskingALTER COLUMN SET MASK funcDynamic data masking
Row FiltersALTER TABLE SET ROW FILTERRow-level security

Useful Resources