Custom Metadata & Extensions

Hard 25 min read

Overview

Why This Matters

DataHub is extensible: you can add custom properties to entities, create structured custom aspects, or define new entity types. This lets you capture organization-specific metadata such as cost centers, compliance levels, or data tier classifications.

Core Concepts

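DataHub offers three extension mechanisms, in increasing order of effort: custom properties (untyped key-value pairs on an entity), structured properties (typed fields defined through the UI or API), and custom aspects (full schemas that require GMS code changes). Knowing which one fits a given need is the core skill in this topic.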

Configuration

Custom metadata can be configured through the UI or through the API. Most settings can be managed in the admin panel or programmatically via GraphQL.

Integration

Custom metadata works with DataHub's ingestion framework, search index, and event system, so changes propagate across the platform automatically.

Automation

Use DataHub Actions to automate custom metadata workflows: trigger actions on metadata changes, schedule periodic checks, and integrate with external systems.
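
As a rough illustration of triggering on metadata changes, the sketch below filters change events by aspect name before running a callback. The event dictionaries use a simplified, assumed shape (keys `entityUrn`, `aspectName`, `changeType`), not the exact payload DataHub Actions delivers, and `notify_on_property_change` is a hypothetical helper rather than part of the Actions framework.

```python
from typing import Callable, Dict, Iterable, List

# Aspects we want to react to; the name matches the aspect used in this lesson.
WATCHED_ASPECTS = {"datasetProperties"}


def notify_on_property_change(
    events: Iterable[Dict[str, str]],
    callback: Callable[[Dict[str, str]], None],
) -> int:
    """Call `callback` for each event that touches a watched aspect.

    Returns the number of events that triggered the callback.
    """
    triggered = 0
    for event in events:
        if event.get("aspectName") in WATCHED_ASPECTS:
            callback(event)
            triggered += 1
    return triggered


# Hypothetical events shaped like simplified change records.
sample_events = [
    {"entityUrn": "urn:li:dataset:(...)", "aspectName": "datasetProperties", "changeType": "UPSERT"},
    {"entityUrn": "urn:li:dataset:(...)", "aspectName": "ownership", "changeType": "UPSERT"},
]

seen: List[str] = []
count = notify_on_property_change(sample_events, lambda e: seen.append(e["aspectName"]))
print(count, seen)
```

In a real deployment the callback would do the external work (post a message, open a ticket); keeping the filter separate from the side effect makes the rule easy to test.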

Monitoring

Track usage and effectiveness through DataHub's analytics. Monitor adoption metrics, coverage, and compliance with organizational standards.
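
One way to make "coverage" concrete is to measure the share of datasets that carry every required custom property. The sketch below assumes you have already exported each dataset's custom properties into a plain dictionary; the required keys and the sample data are hypothetical.

```python
from typing import Dict, List

REQUIRED_KEYS = ["cost_center", "data_tier"]  # hypothetical organizational standard


def coverage(datasets: Dict[str, Dict[str, str]], required: List[str]) -> float:
    """Fraction of datasets whose custom properties include every required key."""
    if not datasets:
        return 0.0
    compliant = sum(
        1 for props in datasets.values() if all(props.get(key) for key in required)
    )
    return compliant / len(datasets)


# Hypothetical snapshot: dataset name -> custom properties map.
snapshot = {
    "analytics.revenue": {"cost_center": "fin-01", "data_tier": "gold"},
    "analytics.events": {"cost_center": "eng-07"},
    "staging.tmp": {},
    "analytics.users": {"cost_center": "eng-07", "data_tier": "silver"},
}

ratio = coverage(snapshot, REQUIRED_KEYS)
print(f"{ratio:.0%}")
```

Tracking this ratio over time gives a simple adoption metric that can be compared against a target.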

How It Works

Configuration
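# Configure custom metadata via the DataHub CLI.
# `datahub put` expects -d to point at a JSON file containing the aspect body.
echo '{"description": "Configured via CLI"}' > dataset_properties.json
datahub put --urn "urn:li:dataset:(...)" \
  --aspect "datasetProperties" \
  -d dataset_properties.json

# Or via the Python SDK
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter("http://localhost:8080")

# emit_mcp takes a single MetadataChangeProposalWrapper, not keyword arguments.
# The aspect name is inferred from the aspect class.
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn="urn:li:dataset:(...)",
        aspect=DatasetPropertiesClass(description="Updated via SDK"),
    )
)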
Output (illustrative; exact wording depends on the client and version)
Successfully emitted metadata change proposal
  Entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.revenue,PROD)
  Aspect: datasetProperties
  Status: 200 OK
Key Takeaway: DataHub offers three extension levels: custom properties (key-value pairs, zero code), structured properties (typed fields, UI-configurable), and custom aspects (full schema, requires GMS code changes). Start simple and escalate only when needed.
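
To make the simplest level concrete, here is a minimal sketch of a `datasetProperties` body that carries custom properties. It uses only the standard library so it runs without the SDK; the property names (`cost_center`, `retention_days`) are hypothetical, and the URN follows the pattern shown in the output above.

```python
import json
from typing import Dict


def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN in the form urn:li:dataset:(platform URN, name, env)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def dataset_properties_payload(description: str, custom: Dict[str, object]) -> str:
    """Serialize a datasetProperties body with customProperties as strings."""
    # customProperties is a string-to-string map, so coerce every value to str.
    body = {
        "description": description,
        "customProperties": {key: str(value) for key, value in custom.items()},
    }
    return json.dumps(body, sort_keys=True)


urn = make_dataset_urn("snowflake", "analytics.revenue")
payload = dataset_properties_payload(
    "Revenue facts", {"cost_center": "fin-01", "retention_days": 365}
)
print(urn)
print(payload)
```

The SDK ships its own `make_dataset_urn` helper in `datahub.emitter.mce_builder`; the standalone version here exists only so the sketch has no dependencies. Because values are plain strings, anything that needs type checking is a candidate for a structured property instead.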

Architecture Integration

When custom metadata is updated, the write arrives as a Metadata Change Proposal (MCP). After persisting it, DataHub emits a Metadata Change Log (MCL) event to Kafka. Downstream consumers use that event to update the search index (Elasticsearch) and the graph index, keeping all views consistent in near real time.
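
The flow can be pictured with a toy in-memory model: a write appends a change event to a log, and a consumer drains the log into a search index and a graph index. Everything here is a stand-in (a `deque` instead of Kafka, dictionaries and sets instead of real indexes), and the `ChangeEvent` shape, the owner URN, and the `OwnedBy` edge label are illustrative assumptions.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Set, Tuple


@dataclass
class ChangeEvent:
    urn: str
    properties: Dict[str, str]
    owners: List[str] = field(default_factory=list)


# Stand-ins for the event log, the search index, and the graph index.
change_log: Deque[ChangeEvent] = deque()
search_index: Dict[str, Dict[str, str]] = {}
graph_index: Set[Tuple[str, str, str]] = set()


def write_aspect(event: ChangeEvent) -> None:
    """Simulate a write: record the change on the log for consumers."""
    change_log.append(event)


def run_consumers() -> int:
    """Drain the log, updating both indexes from each event."""
    processed = 0
    while change_log:
        event = change_log.popleft()
        search_index[event.urn] = dict(event.properties)  # searchable fields
        for owner in event.owners:  # relationship edges
            graph_index.add((event.urn, "OwnedBy", owner))
        processed += 1
    return processed


write_aspect(
    ChangeEvent("urn:li:dataset:(...)", {"data_tier": "gold"}, ["urn:li:corpuser:jdoe"])
)
processed = run_consumers()
print(processed, search_index, sorted(graph_index))
```

The point of the model is the decoupling: the writer never touches the indexes directly, which is why index updates are near real time rather than synchronous.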

Hands-On Tutorial

Step-by-Step Setup
# Step 1: Verify DataHub is running
curl -s http://localhost:8080/config | python3 -m json.tool

# Step 2: Configure custom metadata via GraphQL
# (updateDataset returns a Dataset, so the mutation needs a selection set)
curl -X POST http://localhost:8080/api/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { updateDataset(urn: \"urn:li:dataset:(...)\", input: {}) { urn } }"}'

# Step 3: Verify in the UI
# Navigate to http://localhost:9002 and check the entity page
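
The same GraphQL call can be prepared from Python. The sketch below builds the request with the standard library but does not send it, so it runs without a DataHub instance. The `token` parameter is an assumption for deployments with token authentication enabled, and the read-only `dataset` query is used here only as a harmless example.

```python
import json
import urllib.request
from typing import Optional


def build_graphql_request(
    url: str, query: str, token: Optional[str] = None
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a GraphQL query."""
    headers = {"Content-Type": "application/json"}
    if token:  # hypothetical access token, only needed when auth is enabled
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_graphql_request(
    "http://localhost:8080/api/graphql",
    'query { dataset(urn: "urn:li:dataset:(...)") { urn } }',
)
print(request.get_method(), request.full_url)
# To send it against a running instance: urllib.request.urlopen(request)
```

Letting `json.dumps` build the body avoids the hand-escaped quotes that the curl version needs.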

Common Mistake

Wrong: Creating custom aspects for every new metadata need

Why it fails: Custom aspects require GMS code changes, rebuilds, and redeployments. This creates maintenance burden and upgrade friction.

Instead: Use customProperties (key-value pairs in DatasetProperties) for simple metadata. Use structured properties (UI-defined) for typed fields. Reserve custom aspects for metadata that needs search indexing or complex validation.
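
The escalation rule above can be written down as a small decision helper. This is only a sketch of the reasoning; the two boolean inputs are a deliberate simplification, and `choose_extension_level` is a hypothetical function, not part of DataHub.

```python
from enum import Enum


class ExtensionLevel(Enum):
    CUSTOM_PROPERTY = "custom property"
    STRUCTURED_PROPERTY = "structured property"
    CUSTOM_ASPECT = "custom aspect"


def choose_extension_level(
    needs_typed_values: bool, needs_custom_schema: bool
) -> ExtensionLevel:
    """Pick the lightest mechanism that satisfies the stated needs."""
    if needs_custom_schema:
        # Complex validation or a bespoke schema: the only case worth GMS changes.
        return ExtensionLevel.CUSTOM_ASPECT
    if needs_typed_values:
        return ExtensionLevel.STRUCTURED_PROPERTY
    return ExtensionLevel.CUSTOM_PROPERTY


print(choose_extension_level(False, False).value)
print(choose_extension_level(True, False).value)
print(choose_extension_level(True, True).value)
```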

Deep Dive: Structured Properties

Structured properties are a middle ground between custom properties (untyped) and custom aspects (code changes required). Define them via the UI or API with a name, type (string, number, date, enum), and allowed entity types. They appear on entity pages, are searchable, and can have validation rules. Examples: cost center (enum), data retention days (number), compliance review date (date). This feature was added in DataHub v0.12+ and is the recommended approach for most extension needs.
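
To show what "typed, with allowed entity types and validation" buys you, the sketch below models a property definition and checks values against it. It is a local, hypothetical model written for illustration (the `PropertyDefinition` class and `validate` function are not DataHub's API), using the type names and examples from the paragraph above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence, Union

Value = Union[str, int, float, date]


@dataclass(frozen=True)
class PropertyDefinition:
    name: str
    value_type: str  # "string", "number", "date", or "enum"
    entity_types: Sequence[str]
    allowed_values: Optional[Sequence[str]] = None  # only used for "enum"


def validate(defn: PropertyDefinition, entity_type: str, value: Value) -> bool:
    """Return True if `value` is acceptable for `defn` on this entity type."""
    if entity_type not in defn.entity_types:
        return False
    if defn.value_type == "enum":
        return isinstance(value, str) and value in (defn.allowed_values or ())
    if defn.value_type == "number":
        # bool is a subclass of int, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if defn.value_type == "date":
        return isinstance(value, date)
    if defn.value_type == "string":
        return isinstance(value, str)
    return False


cost_center = PropertyDefinition("cost_center", "enum", ["dataset"], ["fin-01", "eng-07"])
retention = PropertyDefinition("data_retention_days", "number", ["dataset", "dashboard"])

print(validate(cost_center, "dataset", "fin-01"))
print(validate(cost_center, "dataset", "ops-99"))
print(validate(retention, "chart", 90))
```

With plain custom properties, all three values would be accepted as strings; the typed definition is what lets the second and third be rejected.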

Key Takeaway: Before building custom extensions, check whether existing aspects already cover your need. Ownership, tags, glossary terms, domains, and custom properties cover most metadata requirements without any customization.

Best Practices

Practice Problems

Practice 1

Design a custom metadata & extensions strategy for a data team with 500 datasets across 8 databases. What do you prioritize? How do you measure success?

Practice 2

A new data engineer joins your team and needs to understand custom metadata & extensions in DataHub. Create a 30-minute onboarding guide covering the essentials.

Practice 3

Your organization's custom metadata & extensions adoption is at 30% after 3 months. Identify potential blockers and design an adoption acceleration plan.

Quick Reference

Feature          | Access                                  | Notes
UI Configuration | Settings → Custom Metadata & Extensions | Point-and-click setup
GraphQL API      | POST /api/graphql                       | Programmatic access
Python SDK       | pip install acryl-datahub               | High-level client
CLI              | datahub put / datahub get               | Command-line operations
Actions          | Event-driven triggers                   | Automation framework