Overview
The Heart of DataHub
DataHub's metadata model is based on the Entity-Aspect pattern developed at LinkedIn. Metadata is organized around Entities (the things you want to track, each identified by a URN) and Aspects (groups of related properties that describe an entity). Understanding this model is essential for using and customizing DataHub effectively.
Core Concepts
Entity Types
| Entity | URN Prefix | Description |
|---|---|---|
| Dataset | urn:li:dataset | Tables, views, topics, files |
| Dashboard | urn:li:dashboard | BI dashboards (Looker, Tableau) |
| Chart | urn:li:chart | Individual visualizations |
| DataFlow | urn:li:dataFlow | Pipelines (Airflow DAGs) |
| DataJob | urn:li:dataJob | Pipeline tasks (Airflow tasks) |
| MLModel | urn:li:mlModel | ML models |
| GlossaryTerm | urn:li:glossaryTerm | Business vocabulary |
| Domain | urn:li:domain | Business domains |
| CorpUser | urn:li:corpuser | Users |
| CorpGroup | urn:li:corpGroup | Teams/groups |
URN Format
# Dataset URN format:
# urn:li:dataset:(<platform URN>,<name>,<environment>)
urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.analytics.revenue,PROD)
urn:li:dataset:(urn:li:dataPlatform:bigquery,project.dataset.table,PROD)
urn:li:dataset:(urn:li:dataPlatform:kafka,events.user_clicks,PROD)
# Dashboard URN:
urn:li:dashboard:(looker,dashboards.42)
# Pipeline URNs (a DataJob URN embeds the URN of its parent DataFlow):
urn:li:dataFlow:(airflow,revenue_pipeline,PROD)
urn:li:dataJob:(urn:li:dataFlow:(airflow,revenue_pipeline,PROD),transform_task)
Common Mistake
Wrong: urn:li:dataset:(snowflake,mydb.table,PROD)
Why it fails: The platform component must be a platform URN, not a plain string. DataHub will reject this URN.
Instead: urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.table,PROD)
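The platform rule above can be checked mechanically. The following is a minimal sketch in plain Python, not part of the DataHub SDK: the helper names `make_dataset_urn_sketch` and `is_valid_dataset_urn` are hypothetical, and the regular expression assumes dataset names without commas or parentheses. In real code the SDK's own builder, `datahub.emitter.mce_builder.make_dataset_urn`, is the better choice.

```python
import re

# Hypothetical helpers for illustration only; not part of the DataHub SDK.
# Assumption: dataset names contain no commas or parentheses.
_DATASET_URN = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:([^,()]+),([^,()]+),([A-Z_]+)\)$"
)


def make_dataset_urn_sketch(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN, wrapping the bare platform name in a platform URN."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def is_valid_dataset_urn(urn: str) -> bool:
    """Return True only if the platform component is itself a dataPlatform URN."""
    return _DATASET_URN.match(urn) is not None


good = make_dataset_urn_sketch("snowflake", "mydb.table")
bad = "urn:li:dataset:(snowflake,mydb.table,PROD)"
print(good, is_valid_dataset_urn(good))
print(bad, is_valid_dataset_urn(bad))
```

Building URNs through a single helper, rather than formatting strings by hand at each call site, is one way to keep the plain-string mistake from creeping in.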
How It Works
Aspects
Each entity has multiple aspects. Aspects are independently versioned and can be updated without affecting other aspects of the same entity.
# SchemaMetadata: column definitions (simplified; the real aspect nests each type and also records nativeDataType)
{
  "fields": [
    { "fieldPath": "user_id", "type": "NUMBER", "description": "Primary key" },
    { "fieldPath": "email", "type": "STRING", "description": "User email" }
  ]
}
# Ownership: who owns this dataset
{
  "owners": [
    { "owner": "urn:li:corpuser:jane", "type": "DATAOWNER" }
  ]
}
# UpstreamLineage: what feeds into this dataset
{
  "upstreams": [
    { "dataset": "urn:li:dataset:(...raw_events...)", "type": "TRANSFORMED" }
  ]
}
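To make "independently versioned" concrete, here is a toy in-memory model in plain Python. It is only a sketch of the idea, not DataHub's storage layer: the `AspectStore` class is hypothetical, and its simple version counter does not reflect how DataHub numbers versions internally.

```python
from typing import Dict, List, Tuple


class AspectStore:
    """Toy store: one version history per (entity URN, aspect name) pair."""

    def __init__(self) -> None:
        self._history: Dict[Tuple[str, str], List[dict]] = {}

    def upsert(self, urn: str, aspect_name: str, value: dict) -> int:
        """Append a new version of one aspect; other aspects are untouched."""
        versions = self._history.setdefault((urn, aspect_name), [])
        versions.append(value)
        return len(versions)

    def latest(self, urn: str, aspect_name: str) -> dict:
        """Return the most recently written value of one aspect."""
        return self._history[(urn, aspect_name)][-1]

    def version_count(self, urn: str, aspect_name: str) -> int:
        """Return how many versions of one aspect have been written."""
        return len(self._history.get((urn, aspect_name), []))


store = AspectStore()
urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.users,PROD)"
store.upsert(urn, "ownership", {"owners": [{"owner": "urn:li:corpuser:jane", "type": "DATAOWNER"}]})
store.upsert(urn, "datasetProperties", {"description": "Users table"})
store.upsert(urn, "datasetProperties", {"description": "Core users table with PII"})

print(store.version_count(urn, "datasetProperties"))  # 2
print(store.version_count(urn, "ownership"))          # 1
```

Updating the description twice creates two versions of `datasetProperties` while `ownership` stays at its single version, which is the property that lets different ingestion sources write different aspects of the same entity without overwriting each other.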
Hands-On Tutorial
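from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetPropertiesClass,
    OwnerClass,
    OwnershipClass,
    OwnershipTypeClass,
)

# Connect to the DataHub GMS REST endpoint
emitter = DatahubRestEmitter("http://localhost:8080")
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.users,PROD)"

# Set dataset properties
props = DatasetPropertiesClass(
    description="Core users table with PII",
    customProperties={"team": "platform", "sla": "99.9%"},
)
# emit_mcp takes a MetadataChangeProposalWrapper; the aspect name is inferred from the aspect class
emitter.emit_mcp(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=props))

# Set ownership
ownership = OwnershipClass(owners=[
    OwnerClass(owner="urn:li:corpuser:jane", type=OwnershipTypeClass.DATAOWNER)
])
emitter.emit_mcp(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=ownership))

# emit_mcp raises on failure, so reaching this point means both writes succeeded
print(f"Successfully emitted metadata for {dataset_urn}")
print("Aspect: datasetProperties - OK")
print("Aspect: ownership - OK")

Expected output:
Successfully emitted metadata for urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.users,PROD)
Aspect: datasetProperties - OK
Aspect: ownership - OK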
Best Practices
- Use consistent URN naming conventions across your organization
- Map your environments to DataHub fabric types (for example DEV, STG, PROD); the fabric type for a staging environment is STG, not STAGING
- Define required aspects for each entity type (e.g., every dataset must have an owner)
- Use custom properties for org-specific metadata before creating custom aspects
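The "required aspects" practice can be enforced with a small check in an ingestion or CI step. The sketch below is plain Python under stated assumptions: the `REQUIRED_ASPECTS` policy and the helper names are hypothetical, and the set of aspects present on an entity is passed in by the caller rather than fetched from DataHub.

```python
from typing import Dict, Set

# Hypothetical policy: aspect names every entity of a given type must carry.
REQUIRED_ASPECTS: Dict[str, Set[str]] = {
    "dataset": {"ownership", "datasetProperties"},
    "dashboard": {"ownership"},
}


def entity_type(urn: str) -> str:
    """Extract the entity type from a URN such as 'urn:li:dataset:(...)'."""
    return urn.split(":", 3)[2]


def missing_aspects(urn: str, present: Set[str]) -> Set[str]:
    """Return the required aspects that the entity does not yet have."""
    return REQUIRED_ASPECTS.get(entity_type(urn), set()) - present


urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.users,PROD)"
print(sorted(missing_aspects(urn, {"datasetProperties", "schemaMetadata"})))  # ['ownership']
```

Entity types without an entry in the policy are treated as having no requirements, so the check can be rolled out one entity type at a time.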
Deep Dive: Custom Aspects vs Custom Properties
DataHub offers two ways to extend metadata: custom properties (key-value pairs in DatasetProperties) and custom aspects (new typed schemas). Use custom properties for simple metadata like team names, SLAs, or cost centers. Create custom aspects only when you need typed fields, validation, or search indexing on structured data. Custom aspects require code changes to GMS, while custom properties work out of the box.
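The trade-off can be illustrated in plain Python. This is a conceptual sketch only: `SlaAspect` and `from_custom_properties` are hypothetical names, and a real custom aspect would be defined as a DataHub schema rather than as a Python dataclass. The point is what typing and validation buy you compared with free-form string pairs.

```python
from dataclasses import dataclass

# Custom properties: free-form string-to-string pairs with no validation.
custom_properties = {"team": "platform", "sla": "99.9%"}


@dataclass(frozen=True)
class SlaAspect:
    """Hypothetical typed equivalent of the properties above."""

    team: str
    availability_pct: float

    def __post_init__(self) -> None:
        # Validation that a plain string map cannot express.
        if not 0.0 <= self.availability_pct <= 100.0:
            raise ValueError(f"availability_pct out of range: {self.availability_pct}")


def from_custom_properties(props: dict) -> SlaAspect:
    """Parse the loosely typed properties into the typed form."""
    return SlaAspect(
        team=props["team"],
        availability_pct=float(props["sla"].rstrip("%")),
    )


typed = from_custom_properties(custom_properties)
print(typed)
```

With custom properties, a value such as "150%" or "high" is stored without complaint; the typed form rejects it at write time, which is the kind of guarantee that justifies the extra cost of a custom aspect.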
Practice Problems
Practice 1
Design a URN scheme for a company that uses Snowflake (3 environments), Kafka (2 clusters), and Looker (1 instance). How do you ensure uniqueness?
Practice 2
A dataset has 500 columns. Only 10 are frequently queried. How would you add column-level metadata (popularity, descriptions) efficiently using the aspect model?
Quick Reference
| Aspect | Entity Types | Contains |
|---|---|---|
| SchemaMetadata | Dataset | Columns, types, descriptions |
| Ownership | Most | Owners and their roles |
| GlobalTags | Most | Classification tags |
| GlossaryTerms | Most | Business glossary associations |
| UpstreamLineage | Dataset | Data source dependencies |
| Status | All | Soft-delete flag (removed or active) |
| DatasetProperties | Dataset | Description, custom properties |