Metadata Model & Entities

Medium 25 min read

Overview

The Heart of DataHub

DataHub's metadata model is based on the Entity-Aspect pattern developed at LinkedIn. Every piece of metadata is either an Entity (a thing you want to track) or an Aspect (a property of that thing). Understanding this model is essential for effective DataHub usage and customization.

Core Concepts

Entity Types

Entity        URN Prefix            Description
Dataset       urn:li:dataset        Tables, views, topics, files
Dashboard     urn:li:dashboard      BI dashboards (Looker, Tableau)
Chart         urn:li:chart          Individual visualizations
DataFlow      urn:li:dataFlow       Pipelines (Airflow DAGs)
DataJob       urn:li:dataJob        Pipeline tasks (Airflow tasks)
MLModel       urn:li:mlModel        ML models
GlossaryTerm  urn:li:glossaryTerm   Business vocabulary
Domain        urn:li:domain         Business domains
CorpUser      urn:li:corpuser       Users
CorpGroup     urn:li:corpGroup      Teams/groups

URN Format

URN Examples
# Dataset URN format (the platform is itself a URN):
# urn:li:dataset:(<platform URN>,<name>,<environment>)
urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.analytics.revenue,PROD)
urn:li:dataset:(urn:li:dataPlatform:bigquery,project.dataset.table,PROD)
urn:li:dataset:(urn:li:dataPlatform:kafka,events.user_clicks,PROD)

# Dashboard URN:
urn:li:dashboard:(looker,dashboards.42)

# Pipeline URNs (a dataJob URN embeds the URN of its parent dataFlow):
urn:li:dataFlow:(airflow,revenue_pipeline,PROD)
urn:li:dataJob:(urn:li:dataFlow:(airflow,revenue_pipeline,PROD),transform_task)
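To make the three-part structure of a dataset URN concrete, here is a minimal parsing sketch in plain Python. The function `parse_dataset_urn` and the `DatasetUrnParts` type are hypothetical helpers written for this tutorial, not part of the DataHub SDK. The regular expression assumes that names contain no parentheses and that the environment is an upper-case token, as in the examples above.

```python
import re
from typing import NamedTuple


class DatasetUrnParts(NamedTuple):
    platform: str
    name: str
    env: str


# Assumption: names contain no parentheses and the environment is upper-case.
_DATASET_URN = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:([^,()]+),([^()]+),([A-Z_]+)\)$"
)


def parse_dataset_urn(urn: str) -> DatasetUrnParts:
    """Split a dataset URN into (platform, name, env); raise ValueError if malformed."""
    match = _DATASET_URN.match(urn)
    if match is None:
        raise ValueError(f"not a dataset URN: {urn!r}")
    return DatasetUrnParts(*match.groups())


parts = parse_dataset_urn(
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.analytics.revenue,PROD)"
)
print(parts.platform, parts.name, parts.env)
```

A parser like this is mostly useful in tests and migration scripts, where you want to check that generated URNs carry the platform, name, and environment you expect.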

Common Mistake

Wrong: urn:li:dataset:(snowflake,mydb.table,PROD)

Why it fails: The platform component must be a platform URN, not a plain string. DataHub will reject this URN.

Instead: urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.table,PROD)
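One way to avoid this mistake is to never assemble dataset URNs by hand. The sketch below is a standalone stand-in written for illustration. `build_dataset_urn` and `has_platform_urn` are hypothetical names, and the Python SDK ships its own URN helpers in `datahub.emitter.mce_builder` that you would normally prefer.

```python
PLATFORM_PREFIX = "urn:li:dataPlatform:"


def build_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN, wrapping a bare platform name in a platform URN."""
    if not platform.startswith(PLATFORM_PREFIX):
        platform = PLATFORM_PREFIX + platform
    return f"urn:li:dataset:({platform},{name},{env})"


def has_platform_urn(dataset_urn: str) -> bool:
    """Return True if the first component of the dataset URN is a platform URN."""
    return dataset_urn.startswith("urn:li:dataset:(" + PLATFORM_PREFIX)


# The bare-string form from the "Wrong" example fails the check.
print(has_platform_urn("urn:li:dataset:(snowflake,mydb.table,PROD)"))
# The builder always produces the platform-URN form.
print(build_dataset_urn("snowflake", "mydb.table", "PROD"))
```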

How It Works

Aspects

Each entity has multiple aspects. Aspects are independently versioned and can be updated without affecting other aspects of the same entity.

Common Dataset Aspects
# SchemaMetadata — column definitions
{
  "fields": [
    { "fieldPath": "user_id", "type": "NUMBER", "description": "Primary key" },
    { "fieldPath": "email", "type": "STRING", "description": "User email" }
  ]
}

# Ownership — who owns this dataset
{
  "owners": [
    { "owner": "urn:li:corpuser:jane", "type": "DATAOWNER" }
  ]
}

# UpstreamLineage — what feeds into this dataset
{
  "upstreams": [
    { "dataset": "urn:li:dataset:(...raw_events...)", "type": "TRANSFORMED" }
  ]
}
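The independent versioning of aspects can be illustrated with a toy in-memory store. This is only a sketch of the idea and not DataHub's storage scheme: the `AspectStore` class, its method names, its version numbering, and the user `raj` are all assumptions made for the example.

```python
from collections import defaultdict


class AspectStore:
    """Toy store: each (entity URN, aspect name) pair keeps its own history."""

    def __init__(self) -> None:
        self._history = defaultdict(list)

    def write(self, urn: str, aspect_name: str, value: dict) -> int:
        """Append a new version of one aspect and return how many versions exist."""
        versions = self._history[(urn, aspect_name)]
        versions.append(value)
        return len(versions)

    def latest(self, urn: str, aspect_name: str) -> dict:
        """Return the most recently written value of one aspect."""
        return self._history[(urn, aspect_name)][-1]

    def version_count(self, urn: str, aspect_name: str) -> int:
        """Return the number of versions written for one aspect."""
        return len(self._history.get((urn, aspect_name), []))


store = AspectStore()
urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.users,PROD)"

store.write(urn, "schemaMetadata", {"fields": [{"fieldPath": "user_id"}]})
store.write(urn, "ownership", {"owners": [{"owner": "urn:li:corpuser:jane"}]})
# A second ownership write adds an ownership version only.
store.write(urn, "ownership", {"owners": [{"owner": "urn:li:corpuser:raj"}]})

print(store.version_count(urn, "ownership"))
print(store.version_count(urn, "schemaMetadata"))
```

Because the history is keyed by the pair of URN and aspect name, the second ownership write leaves the schema history untouched.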

Hands-On Tutorial

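Emit Metadata via Python SDK
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetPropertiesClass, OwnerClass, OwnershipClass, OwnershipTypeClass
)

# Connect to the GMS REST endpoint
emitter = DatahubRestEmitter("http://localhost:8080")

# Set dataset properties
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.users,PROD)"
props = DatasetPropertiesClass(
    description="Core users table with PII",
    customProperties={"team": "platform", "sla": "99.9%"}
)
# emit_mcp takes a MetadataChangeProposalWrapper; the aspect name
# ("datasetProperties") is inferred from the aspect class
emitter.emit_mcp(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=props))

# Set ownership
ownership = OwnershipClass(owners=[
    OwnerClass(owner="urn:li:corpuser:jane", type=OwnershipTypeClass.DATAOWNER)
])
emitter.emit_mcp(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=ownership))

# emit_mcp raises an exception on failure, so reaching this point
# means both aspects were written
print(f"Successfully emitted metadata for {dataset_urn}")
print("  Aspect: datasetProperties - OK")
print("  Aspect: ownership - OK")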

Output
Successfully emitted metadata for urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.users,PROD)
  Aspect: datasetProperties - OK
  Aspect: ownership - OK

Key Takeaway: Aspects are independently versioned, so updating ownership does not affect schema metadata or lineage. This means multiple teams can update different aspects of the same entity without conflicts.

Best Practices

Deep Dive: Custom Aspects vs Custom Properties

DataHub offers two ways to extend metadata: custom properties (key-value pairs in DatasetProperties) and custom aspects (new typed schemas). Use custom properties for simple metadata like team names, SLAs, or cost centers. Create custom aspects only when you need typed fields, validation, or search indexing on structured data. Custom aspects require defining a new model schema and deploying it to GMS, while custom properties work out of the box.
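The trade-off can be sketched in plain Python as an analogy. In DataHub itself a custom aspect is defined as a schema model rather than a Python class, so the `SlaAspect` dataclass, its fields, and the `sla_from_properties` helper below are hypothetical and only illustrate what typing and validation buy you over free-form string pairs.

```python
from dataclasses import dataclass

# Custom properties: free-form string pairs, nothing is validated.
custom_properties = {"team": "platform", "sla": "99.9%"}


@dataclass(frozen=True)
class SlaAspect:
    """Hypothetical typed aspect; the name and fields are illustrative only."""

    team: str
    availability_pct: float

    def __post_init__(self) -> None:
        # Validation that a plain string map cannot enforce.
        if not 0.0 <= self.availability_pct <= 100.0:
            raise ValueError("availability_pct must be between 0 and 100")


def sla_from_properties(props: dict) -> SlaAspect:
    """Convert loosely typed custom properties into the typed form."""
    return SlaAspect(
        team=props["team"],
        availability_pct=float(props["sla"].rstrip("%")),
    )


print(sla_from_properties(custom_properties))
```

With the string map, a value such as `"sla": "fast"` is accepted silently. The typed form rejects it at construction time, which is the kind of guarantee that justifies the extra deployment work of a custom aspect.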

Key Takeaway: Use consistent URN naming conventions and map your environments (DEV, STAGING, PROD) to DataHub fabric types. This prevents duplicate entities and ensures lineage connects correctly across environments.
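A small normalization step in your ingestion scripts is one way to enforce this mapping. The sketch below is an assumption-laden example: the internal environment names, the `to_fabric` helper, and the target values `DEV`, `STG`, and `PROD` are illustrative, so check the targets against the fabric types your DataHub version accepts.

```python
# Assumption: internal environment names on the left, fabric types on the right.
FABRIC_BY_ENV = {
    "dev": "DEV",
    "development": "DEV",
    "staging": "STG",
    "stage": "STG",
    "prod": "PROD",
    "production": "PROD",
}


def to_fabric(env_name: str) -> str:
    """Map an internal environment name to a single fabric type."""
    key = env_name.strip().lower()
    if key not in FABRIC_BY_ENV:
        raise ValueError(f"unmapped environment: {env_name!r}")
    return FABRIC_BY_ENV[key]


# Different spellings of the same environment collapse to one fabric type,
# so they cannot produce two URNs for the same table.
print(to_fabric("Production"), to_fabric("prod"))
```

Failing loudly on an unmapped name is a deliberate choice here: silently passing an unknown environment through is how duplicate entities tend to appear.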

Practice Problems

Practice 1

Design a URN scheme for a company that uses Snowflake (3 environments), Kafka (2 clusters), and Looker (1 instance). How do you ensure uniqueness?

Practice 2

A dataset has 500 columns. Only 10 are frequently queried. How would you add column-level metadata (popularity, descriptions) efficiently using the aspect model?

Quick Reference

Aspect             Entity Types   Contains
SchemaMetadata     Dataset        Columns, types, descriptions
Ownership          All            Owners and their roles
GlobalTags         All            Classification tags
GlossaryTerms      All            Business glossary associations
UpstreamLineage    Dataset        Upstream dataset dependencies
Status             All            Soft-delete flag (active or removed)
DatasetProperties  Dataset        Description, custom properties