Tags, Labels & Classification

Overview

Why This Matters

Tags provide flexible classification for metadata, such as PII labels, sensitivity tiers, deprecation warnings, and certification status. They can be applied to datasets, columns, dashboards, and other entities, and combined with policies for automated governance.

Core Concepts

Tags, labels, and classification are a core capability of DataHub's metadata platform. Understanding how they work helps you implement effective metadata management.

Configuration

DataHub supports both UI-based and API-based management of tags. Most settings can be managed through the admin panel or programmatically via GraphQL.

Integration

Tags work with DataHub's ingestion framework, search index, and event system, so changes propagate across the platform automatically.

Automation

Use DataHub Actions to automate tagging and classification workflows: trigger actions on metadata changes, schedule periodic checks, and integrate with external systems.
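To illustrate the "trigger actions on metadata changes" idea, here is a minimal sketch of the filtering step an automation might perform. It uses plain dictionaries rather than the Actions framework itself. The field names (`entityUrn`, `category`, `operation`, `modifier`) are an assumption modeled on DataHub's entity change events, and the `tag_additions` helper and the `analytics.customers` sample are hypothetical, so check the event schema of your DataHub version before relying on them.

```python
from typing import Dict, Iterable, List


def tag_additions(events: Iterable[Dict[str, str]], tag_urn: str) -> List[str]:
    """Return URNs of entities to which the given tag was just added."""
    return [
        event["entityUrn"]
        for event in events
        if event.get("category") == "TAG"
        and event.get("operation") == "ADD"
        and event.get("modifier") == tag_urn
    ]


# Hypothetical sample events; the field names are an assumption, not a schema.
REVENUE = "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.revenue,PROD)"
CUSTOMERS = "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.customers,PROD)"
events = [
    {"entityUrn": REVENUE, "category": "TAG", "operation": "ADD",
     "modifier": "urn:li:tag:pii"},
    {"entityUrn": REVENUE, "category": "TAG", "operation": "REMOVE",
     "modifier": "urn:li:tag:deprecated"},
    {"entityUrn": CUSTOMERS, "category": "OWNER", "operation": "ADD",
     "modifier": "urn:li:corpuser:jdoe"},
]

flagged = tag_additions(events, "urn:li:tag:pii")
print(flagged)
```

In a real deployment the returned URNs would feed whatever follow-up step you need, such as notifying a channel or opening a review ticket.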

Monitoring

Track usage and effectiveness through DataHub's analytics. Monitor adoption metrics, coverage, and compliance with organizational standards.
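One way to make "coverage" concrete is the share of datasets that carry at least one tag from the agreed taxonomy. The sketch below is a minimal illustration in plain Python. The `tag_coverage` helper and the sample inventory are hypothetical and stand in for whatever export of dataset-to-tag mappings you have; this is not a DataHub API.

```python
from typing import Dict, Iterable, Set


def tag_coverage(dataset_tags: Dict[str, Iterable[str]], taxonomy: Set[str]) -> float:
    """Return the fraction of datasets that carry at least one taxonomy tag."""
    if not dataset_tags:
        return 0.0
    covered = sum(
        1 for tags in dataset_tags.values() if taxonomy.intersection(tags)
    )
    return covered / len(dataset_tags)


# Hypothetical inventory: dataset name -> tags currently applied.
inventory = {
    "analytics.revenue": ["certified", "tier-1"],
    "analytics.customers": ["pii"],
    "staging.tmp_load": [],
    "staging.events_raw": ["scratch"],  # "scratch" is not in the taxonomy
}
taxonomy = {"pii", "deprecated", "certified", "tier-1"}

print(f"coverage: {tag_coverage(inventory, taxonomy):.0%}")
```

Tracking this number over time gives a simple adoption signal; counting only taxonomy tags keeps ad hoc labels from inflating it.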

How It Works

Configuration
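# Write an aspect via the DataHub CLI.
# Note: this example sets datasetProperties; tags are stored in the globalTags aspect.
# In current CLI versions, -d expects a path to a JSON file (file name here is illustrative).
echo '{"description": "Configured via CLI"}' > dataset_properties.json
datahub put --urn "urn:li:dataset:(...)" \
  --aspect "datasetProperties" \
  -d dataset_properties.json

# Or via the Python SDK
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter("http://localhost:8080")

# emit_mcp takes a single proposal object; the aspect name is inferred from the aspect class
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn="urn:li:dataset:(...)",
        aspect=DatasetPropertiesClass(description="Updated via SDK"),
    )
)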
Output
Successfully emitted metadata change proposal
  Entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.revenue,PROD)
  Aspect: datasetProperties
  Status: 200 OK
Key Takeaway: Tags are lightweight labels for classification (PII, Deprecated, Certified). Unlike glossary terms, tags have no definition or hierarchy -- they are simple markers used for filtering and triggering policies.
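Tags are addressed by URNs of the form `urn:li:tag:<name>` and are attached to an entity through its `globalTags` aspect. The sketch below builds the JSON for that aspect using only the standard library. The helper names are hypothetical, and the payload shape (a `tags` list of objects with a `tag` key) reflects the aspect as commonly documented; verify it against your DataHub version.

```python
import json
from typing import Iterable, List


def make_tag_urn(name: str) -> str:
    """Build a tag URN from a tag name (hypothetical helper)."""
    name = name.strip()
    if not name:
        raise ValueError("tag name must not be empty")
    return f"urn:li:tag:{name}"


def global_tags_aspect(names: Iterable[str]) -> dict:
    """Build a globalTags-shaped payload, de-duplicating while keeping order."""
    seen: List[str] = []
    for name in names:
        urn = make_tag_urn(name)
        if urn not in seen:
            seen.append(urn)
    return {"tags": [{"tag": urn} for urn in seen]}


payload = global_tags_aspect(["pii", "tier-1", "pii"])
print(json.dumps(payload, indent=2))
```

Keep in mind that writing a whole aspect replaces the tags already on the entity, so a payload like this should contain the full intended tag set rather than only the new tag.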

Architecture Integration

When a tag is added or removed, DataHub persists the change and emits a Metadata Change Log (MCL) event to Kafka; in older releases the equivalent events were the Metadata Change Event (MCE) and Metadata Audit Event (MAE). Downstream consumers update the search index (Elasticsearch) and the graph index, keeping all views consistent in near real time.

Hands-On Tutorial

Step-by-Step Setup
# Step 1: Verify DataHub is running
curl -s http://localhost:8080/config | python3 -m json.tool

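# Step 2: Add a tag to a dataset via GraphQL
# (assumes the tag urn:li:tag:pii already exists; add an Authorization header if token auth is enabled)
curl -X POST http://localhost:8080/api/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { addTag(input: { tagUrn: \"urn:li:tag:pii\", resourceUrn: \"urn:li:dataset:(...)\" }) }"}'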

# Step 3: Verify in the UI
# Navigate to http://localhost:9002 and check the entity page
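When the same call is made from a script, hand-escaping quotes inside a JSON string becomes error-prone. The sketch below builds the request body for a tag-assignment mutation with GraphQL variables instead. The mutation name `addTag` and the input type `TagAssociationInput` are taken from DataHub's GraphQL schema as commonly documented and should be confirmed against your instance; `build_add_tag_request` is a hypothetical helper, and the dataset URN is left as the same placeholder used above.

```python
import json

ADD_TAG_MUTATION = """
mutation addTag($input: TagAssociationInput!) {
  addTag(input: $input)
}
""".strip()


def build_add_tag_request(tag_urn: str, resource_urn: str) -> str:
    """Return the JSON body for a GraphQL addTag call (hypothetical helper)."""
    body = {
        "query": ADD_TAG_MUTATION,
        "variables": {"input": {"tagUrn": tag_urn, "resourceUrn": resource_urn}},
    }
    return json.dumps(body)


# Placeholder URN: substitute a real dataset URN from your instance.
request_body = build_add_tag_request("urn:li:tag:pii", "urn:li:dataset:(...)")
print(request_body)
```

The resulting string can be sent as the POST body to the GraphQL endpoint with any HTTP client; passing values as variables means `json.dumps` handles all escaping.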

Common Mistake

Wrong: Creating dozens of overlapping tags like pii, PII, personal-data, sensitive

Why it fails: Inconsistent tagging means searches miss results and policies have gaps.

Instead: Define a controlled tag taxonomy upfront (e.g., pii, deprecated, certified, tier-1) and enforce it via governance. Use glossary terms for concepts that need definitions.
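A controlled taxonomy is easier to enforce when free-form input is normalized before it is applied. The sketch below maps the overlapping variants from the example above onto canonical names. The alias table is an assumption for illustration (whether `sensitive` should really collapse into `pii` is a decision for your governance process), and `normalize_tag` is a hypothetical helper, not part of DataHub.

```python
from typing import Dict, Optional

# Hypothetical controlled taxonomy and the variants that map onto it.
CANONICAL_TAGS = {"pii", "deprecated", "certified", "tier-1"}
ALIASES: Dict[str, str] = {
    "personal-data": "pii",
    "sensitive": "pii",
}


def normalize_tag(raw: str) -> Optional[str]:
    """Map a free-form tag to its canonical name, or None if it is unknown."""
    key = raw.strip().lower().replace("_", "-").replace(" ", "-")
    key = ALIASES.get(key, key)
    return key if key in CANONICAL_TAGS else None


for raw in ["pii", "PII", "personal-data", "sensitive", "Tier 1", "scratch"]:
    print(f"{raw!r} -> {normalize_tag(raw)!r}")
```

Returning `None` for unknown tags, rather than passing them through, forces new labels to go through a taxonomy review instead of silently entering the catalog.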

Deep Dive: Tags vs Glossary Terms

Tags and glossary terms both classify metadata, but serve different purposes. Tags are simple labels without definitions -- use them for operational classification (PII, deprecated, tier-1). Glossary terms have rich definitions, relationships, and ownership -- use them for business concepts (Revenue, Customer, Churn). A dataset can have both: tagged as pii (operational) and linked to the Customer glossary term (business meaning). Policies can trigger on either.

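Key Takeaway: Combine tags with DataHub policies for automated governance. For example, a policy can restrict who may view or edit the metadata of datasets tagged pii to members of the Security group. Note that DataHub policies govern actions within DataHub; access to the underlying data is still enforced by the source system.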

Best Practices

Practice Problems

Practice 1

Design a tagging and classification strategy for a data team with 500 datasets across 8 databases. What do you prioritize, and how do you measure success?

Practice 2

A new data engineer joins your team and needs to understand tagging and classification in DataHub. Create a 30-minute onboarding guide covering the essentials.

Practice 3

Your organization's tagging adoption is at 30% after 3 months. Identify potential blockers and design a plan to accelerate adoption.

Quick Reference

| Feature | Access | Notes |
|---|---|---|
| UI Configuration | Settings → Tags, Labels & Classification | Point-and-click setup |
| GraphQL API | POST /api/graphql | Programmatic access |
| Python SDK | pip install acryl-datahub | High-level client |
| CLI | datahub put / datahub get | Command-line operations |
| Actions | Event-driven triggers | Automation framework |