Overview
API-First Platform
DataHub is built API-first. Everything you can do in the UI can be done via the GraphQL API (primary) or REST API (OpenAPI). This enables automation, CI/CD integration, and custom tooling.
Core Concepts
GraphQL API
Primary API for querying and mutating metadata. Supports search, entity CRUD, lineage traversal. Available at /api/graphql.
REST (OpenAPI)
RESTful endpoints for entity operations. Swagger docs at /openapi/swagger-ui.
Python SDK
High-level Python client wrapping both APIs. Install via pip install acryl-datahub.
Authentication
Token-based auth (Personal Access Tokens, or PATs) and OIDC. Tokens are scoped to the permissions of the user who owns them.
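As a rough illustration of token-based auth, the sketch below builds (but does not send) a GraphQL request that carries a PAT as a bearer token. The helper name `build_graphql_request`, the server URL, and the token value are placeholders made up for this example; only the `/api/graphql` path comes from the text above.

```python
import json
import urllib.request


def build_graphql_request(server, token, query, variables=None):
    """Build a POST request for /api/graphql with a bearer token (hypothetical helper)."""
    body = json.dumps({"query": query, "variables": variables or {}}).encode("utf-8")
    return urllib.request.Request(
        url=server.rstrip("/") + "/api/graphql",
        data=body,
        headers={
            "Content-Type": "application/json",
            # The PAT is sent as a standard bearer token.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder values: substitute your own server and token.
request = build_graphql_request(
    "http://localhost:8080",
    "<your-personal-access-token>",
    '{ search(input: {type: DATASET, query: "revenue"}) { total } }',
)
# To send it, pass `request` to urllib.request.urlopen and json.load the response.
```

Keeping request construction separate from sending makes it easy to inspect the headers before any network call is made.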
How It Works
# Search for datasets
query { search(input: { type: DATASET, query: "revenue", start: 0, count: 10 }) {
total searchResults { entity { urn type ... on Dataset { name } } }
} }
# Get dataset with lineage
query { dataset(urn: "urn:li:dataset:(...)") {
name properties { description }
ownership { owners { owner { urn } } }
lineage(input: { direction: UPSTREAM, count: 10 }) { relationships { entity { urn } } }
} }
# Add a tag
mutation { addTag(input: { tagUrn: "urn:li:tag:PII", resourceUrn: "urn:li:dataset:(...)" }) }
// Search response
{ "data": { "search": { "total": 42, "searchResults": [
{ "entity": { "urn": "urn:li:dataset:(...snowflake,analytics.revenue,PROD)", "type": "DATASET" } },
{ "entity": { "urn": "urn:li:dataset:(...bigquery,finance.revenue_daily,PROD)", "type": "DATASET" } }
] } } }
// Add tag response
{ "data": { "addTag": true } }
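To show how a client might consume the search response above, here is a minimal sketch that pulls the total and the URNs out of the JSON. The function name `extract_urns` is hypothetical, and the check for a top-level `errors` key assumes the server follows the usual GraphQL response shape.

```python
import json


def extract_urns(response):
    """Return (total, urns) from a search response; raise if GraphQL reported errors."""
    if response.get("errors"):
        raise RuntimeError(f"GraphQL errors: {response['errors']}")
    search = response["data"]["search"]
    urns = [result["entity"]["urn"] for result in search["searchResults"]]
    return search["total"], urns


# The sample search response from above, with its elided URNs left as-is.
sample = json.loads("""
{ "data": { "search": { "total": 42, "searchResults": [
  { "entity": { "urn": "urn:li:dataset:(...snowflake,analytics.revenue,PROD)", "type": "DATASET" } },
  { "entity": { "urn": "urn:li:dataset:(...bigquery,finance.revenue_daily,PROD)", "type": "DATASET" } }
] } } }
""")

total, urns = extract_urns(sample)
print(total, len(urns))  # 42 2
```

Note that `total` counts all matches on the server, while `searchResults` holds only the current page.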
Hands-On Tutorial
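from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
# DataHubGraph expects a DatahubClientConfig object rather than a plain dict
graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
# execute_graphql returns the contents of the response's "data" field
results = graph.execute_graphql('{ search(input: {type: DATASET, query: "revenue"}) { total } }')
print(results)
# Output: {'search': {'total': 12}}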
Common Mistake
Wrong: Fetching all entities in a single request without pagination, for example count: 10000
Why it fails: Large result sets cause GMS timeouts and high memory usage. Requests over 1000 results are slow and may fail.
Instead: Use pagination with start and count (max 100 per page). Iterate through pages for bulk operations.
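The paging loop described above can be sketched as follows. `iter_search_results` and `fetch_page` are hypothetical names: `fetch_page(start, count)` stands for whatever function you write to run the search query and return its `search` object. A fake in-memory implementation is used here so the example runs without a server.

```python
def iter_search_results(fetch_page, page_size=100):
    """Yield every search result, requesting page_size results at a time.

    fetch_page(start, count) is a caller-supplied function that runs the
    search query and returns its "search" object, i.e. a dict with
    "total" and "searchResults".
    """
    start = 0
    while True:
        page = fetch_page(start, page_size)
        results = page["searchResults"]
        yield from results
        start += len(results)
        # Stop when the server returns nothing or every result has been seen.
        if not results or start >= page["total"]:
            break


# Stand-in for a real API call: serves slices of 250 fake results.
FAKE = [{"entity": {"urn": f"urn:li:dataset:fake-{i}"}} for i in range(250)]


def fake_fetch_page(start, count):
    return {"total": len(FAKE), "searchResults": FAKE[start:start + count]}


all_results = list(iter_search_results(fake_fetch_page))
print(len(all_results))  # 250
```

Because the function is a generator, callers can process results page by page instead of holding the full result set in memory.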
Deep Dive: GraphQL vs REST API
DataHub supports both GraphQL and REST (OpenAPI). GraphQL is preferred because you can request exactly the fields you need in a single query, which reduces over-fetching. The REST API is useful for simple CRUD operations and for integrations that do not support GraphQL. For bulk metadata emission, use the Python SDK's DatahubRestEmitter, which batches requests automatically. The GraphQL API also supports introspection: use the playground at /api/graphiql to explore the available queries and mutations.
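Introspection can also be done programmatically. The sketch below uses the standard GraphQL `__schema` introspection query and a hypothetical helper, `list_operations`, to list the query and mutation names a server exposes. The example reply is hand-written from operations shown earlier in this section, not captured from a real server.

```python
INTROSPECTION_QUERY = """
{ __schema {
    queryType { fields { name } }
    mutationType { fields { name } }
} }
"""


def list_operations(data):
    """Return (queries, mutations) as sorted name lists from an introspection result."""
    schema = data["__schema"]
    queries = sorted(f["name"] for f in schema["queryType"]["fields"])
    mutation_fields = (schema.get("mutationType") or {}).get("fields") or []
    mutations = sorted(f["name"] for f in mutation_fields)
    return queries, mutations


# Hand-written stand-in for the "data" part of a server reply.
example = {"__schema": {
    "queryType": {"fields": [{"name": "search"}, {"name": "dataset"}]},
    "mutationType": {"fields": [{"name": "addTag"}]},
}}
print(list_operations(example))  # (['dataset', 'search'], ['addTag'])
```

Against a live instance, the `data` argument would be the `data` field of the response to `INTROSPECTION_QUERY`.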
Best Practices
- Use GraphQL as primary API
- Use Personal Access Tokens for service accounts
- Implement pagination for search results
- Use batch mutations for bulk operations
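As a sketch of the batch-mutation practice, the code below splits a list of resource URNs into chunks and builds one set of GraphQL variables per chunk. It assumes your DataHub version exposes a `batchAddTags` mutation with a `BatchAddTagsInput` of `tagUrns` and `resources`; check the schema in the playground before relying on it. The helper name, the default batch size, and the placeholder URNs are made up for illustration.

```python
# Assumed mutation shape; verify against your server's schema.
BATCH_ADD_TAGS = """
mutation batchAddTags($input: BatchAddTagsInput!) {
  batchAddTags(input: $input)
}
"""


def batch_tag_variables(tag_urn, resource_urns, batch_size=100):
    """Yield one GraphQL variables dict per batch of resource URNs."""
    for i in range(0, len(resource_urns), batch_size):
        chunk = resource_urns[i:i + batch_size]
        yield {"input": {
            "tagUrns": [tag_urn],
            "resources": [{"resourceUrn": urn} for urn in chunk],
        }}


# Placeholder URNs for illustration only.
urns = [f"urn:li:dataset:placeholder-{i}" for i in range(5)]
batches = list(batch_tag_variables("urn:li:tag:PII", urns, batch_size=2))
print(len(batches))  # 3
```

Each yielded dict would be sent as the `variables` of one `BATCH_ADD_TAGS` request, so five resources with a batch size of two take three requests instead of five.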
Practice Problems
Practice 1
Write a script that finds all datasets without an owner and notifies via Slack.
Quick Reference
| Endpoint | Method | Purpose |
|---|---|---|
| /api/graphql | POST | All metadata operations |
| /openapi/v2/entity | GET/POST | REST CRUD |