GraphQL & REST APIs | LIZIU DataHub

Overview

API-First Platform

DataHub is built API-first. Everything you can do in the UI can be done via the GraphQL API (primary) or REST API (OpenAPI). This enables automation, CI/CD integration, and custom tooling.

Core Concepts

GraphQL API

Primary API for querying and mutating metadata. Supports search, entity CRUD, lineage traversal. Available at /api/graphql.

REST (OpenAPI)

RESTful endpoints for entity operations. Swagger docs at /openapi/swagger-ui.

Python SDK

High-level Python client wrapping both APIs. Install via pip install acryl-datahub.

Authentication

Token-based auth with Personal Access Tokens (PATs), plus OIDC. Tokens are scoped to the permissions of the user they belong to.
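As a rough illustration of token-based auth, the sketch below builds a POST request to the GraphQL endpoint and passes a PAT as a Bearer token in the Authorization header. The server address, the DATAHUB_TOKEN environment variable, and the helper name build_graphql_request are assumptions made for this example. It uses only the Python standard library and sends nothing until you call urlopen.

```python
import json
import os
import urllib.request


def build_graphql_request(server, token, query, variables=None):
    """Build (but do not send) an authenticated POST to <server>/api/graphql."""
    payload = {"query": query, "variables": variables or {}}
    return urllib.request.Request(
        url=server.rstrip("/") + "/api/graphql",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Hypothetical values: read the token from the environment, never hard-code it.
request = build_graphql_request(
    server="http://localhost:8080",
    token=os.environ.get("DATAHUB_TOKEN", "<your-PAT>"),
    query='{ search(input: {type: DATASET, query: "revenue"}) { total } }',
)
# To actually send it: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the auth header easy to inspect or unit test before the script touches a live server.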

How It Works

GraphQL Queries

# Search for datasets
query { search(input: { type: DATASET, query: "revenue", start: 0, count: 10 }) {
    total searchResults { entity { urn type ... on Dataset { name } } }
} }

# Get dataset with lineage
query { dataset(urn: "urn:li:dataset:(...)") {
    name properties { description }
    ownership { owners { owner { urn } } }
    lineage(input: { direction: UPSTREAM, count: 10 }) { relationships { entity { urn } } }
} }

# Add a tag
mutation { addTag(input: { tagUrn: "urn:li:tag:PII", resourceUrn: "urn:li:dataset:(...)" }) }

Output

// Search response
{ "data": { "search": { "total": 42, "searchResults": [
  { "entity": { "urn": "urn:li:dataset:(...snowflake,analytics.revenue,PROD)", "type": "DATASET" } },
  { "entity": { "urn": "urn:li:dataset:(...bigquery,finance.revenue_daily,PROD)", "type": "DATASET" } }
] } } }

// Add tag response
{ "data": { "addTag": true } }
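A script that consumes the search response usually needs only the total and the URNs. The sketch below pulls those out of a response body and raises if the server reported a top-level "errors" array, which is how GraphQL servers signal failures. The helper name extract_urns and the sample URNs are made up for illustration.

```python
import json


def extract_urns(response_text):
    """Return (total, urns) from a search response; raise on GraphQL errors."""
    body = json.loads(response_text)
    if body.get("errors"):
        raise RuntimeError(f"GraphQL errors: {body['errors']}")
    search = body["data"]["search"]
    urns = [hit["entity"]["urn"] for hit in search["searchResults"]]
    return search["total"], urns


# Hand-written sample with the same shape as the search response above.
sample = json.dumps({
    "data": {"search": {"total": 2, "searchResults": [
        {"entity": {"urn": "urn:li:dataset:example-a", "type": "DATASET"}},
        {"entity": {"urn": "urn:li:dataset:example-b", "type": "DATASET"}},
    ]}}
})
total, urns = extract_urns(sample)
print(total, urns)
```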

Hands-On Tutorial

Python SDK

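from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

# Connect to the DataHub metadata service; add token="<PAT>" to the config if auth is enabled
graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
# execute_graphql returns the contents of the response's "data" field as a dict
results = graph.execute_graphql("{ search(input: {type: DATASET, query: \"revenue\"}) { total } }")
print(results)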

Output

{'search': {'total': 12}}

Key Takeaway: GraphQL is the primary API for DataHub. It supports search, entity CRUD, lineage traversal, and bulk operations. Use the Python SDK for scripting and the raw GraphQL endpoint for CI/CD integrations.

Common Mistake

Wrong: Fetching all entities in a single request with no pagination, e.g. count: 10000

Why it fails: Large result sets cause GMS timeouts and high memory usage. Requests over 1000 results are slow and may fail.

Instead: Use pagination with start and count (max 100 per page). Iterate through pages for bulk operations.
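The pagination advice can be sketched as a small generator that advances start by the number of results received until it reaches total. The names iter_search_results and fetch_page are hypothetical. fetch_page stands for whatever function actually calls the API and returns the "search" object of the response. A fake version is used here so the example runs without a server.

```python
def iter_search_results(fetch_page, page_size=100):
    """Yield every search result by walking start/count pages.

    fetch_page(start, count) must return the "search" object of a response,
    i.e. a dict with "total" and "searchResults".
    """
    start = 0
    while True:
        page = fetch_page(start, page_size)
        results = page["searchResults"]
        yield from results
        start += len(results)
        # Stop on an empty page or once every reported result has been seen.
        if not results or start >= page["total"]:
            break


# Stand-in for a real API call so the sketch runs without a DataHub server.
FAKE_URNS = [f"urn:li:dataset:example-{i}" for i in range(250)]


def fake_fetch_page(start, count):
    window = FAKE_URNS[start:start + count]
    return {
        "total": len(FAKE_URNS),
        "searchResults": [{"entity": {"urn": urn}} for urn in window],
    }


all_urns = [r["entity"]["urn"] for r in iter_search_results(fake_fetch_page)]
print(len(all_urns))  # 250, fetched as pages of 100, 100, and 50
```

Because results are yielded one page at a time, the caller never holds more than one page of raw responses in memory.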

Deep Dive: GraphQL vs REST API

DataHub supports both GraphQL and REST (OpenAPI). GraphQL is preferred because you can request exactly the fields you need in a single query, which reduces over-fetching. The REST API is useful for simple CRUD operations and for integrations that do not support GraphQL. For bulk metadata emission, use the Python SDK's DatahubRestEmitter, which batches requests automatically. The GraphQL API also supports introspection: use the playground at /api/graphiql to explore the available queries and mutations.
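Introspection can also be scripted. The sketch below uses the standard GraphQL __schema introspection query and lists the query and mutation names from its result. The helper name list_operations and the trimmed sample result are assumptions for illustration. A real server returns far more fields.

```python
INTROSPECTION_QUERY = """
{
  __schema {
    queryType { fields { name } }
    mutationType { fields { name } }
  }
}
"""


def list_operations(introspection_data):
    """Return sorted (queries, mutations) from an introspection "data" object."""
    schema = introspection_data["__schema"]
    queries = sorted(f["name"] for f in schema["queryType"]["fields"])
    # mutationType is null for schemas that define no mutations.
    mutation_type = schema.get("mutationType") or {"fields": []}
    mutations = sorted(f["name"] for f in mutation_type["fields"])
    return queries, mutations


# Trimmed, hand-written sample of what the "data" field might contain.
sample_data = {
    "__schema": {
        "queryType": {"fields": [{"name": "search"}, {"name": "dataset"}]},
        "mutationType": {"fields": [{"name": "addTag"}]},
    }
}
queries, mutations = list_operations(sample_data)
print(queries, mutations)  # ['dataset', 'search'] ['addTag']
```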

Key Takeaway: Use Personal Access Tokens (PATs) for service accounts in CI/CD pipelines. Never embed user credentials. PATs inherit the permissions of the user who created them, so create a dedicated service account with appropriate policies.

Best Practices

Use GraphQL as primary API
Use Personal Access Tokens for service accounts
Implement pagination for search results
Use batch mutations for bulk operations
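To illustrate the batch-mutation advice, the sketch below groups resource URNs into chunks and builds one request payload per chunk instead of one addTag call per dataset. It assumes the server exposes a batchAddTags mutation whose input takes tagUrns and a resources list of resourceUrn objects. Confirm the exact name and input type against your server's schema before relying on it. The helper name and chunk size are hypothetical.

```python
# Assumed mutation name and input type; verify against your server's schema.
BATCH_ADD_TAGS = """
mutation batchAddTags($input: BatchAddTagsInput!) {
  batchAddTags(input: $input)
}
"""


def build_batch_tag_payloads(tag_urn, resource_urns, chunk_size=100):
    """Return one GraphQL payload per chunk of resource URNs."""
    payloads = []
    for i in range(0, len(resource_urns), chunk_size):
        chunk = resource_urns[i:i + chunk_size]
        payloads.append({
            "query": BATCH_ADD_TAGS,
            "variables": {"input": {
                "tagUrns": [tag_urn],
                "resources": [{"resourceUrn": urn} for urn in chunk],
            }},
        })
    return payloads


# Placeholder URNs for illustration only.
example_urns = [f"urn:li:dataset:example-{i}" for i in range(250)]
payloads = build_batch_tag_payloads("urn:li:tag:PII", example_urns)
print(len(payloads))  # 3 payloads: 100 + 100 + 50 resources
```

Each payload can then be POSTed to /api/graphql in turn, which keeps the number of round trips proportional to the number of chunks.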

Practice Problems

Practice 1

Write a script that finds all datasets without an owner and notifies via Slack.

Quick Reference

Endpoint	Method	Purpose
/api/graphql	POST	All metadata operations
/openapi/v2/entity	GET/POST	REST CRUD
