GraphQL & REST APIs

Medium 25 min read

Overview

API-First Platform

DataHub is built API-first. Everything you can do in the UI can be done via the GraphQL API (primary) or REST API (OpenAPI). This enables automation, CI/CD integration, and custom tooling.

Core Concepts

GraphQL API

Primary API for querying and mutating metadata. Supports search, entity CRUD, lineage traversal. Available at /api/graphql.

REST (OpenAPI)

RESTful endpoints for entity operations. Swagger docs at /openapi/swagger-ui.

Python SDK

High-level Python client wrapping both APIs. Install via pip install acryl-datahub.

Authentication

Token-based auth using Personal Access Tokens (PATs), plus OIDC. Each token is scoped to the permissions of the user it belongs to.
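As a rough illustration of token-based access, the sketch below builds (but does not send) a POST request to /api/graphql with a PAT attached. It assumes the token is passed as a standard Authorization: Bearer header and reuses the http://localhost:8080 server address from the tutorial further down; the helper name build_graphql_request and the placeholder token are hypothetical.

```python
import json
import urllib.request


def build_graphql_request(server: str, token: str, query: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated POST request for /api/graphql."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=server.rstrip("/") + "/api/graphql",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the PAT is sent as a bearer token.
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    request = build_graphql_request(
        "http://localhost:8080",
        "<your-PAT>",  # placeholder, never hard-code a real token
        '{ search(input: {type: DATASET, query: "revenue"}) { total } }',
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```

Reading the token from an environment variable or a CI secret store keeps it out of source control.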

How It Works

GraphQL Queries
# Search for datasets
query { search(input: { type: DATASET, query: "revenue", start: 0, count: 10 }) {
    total searchResults { entity { urn type ... on Dataset { name } } }
} }

# Get dataset with lineage
query { dataset(urn: "urn:li:dataset:(...)") {
    name properties { description }
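    # owner is a union (CorpUser or CorpGroup), so select urn through inline fragments
    ownership { owners { owner { ... on CorpUser { urn } ... on CorpGroup { urn } } } }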
    lineage(input: { direction: UPSTREAM, count: 10 }) { relationships { entity { urn } } }
} }

# Add a tag
mutation { addTag(input: { tagUrn: "urn:li:tag:PII", resourceUrn: "urn:li:dataset:(...)" }) }
Output
// Search response
{ "data": { "search": { "total": 42, "searchResults": [
  { "entity": { "urn": "urn:li:dataset:(...snowflake,analytics.revenue,PROD)", "type": "DATASET" } },
  { "entity": { "urn": "urn:li:dataset:(...bigquery,finance.revenue_daily,PROD)", "type": "DATASET" } }
] } } }

// Add tag response
{ "data": { "addTag": true } }
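A script usually needs only a few fields from a response like the one above. The following is a minimal sketch of pulling the URNs out of a search response, assuming the response has the shape shown in the Output panel and that failures arrive in a top-level "errors" list, as is conventional for GraphQL. The helper name extract_urns is hypothetical, and the sample URNs are copied as-is from the panel, including the elided parts.

```python
from typing import Any, Dict, List

# Sample copied from the Output panel; the "(..." parts are elided in the source.
SAMPLE_RESPONSE: Dict[str, Any] = {
    "data": {
        "search": {
            "total": 42,
            "searchResults": [
                {"entity": {"urn": "urn:li:dataset:(...snowflake,analytics.revenue,PROD)", "type": "DATASET"}},
                {"entity": {"urn": "urn:li:dataset:(...bigquery,finance.revenue_daily,PROD)", "type": "DATASET"}},
            ],
        }
    }
}


def extract_urns(response: Dict[str, Any]) -> List[str]:
    """Return the entity URNs from a search response, failing loudly on GraphQL errors."""
    if response.get("errors"):
        raise RuntimeError(f"GraphQL errors: {response['errors']}")
    results = response["data"]["search"]["searchResults"]
    return [item["entity"]["urn"] for item in results]


if __name__ == "__main__":
    for urn in extract_urns(SAMPLE_RESPONSE):
        print(urn)
```

Checking for "errors" first matters because a GraphQL server can return HTTP 200 even when the query itself failed.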

Hands-On Tutorial

Python SDK
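from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

# Connect to GMS; add token="<PAT>" to the config when authentication is enabled
graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))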
results = graph.execute_graphql("{ search(input: {type: DATASET, query: \"revenue\"}) { total } }")
print(results)
Output
{'search': {'total': 12}}
Key Takeaway: GraphQL is the primary API for DataHub. It supports search, entity CRUD, lineage traversal, and bulk operations. Use the Python SDK for scripting and the raw GraphQL endpoint for CI/CD integrations.
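For scripting, escaping quotes inside a query string gets awkward quickly. The sketch below passes the search text as a GraphQL variable instead. It assumes the graph object exposes execute_graphql(query, variables=...) and returns the contents of "data" as a dict, as in the Output above; the function name count_datasets and the query name searchTotal are hypothetical.

```python
from typing import Any, Dict

# The search text is supplied as a variable, so it never has to be escaped by hand.
SEARCH_TOTAL_QUERY = """
query searchTotal($text: String!) {
  search(input: { type: DATASET, query: $text, start: 0, count: 1 }) {
    total
  }
}
"""


def count_datasets(graph: Any, text: str) -> int:
    """Return how many datasets match `text`.

    `graph` is assumed to behave like DataHubGraph: it has an
    execute_graphql(query, variables=...) method returning the "data" dict.
    """
    result: Dict[str, Any] = graph.execute_graphql(
        SEARCH_TOTAL_QUERY, variables={"text": text}
    )
    return result["search"]["total"]
```

With a real client this would be called as count_datasets(graph, "revenue"), where graph is the DataHubGraph instance from the tutorial.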

Common Mistake

Wrong: Fetching all entities without pagination: count: 10000

Why it fails: Large result sets cause GMS timeouts and high memory usage. Requests over 1000 results are slow and may fail.

Instead: Use pagination with start and count (max 100 per page). Iterate through pages for bulk operations.
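The pagination loop can be sketched as a small generator. It assumes a caller-supplied fetch_page(start, count) function that runs the search query and returns the "search" object (with "total" and "searchResults"); both fetch_page and iter_search_results are hypothetical names, and the page size of 100 follows the guidance above.

```python
from typing import Any, Callable, Dict, Iterator

PAGE_SIZE = 100  # per-page limit recommended above


def iter_search_results(
    fetch_page: Callable[[int, int], Dict[str, Any]],
    page_size: int = PAGE_SIZE,
) -> Iterator[Dict[str, Any]]:
    """Yield every search result, requesting one page at a time."""
    start = 0
    while True:
        page = fetch_page(start, page_size)
        results = page["searchResults"]
        yield from results
        start += len(results)
        # Stop when the server returns nothing more or everything has been seen.
        if not results or start >= page["total"]:
            break


if __name__ == "__main__":
    # In-memory stand-in for the API, just to show the call pattern.
    data = [{"entity": {"urn": f"urn:li:dataset:example-{i}"}} for i in range(250)]

    def fake_fetch(start: int, count: int) -> Dict[str, Any]:
        return {"total": len(data), "searchResults": data[start:start + count]}

    print(sum(1 for _ in iter_search_results(fake_fetch)))
```

Because the generator is lazy, a consumer can stop early without fetching the remaining pages.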

Deep Dive: GraphQL vs REST API

DataHub supports both GraphQL and REST (OpenAPI). GraphQL is preferred because you can request exactly the fields you need in a single query, which reduces over-fetching. The REST API is useful for simple CRUD operations and for integrations that do not support GraphQL. For bulk metadata emission, use the Python SDK's DatahubRestEmitter, which batches requests automatically. The GraphQL API also supports introspection: use the playground at /api/graphiql to explore the available queries and mutations.
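Introspection also works from a script, not only from the playground. The sketch below uses the standard GraphQL __schema introspection fields to list operation names; it assumes the query is sent to /api/graphql by whatever client you already use and that the "data" part of the response is handed to the hypothetical helper list_operations.

```python
from typing import Any, Dict, List

# Standard GraphQL introspection: names of all top-level queries and mutations.
INTROSPECTION_QUERY = """
{
  __schema {
    queryType { fields { name } }
    mutationType { fields { name } }
  }
}
"""


def list_operations(data: Dict[str, Any]) -> Dict[str, List[str]]:
    """Turn the introspection result into sorted lists of operation names."""
    schema = data["__schema"]
    # mutationType is null for schemas that define no mutations.
    mutation_type = schema.get("mutationType") or {"fields": []}
    return {
        "queries": sorted(f["name"] for f in schema["queryType"]["fields"]),
        "mutations": sorted(f["name"] for f in mutation_type["fields"]),
    }
```

Diffing this list between DataHub upgrades is one simple way to notice added or removed operations before a CI/CD script breaks.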

Key Takeaway: Use Personal Access Tokens (PATs) for service accounts in CI/CD pipelines. Never embed user credentials. PATs inherit the permissions of the user who created them, so create a dedicated service account with appropriate policies.

Best Practices

Practice Problems

Practice 1

Write a script that finds all datasets without an owner and notifies via Slack.
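One possible starting point, not a full solution: the sketch below covers only the filtering and notification steps. It assumes you have already paged through search results that include each dataset's ownership { owners { ... } } block, and that notifications go to a Slack incoming webhook, which accepts a JSON body with a "text" field. The helper names and the webhook URL are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, List


def find_unowned(search_results: List[Dict[str, Any]]) -> List[str]:
    """Return URNs of entities whose ownership is missing or lists no owners."""
    unowned = []
    for item in search_results:
        entity = item["entity"]
        # ownership may be absent or null when no owner was ever assigned.
        owners = (entity.get("ownership") or {}).get("owners") or []
        if not owners:
            unowned.append(entity["urn"])
    return unowned


def build_slack_payload(urns: List[str]) -> Dict[str, str]:
    """Build the JSON body for a Slack incoming webhook."""
    lines = [f"{len(urns)} dataset(s) have no owner:"]
    lines += [f"- {urn}" for urn in urns]
    return {"text": "\n".join(lines)}


def notify_slack(webhook_url: str, payload: Dict[str, str]) -> None:
    """POST the payload to a Slack incoming webhook (URL supplied by the caller)."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        response.read()
```

The remaining work for the exercise is the search query itself and the pagination around it.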

Quick Reference

Endpoint | Method | Purpose
/api/graphql | POST | All metadata operations
/openapi/v2/entity | GET/POST | REST CRUD