Kafka Connect

Source and sink connectors for integrating databases, S3, Elasticsearch, and more.

Intermediate · 35 min read · 📨 Kafka

What is Kafka Connect?

Kafka Connect is a framework for moving data between Kafka and other systems without writing code. It provides pre-built connectors for databases, file systems, cloud services, and more. Instead of writing custom producer/consumer code for each integration, you configure a connector with JSON and it handles the rest.

Source Connectors

Read data FROM external systems INTO Kafka topics. Example: Debezium CDC reads every INSERT/UPDATE/DELETE from MySQL and publishes to Kafka.

Sink Connectors

Write data FROM Kafka topics TO external systems. Example: Elasticsearch sink writes Kafka messages into Elasticsearch for search.
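As a sketch of what a sink connector configuration might look like, here is a minimal Elasticsearch sink using the Confluent Elasticsearch connector. The topic name, Elasticsearch URL, and connector name are placeholders, not values from this tutorial:

```json
{
  "name": "orders-es-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "orders",
    "connection.url": "http://elasticsearch:9200",
    "key.ignore": "true",
    "schema.ignore": "true"
  }
}
```

Each Kafka message on the `orders` topic is indexed as a document in an Elasticsearch index of the same name; `key.ignore` and `schema.ignore` relax the requirements on message keys and schemas for simple JSON payloads.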

| Connector | Type | Use Case |
| --- | --- | --- |
| Debezium | Source | Change Data Capture from MySQL, PostgreSQL, MongoDB, SQL Server |
| JDBC Source | Source | Poll database tables for new/changed rows |
| S3 Sink | Sink | Write Kafka messages to S3 as Parquet/Avro/JSON files |
| Elasticsearch Sink | Sink | Index Kafka messages for search and analytics |
| JDBC Sink | Sink | Write Kafka messages to any JDBC database |
| BigQuery Sink | Sink | Stream data into Google BigQuery |
| FileStream | Both | Read/write files (testing and demos only) |
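For archival, a Confluent S3 sink configuration might look roughly like the sketch below. The bucket, region, topic, and connector name are placeholder assumptions:

```json
{
  "name": "events-s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "events",
    "s3.bucket.name": "my-archive-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "flush.size": "1000"
  }
}
```

`flush.size` controls how many records are buffered before a file is written to S3 — larger values mean fewer, bigger files, which is usually what downstream query engines prefer.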

Configuring a Connector

{
  "name": "mysql-cdc-source",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql-host",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "secret",
    "database.server.id": "1",
    "topic.prefix": "myapp",
    "database.include.list": "mydb",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes"
  }
}

# Deploy a connector via REST API
curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d @mysql-cdc-config.json

# List running connectors
curl http://localhost:8083/connectors

# Check connector status
curl http://localhost:8083/connectors/mysql-cdc-source/status

# Pause/Resume a connector
curl -X PUT http://localhost:8083/connectors/mysql-cdc-source/pause
curl -X PUT http://localhost:8083/connectors/mysql-cdc-source/resume

Key Takeaway: Use Kafka Connect instead of writing custom integration code. Debezium for CDC, S3 Sink for archival, JDBC Sink for database sync. Connect handles offset tracking, retries, and schema conversion automatically.

⚠️ Common Mistake: Using JDBC Source when you need CDC

JDBC Source polls the database on a schedule (e.g., every 10 seconds). It misses DELETE operations entirely and adds polling latency. Debezium CDC reads the database's binary log and captures every change in real time — inserts, updates, AND deletes. Use Debezium for real-time sync; use JDBC Source only for batch/periodic loads.
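For contrast with the Debezium config above, here is a sketch of a polling JDBC Source configuration using the Confluent JDBC connector. The connection URL, table columns, and connector name are illustrative assumptions; note that nothing here can observe a DELETE:

```json
{
  "name": "orders-jdbc-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://mysql-host:3306/mydb",
    "connection.user": "reader",
    "connection.password": "secret",
    "mode": "timestamp+incrementing",
    "timestamp.column.name": "updated_at",
    "incrementing.column.name": "id",
    "poll.interval.ms": "10000",
    "topic.prefix": "jdbc-"
  }
}
```

`mode=timestamp+incrementing` picks up new rows via the auto-increment `id` and changed rows via `updated_at` — but a deleted row simply stops appearing, which is exactly the gap Debezium closes.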

Kafka Connect Architecture
Source DB (MySQL/Postgres) → Source Connector (Debezium CDC) → Kafka Topics (event log) → Sink Connector (Elasticsearch) → Search Index (queryable)

Practice Exercises

Medium — Build a Mini Project

Combine concepts from this tutorial to build a small utility or tool.

Medium — Debug Challenge

Introduce a bug in one of the code examples and practice finding and fixing it.

Hard — Refactoring Exercise

Rewrite one example using a different approach and compare the tradeoffs.