Kafka Connect

Source and sink connectors for integrating databases, S3, Elasticsearch, and more.

Intermediate · 35 min read · 📨 Kafka

What is Kafka Connect?

Kafka Connect is a framework for moving data between Kafka and other systems without writing code. It provides pre-built connectors for databases, file systems, cloud services, and more. Instead of writing custom producer/consumer code for each integration, you configure a connector with JSON and it handles the rest.

Source Connectors

Read data FROM external systems INTO Kafka topics. Example: Debezium CDC reads every INSERT/UPDATE/DELETE from MySQL and publishes to Kafka.

Sink Connectors

Write data FROM Kafka topics TO external systems. Example: Elasticsearch sink writes Kafka messages into Elasticsearch for search.
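As a sketch of what a sink connector configuration might look like, here is a minimal Elasticsearch sink using the Confluent Elasticsearch connector. The topic name, Elasticsearch URL, and connector name are placeholders, not values from this tutorial:

```json
{
  "name": "orders-es-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "orders",
    "connection.url": "http://elasticsearch:9200",
    "key.ignore": "true",
    "schema.ignore": "true"
  }
}
```

Each Kafka message on the `orders` topic is indexed as a document in an Elasticsearch index of the same name; `key.ignore` and `schema.ignore` relax the requirements on message keys and schemas for simple JSON payloads.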

| Connector | Type | Use Case |
| --- | --- | --- |
| Debezium | Source | Change Data Capture from MySQL, PostgreSQL, MongoDB, SQL Server |
| JDBC Source | Source | Poll database tables for new/changed rows |
| S3 Sink | Sink | Write Kafka messages to S3 as Parquet/Avro/JSON files |
| Elasticsearch Sink | Sink | Index Kafka messages for search and analytics |
| JDBC Sink | Sink | Write Kafka messages to any JDBC database |
| BigQuery Sink | Sink | Stream data into Google BigQuery |
| FileStream | Both | Read/write files (testing and demos only) |
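For archival, a Confluent S3 sink configuration might look roughly like the sketch below. The bucket, region, topic, and connector name are placeholder assumptions:

```json
{
  "name": "events-s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "events",
    "s3.bucket.name": "my-archive-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "flush.size": "1000"
  }
}
```

`flush.size` controls how many records are buffered before a file is written to S3 — larger values mean fewer, bigger files, which is usually what downstream query engines prefer.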

Configuring a Connector

{
  "name": "mysql-cdc-source",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql-host",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "secret",
    "database.server.id": "1",
    "topic.prefix": "myapp",
    "database.include.list": "mydb",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes"
  }
}

# Deploy a connector via REST API
curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d @mysql-cdc-config.json

# List running connectors
curl http://localhost:8083/connectors

# Check connector status
curl http://localhost:8083/connectors/mysql-cdc-source/status

# Pause/Resume a connector
curl -X PUT http://localhost:8083/connectors/mysql-cdc-source/pause
curl -X PUT http://localhost:8083/connectors/mysql-cdc-source/resume

Key Takeaway: Use Kafka Connect instead of writing custom integration code. Debezium for CDC, S3 Sink for archival, JDBC Sink for database sync. Connect handles offset tracking, retries, and schema conversion automatically.

⚠️ Common Mistake: Using JDBC Source when you need CDC

JDBC Source polls the database on a schedule (e.g., every 10 seconds). It misses DELETE operations entirely and adds polling latency. Debezium CDC reads the database's binary log and captures every change in real time — inserts, updates, AND deletes. Use Debezium for real-time sync; use JDBC Source only for batch/periodic loads.
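For contrast with the Debezium config above, here is a sketch of a polling JDBC Source configuration using the Confluent JDBC connector. The connection URL, table columns, and connector name are illustrative assumptions; note that nothing here can observe a DELETE:

```json
{
  "name": "orders-jdbc-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://mysql-host:3306/mydb",
    "connection.user": "reader",
    "connection.password": "secret",
    "mode": "timestamp+incrementing",
    "timestamp.column.name": "updated_at",
    "incrementing.column.name": "id",
    "poll.interval.ms": "10000",
    "topic.prefix": "jdbc-"
  }
}
```

`mode=timestamp+incrementing` picks up new rows via the auto-increment `id` and changed rows via `updated_at` — but a deleted row simply stops appearing, which is exactly the gap Debezium closes.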

Kafka Connect Architecture
Source DB (MySQL/Postgres) → Source Connector (Debezium CDC) → Kafka Topics (event log) → Sink Connector (Elasticsearch) → Search Index (queryable)

Practice Exercises

Medium — Build a Mini Project

Combine concepts from this tutorial to build a small utility or tool.

Medium — Debug Challenge

Introduce a bug in one of the code examples and practice finding and fixing it.

Hard — Refactoring Exercise

Rewrite one example using a different approach and compare the tradeoffs.