Kafka Streams

Stateful stream processing with KStream, KTable, windowing, and joins.

Intermediate 40 min read 📨 Kafka

What is Kafka Streams?

Kafka Streams is a client library for building real-time stream processing applications. Unlike Spark Streaming or Flink, it doesn't require a separate cluster — your stream processing logic runs inside your application as a regular Java/Kotlin dependency. It reads from Kafka topics, processes data, and writes results back to Kafka topics.

No separate infrastructure needed

Kafka Streams is just a library in your application. No cluster to manage, no job scheduler, no special deployment. Scale by adding more instances of your application. Kafka handles partition assignment automatically.
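To make that concrete, here is a minimal sketch of a complete Kafka Streams application. The application id, broker address, and topic names are placeholders for illustration; only the Properties keys and the KafkaStreams lifecycle calls are real API.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class UppercaseApp {
    public static void main(String[] args) {
        // Minimal configuration: an application id (used for the consumer group
        // and state-store naming) plus the broker address.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");      // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Define the topology: read a topic, transform, write back to Kafka.
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input-topic")   // placeholder topic names
               .mapValues(v -> v.toUpperCase())
               .to("output-topic");

        // The processing runs inside this JVM; no cluster submission step.
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Starting a second copy of this same program with the same application.id makes Kafka rebalance the input partitions across the two instances; that is the whole scaling story.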

KStream vs KTable

Kafka Streams has two core abstractions:

| Concept | Model | Think of it as | Example |
| --- | --- | --- | --- |
| KStream | Event stream | Append-only log of facts | Page views, clicks, orders placed |
| KTable | Changelog | Latest value per key (like a DB table) | User profiles, product inventory, account balance |

A KStream treats every record as an independent event. A click at 10:00 and a click at 10:05 are two separate events. A KTable treats records as updates — if user Alice's balance changes from $100 to $150, the KTable only keeps $150.
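The distinction shows up directly in how you read a topic. A sketch, with placeholder topic names:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

// KStream: every record is an independent event; nothing is overwritten.
KStream<String, String> clicks = builder.stream("clicks");

// KTable: records are upserts per key. A later balance for "alice"
// replaces the earlier one, and a record with a null value deletes the key.
KTable<String, String> balances = builder.table("account-balances");
```

Reading the same topic as a KStream or a KTable changes only the interpretation, not the stored data.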

Stream Operations

Stateless Operations

StreamsBuilder builder = new StreamsBuilder();

KStream<String, String> events = builder.stream("raw-events");

// Filter: keep only purchase events
KStream<String, String> purchases = events
    .filter((key, value) -> value.contains("purchase"));

// Map: transform values
KStream<String, String> enriched = purchases
    .mapValues(value -> addTimestamp(value));

// Branch: split the stream by condition. KStream#branch is deprecated
// since Kafka 2.8; split() is the current API.
Map<String, KStream<String, String>> branches = events
    .split(Named.as("split-"))
    .branch((key, value) -> value.contains("error"), Branched.as("errors"))
    .defaultBranch(Branched.as("rest"));

// Write results to output topics
enriched.to("purchase-events");
branches.get("split-errors").to("error-events");

Stateful Operations

// Count events per key (e.g., page views per URL)
KTable<String, Long> viewCounts = events
    .groupByKey()
    .count();

// Windowed aggregation: count per 5-minute window
KTable<Windowed<String>, Long> windowedCounts = events
    .groupByKey()
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
    .count();

// Join KStream with KTable (enrich events with user data)
KStream<String, String> enrichedOrders = orders
    .join(userTable, (order, user) -> order + " by " + user);

Key Takeaway: Use filter/map/flatMap for transformations, groupByKey().count() or aggregate() for aggregations, and join() to enrich streams with reference data. State is automatically backed by Kafka changelog topics.

Windowing

Windowing groups events into time-based buckets for aggregation. Kafka Streams supports three window types:

| Window Type | Behavior | Use Case |
| --- | --- | --- |
| Tumbling | Fixed-size, non-overlapping | "Count per 5 minutes" — each event in exactly one window |
| Hopping | Fixed-size, overlapping | "5-min window, advance every 1 min" — events in multiple windows |
| Session | Dynamic, based on activity gap | "Group events until 30 min of inactivity" — user sessions |

🔍 Deep Dive: State Stores and Fault Tolerance

Stateful operations (count, aggregate, join) maintain local state in RocksDB on disk. This state is backed by Kafka changelog topics — if your application crashes, a new instance rebuilds the state by replaying the changelog. This is automatic and transparent. You get fault-tolerant, distributed state without an external database.
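That local state can also be read from inside the application via interactive queries. A sketch: the store name "view-counts" is hypothetical and assumes the aggregation was materialized under it, and `streams` is a running KafkaStreams instance.

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

// Assumes the topology named its store, e.g.:
//   events.groupByKey().count(Materialized.as("view-counts"));  // hypothetical
ReadOnlyKeyValueStore<String, Long> store = streams.store(
    StoreQueryParameters.fromNameAndType(
        "view-counts", QueryableStoreTypes.keyValueStore()));

// Latest count for this key, served from the local RocksDB instance.
Long count = store.get("/home");
```

This is what makes a Kafka Streams app usable as a lightweight queryable view, without copying the state into an external database.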

Kafka Streams Processing Topology:

Source (input topic) → Filter (predicate) → Map (transform) → GroupBy (aggregate) → Sink (output topic)

Practice Exercises

Medium: Build a Mini Project

Combine concepts from this tutorial to build a small utility or tool.

Medium: Debug Challenge

Introduce a bug in one of the code examples and practice finding and fixing it.

Hard: Refactoring Exercise

Rewrite one example using a different approach and compare the tradeoffs.