Spark Streaming
Process real-time data streams continuously — from IoT sensors and Kafka topics to live dashboards.
What You'll Learn
- What streaming data is and when to use streaming vs batch processing
- Spark Structured Streaming — reading streams with readStream
- Output modes: append, update, and complete
- Windowed aggregations — count events in the last 5 minutes
- Watermarks — handling late-arriving data gracefully
- Reading from Apache Kafka — the most common streaming source
- Writing to Delta Lake, console, files, and Kafka
- Exactly-once semantics — guaranteeing no duplicate processing
Introduction to Spark Streaming
Spark Structured Streaming is the modern streaming API in Apache Spark that lets you process continuous data streams — events arriving every second — using the same DataFrame API you already know from batch processing. Instead of waiting for all data to arrive before processing, streaming processes data as it arrives, enabling real-time pipelines that can update dashboards, detect fraud, or trigger alerts within seconds.
The core abstraction is the "unbounded table": Spark treats the incoming stream as a table that continuously grows. You write a query just like a batch query (filter, aggregation, join), and Spark runs it incrementally as new data arrives, in micro-batches by default or, for a limited set of queries, in an experimental continuous mode. This unified programming model means most batch DataFrame code carries over to streaming, although a few operations (for example, sorting a stream that has not been aggregated) are not supported.
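One way to picture the unbounded-table model is a plain-Python sketch with no Spark involved: a counter that folds in one micro-batch at a time ends up with the same result as a batch count over every row seen so far. The names `batch_count` and `IncrementalCount` are illustrative only and are not Spark APIs.

```python
from collections import Counter


def batch_count(rows):
    """Batch view: count every word in a finite, fully available dataset."""
    return Counter(rows)


class IncrementalCount:
    """Streaming view: keep running state and fold in one micro-batch at a time."""

    def __init__(self):
        self.state = Counter()

    def process(self, micro_batch):
        # Only the new rows are touched; earlier rows are never re-read.
        self.state.update(micro_batch)
        return Counter(self.state)


micro_batches = [["spark", "kafka"], ["spark"], ["delta", "spark"]]

stream = IncrementalCount()
for batch in micro_batches:
    result = stream.process(batch)

# The same rows, treated as one bounded table.
all_rows = [row for batch in micro_batches for row in batch]
print(result == batch_count(all_rows))  # True
```

The incremental version never revisits old rows, which is roughly what lets a streaming engine keep up with an input that never ends.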
In production, Spark Structured Streaming is most commonly used with Apache Kafka as the source (millions of events per second) and Delta Lake or a database as the sink. Common use cases include real-time fraud detection (flag suspicious transactions as they happen), IoT pipelines (process sensor data from thousands of devices), clickstream analytics (track user behaviour in near real-time), and change data capture (replicate database changes to a data lake continuously).
Real-time word count from a TCP socket stream
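Copy the code below into a local Python environment with PySpark installed. Start the socket server described at the end of the listing first, because the job connects to port 9999 as soon as it starts.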
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, window, current_timestamp

# 1. Create SparkSession
spark = SparkSession.builder \
    .appName("StreamingWordCount") \
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

# 2. Read a streaming source (socket for testing)
# In production, replace with: spark.readStream.format("kafka")...
lines = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# 3. Add timestamp for windowing
# The socket source carries no event time, so arrival time stands in for it here.
lines_with_time = lines.withColumn("timestamp", current_timestamp())

# 4. Split lines into words
words = lines_with_time.select(
    explode(split(lines_with_time.value, " ")).alias("word"),
    "timestamp"
)

# 5. Windowed aggregation: count each word in 10-second windows
# No orderBy here: sorting a streaming aggregation is only allowed
# in complete output mode, and this query uses update mode.
windowed_counts = words \
    .withWatermark("timestamp", "20 seconds") \
    .groupBy(
        window("timestamp", "10 seconds", "5 seconds"),  # 10s window, 5s slide
        "word"
    ) \
    .count()

# 6. Write output to console (for testing)
query = windowed_counts.writeStream \
    .outputMode("update") \
    .format("console") \
    .option("truncate", False) \
    .trigger(processingTime="5 seconds") \
    .start()

# 7. Wait for termination
query.awaitTermination()
# In a separate terminal, run: nc -lk 9999
# Then type words and press Enter to simulate a stream
Key Concepts Explained
These terms come up in most discussions of Structured Streaming, so they are worth learning first.
readStream
The entry point for Spark Structured Streaming. spark.readStream returns a DataStreamReader; calling load() on it yields an unbounded DataFrame that continuously receives new rows. Similar to spark.read but for live streams.
writeStream
Defines the output sink for a streaming query. You specify format (console, kafka, delta), outputMode, and trigger interval; the query begins running when you call start().
Output Mode
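Append: write only new rows that will never change afterwards; aggregations are allowed only with a watermark, and each window is emitted once after the watermark passes its end. Update: write only the rows that changed since the last trigger (typical for aggregations). Complete: write the entire aggregated result on every trigger (aggregation queries only, best for small results).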
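The difference between update and complete mode can be sketched in plain Python by comparing the aggregate table before and after one trigger. This is a simplified model, not Spark code: `emit` is a hypothetical helper, and append mode is left out because for aggregations it also depends on the watermark.

```python
def emit(mode, previous, current):
    """Return the rows a sink would receive for one trigger.

    previous and current map a grouping key to its aggregate value
    before and after the trigger.
    """
    if mode == "complete":
        # The entire result table is rewritten on every trigger.
        return dict(current)
    if mode == "update":
        # Only keys whose value is new or changed in this trigger.
        return {k: v for k, v in current.items() if previous.get(k) != v}
    raise ValueError(f"unsupported mode: {mode}")


before = {"spark": 2, "kafka": 1}
after = {"spark": 3, "kafka": 1, "delta": 1}

print(emit("update", before, after))    # {'spark': 3, 'delta': 1}
print(emit("complete", before, after))  # {'spark': 3, 'kafka': 1, 'delta': 1}
```

Update mode sends less data per trigger, so it tends to suit sinks that can upsert by key, while complete mode suits small result tables that are overwritten wholesale.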
Trigger
Controls when Spark processes the next micro-batch. processingTime="5 seconds" starts a micro-batch every 5 seconds. availableNow=True processes all data available when the query starts, possibly over several micro-batches, and then stops (useful for incremental batch jobs).
Watermark
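Tells Spark how long to wait for late data. withWatermark("timestamp", "10 minutes") means Spark tracks the latest event time it has seen and keeps accepting events up to 10 minutes older than that. Events that arrive later than this may be dropped, and windows that end before the watermark are finalised and their state is cleared.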
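The watermark rule can be illustrated with a small plain-Python model. It assumes a simplified version of the behaviour: the watermark is the maximum event time seen in earlier micro-batches minus the allowed delay, and anything older is rejected. `WatermarkFilter` and `at` are hypothetical names, and the date is arbitrary.

```python
from datetime import datetime, timedelta


class WatermarkFilter:
    """Tracks the maximum event time seen and rejects events behind the watermark."""

    def __init__(self, delay):
        self.delay = delay
        self.max_event_time = None

    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.delay

    def process(self, batch):
        """Split one micro-batch of event times into (accepted, dropped)."""
        # The watermark is fixed for the whole batch and advances afterwards.
        watermark = self.watermark()
        accepted = [t for t in batch if watermark is None or t >= watermark]
        dropped = [t for t in batch if watermark is not None and t < watermark]
        if batch:
            newest = max(batch)
            if self.max_event_time is None or newest > self.max_event_time:
                self.max_event_time = newest
        return accepted, dropped


def at(hhmm):
    """Build a datetime on an arbitrary fixed day from an HH:MM string."""
    return datetime.strptime("2024-01-01 " + hhmm, "%Y-%m-%d %H:%M")


wm = WatermarkFilter(delay=timedelta(minutes=10))
wm.process([at("12:00"), at("12:15")])  # watermark becomes 12:05
accepted, dropped = wm.process([at("12:08"), at("12:02")])

print([t.strftime("%H:%M") for t in accepted])  # ['12:08']
print([t.strftime("%H:%M") for t in dropped])   # ['12:02']
```

A longer delay accepts more late events but keeps window state in memory for longer, which is the latency versus correctness trade-off in miniature.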
Window
Aggregates over a time range. A 5-minute tumbling window counts events in non-overlapping 5-minute buckets. A sliding window overlaps: a 5-minute window sliding every 1 minute produces a rolling 5-minute count that is refreshed every minute, so each event is counted in five windows.
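To make window assignment concrete, here is a plain-Python sketch that lists the windows containing a given event. It assumes windows are half-open intervals aligned to the Unix epoch, which as far as I know matches Spark's default; `windows_for` is an illustrative helper, not a Spark function.

```python
def windows_for(event_ts, window_s, slide_s=None):
    """Return the [start, end) windows, in epoch seconds, that contain event_ts."""
    # A tumbling window is a sliding window whose slide equals its length.
    slide_s = slide_s or window_s
    # Start of the latest window that begins at or before the event.
    start = event_ts - (event_ts % slide_s)
    starts = []
    # Walk backwards while the window still covers the event.
    while start > event_ts - window_s:
        starts.append(start)
        start -= slide_s
    return [(s, s + window_s) for s in sorted(starts)]


# An event 430 seconds after the epoch (00:07:10).
print(windows_for(430, 300))           # [(300, 600)]
print(len(windows_for(430, 300, 60)))  # 5
```

The sliding case returns five windows because the window length (300 s) divided by the slide (60 s) is five, which is why sliding windows multiply the amount of state to keep.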
Kafka Source/Sink
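The most common production streaming source. Kafka topics provide durable, replayable event streams. Spark reads each message as a row with key, value, topic, partition, offset, timestamp, and timestampType columns. The key and value columns arrive as binary, so they usually need a cast or deserialisation step.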
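A minimal sketch of a Kafka read might look like the following. It assumes an existing SparkSession that was started with the spark-sql-kafka connector package available; the broker address `localhost:9092`, the topic name `transactions`, and both helper functions are hypothetical and exist only for illustration.

```python
def kafka_source_options(bootstrap_servers, topic, starting_offsets="latest"):
    """Build the option map passed to Spark's Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def read_kafka_events(spark, bootstrap_servers, topic):
    """Return a streaming DataFrame with key and value decoded as strings.

    `spark` is assumed to be a SparkSession with the Kafka connector loaded.
    """
    raw = (
        spark.readStream
        .format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    # key and value arrive as binary, so cast them before further parsing.
    return raw.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "topic",
        "timestamp",
    )


# Hypothetical broker address and topic name, for illustration only.
options = kafka_source_options("localhost:9092", "transactions")
print(options["subscribe"])  # transactions
```

Keeping the options in a separate function makes them easy to inspect or override per environment; the value column would typically be parsed further (for example as JSON) before any aggregation.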
Checkpoint
Spark saves the streaming state (offsets, aggregation state) to a checkpoint directory. If the job restarts, it resumes from where it left off. Combined with a replayable source and an idempotent or transactional sink, this enables end-to-end exactly-once guarantees.
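The interplay between checkpointed offsets and an idempotent sink can be modelled in plain Python. This is a deliberately simplified sketch, not how Spark lays out its checkpoint directory: `run` is a hypothetical function that commits the next offset to a JSON file after each micro-batch and can be told to crash after writing to the sink but before committing.

```python
import json
import os
import tempfile


def run(log, checkpoint_path, sink, batch_size=2, crash_on_batch=None):
    """Process `log` in micro-batches, committing the next offset after each one."""
    offset = 0
    if os.path.exists(checkpoint_path):
        # Resume from the last committed offset.
        with open(checkpoint_path) as f:
            offset = json.load(f)["offset"]
    batch_no = 0
    while offset < len(log):
        batch = log[offset:offset + batch_size]
        # Idempotent write: rows are keyed by offset, so a replayed batch
        # overwrites its earlier output instead of duplicating it.
        for i, record in enumerate(batch, start=offset):
            sink[i] = record.upper()
        if crash_on_batch == batch_no:
            raise RuntimeError("simulated crash before commit")
        offset += len(batch)
        with open(checkpoint_path, "w") as f:
            json.dump({"offset": offset}, f)
        batch_no += 1


log = ["a", "b", "c", "d", "e"]
sink = {}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "offsets.json")
    try:
        run(log, path, sink, crash_on_batch=1)  # second batch is written but not committed
    except RuntimeError:
        pass
    run(log, path, sink)  # restart: replays the uncommitted batch, then finishes

print(len(sink))  # 5
```

The second batch is processed twice, yet the sink ends up with exactly one row per input record, which is the sense in which replay plus idempotent writes gives exactly-once results.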
Your Spark Streaming Learning Path
Follow these steps in order — each one builds on the last. Designed for complete beginners.
1. Spark Batch Basics
Be comfortable with PySpark DataFrames, SparkSession, and common operations like filter, groupBy, and join before moving to streaming.
2. First Streaming Job
Set up a local socket stream. Run the word count example. Understand readStream, writeStream, and output modes.
3. File Source Streaming
Point readStream at a directory. Spark auto-processes new files as they arrive. Good for incremental landing zones on S3.
4. Windowing & Watermarks
Build a windowed aggregation. Add a watermark to handle late data. Understand the trade-off between latency and correctness.
5. Kafka Integration
Set up a local Kafka broker (Docker). Write a producer. Connect Spark to the Kafka topic and process the stream.
6. Delta Lake Streaming Sink
Write your stream output to a Delta table. Understand how Delta's transaction log enables exactly-once writes.
7. Production Deployment
Run streaming jobs on Databricks or AWS EMR. Configure auto-scaling, checkpointing, and monitoring with Spark UI.