Real-Time Data

Spark Streaming

Process real-time data streams continuously — from IoT sensors and Kafka topics to live dashboards.

Beginner Friendly Self-Paced Prerequisites: Basic PySpark DataFrame knowledge
Start Learning Spark Streaming

What You'll Learn

  • What streaming data is and when to use streaming vs batch processing
  • Spark Structured Streaming — reading streams with readStream
  • Output modes: append, update, and complete
  • Windowed aggregations — count events in the last 5 minutes
  • Watermarks — handling late-arriving data gracefully
  • Reading from Apache Kafka — the most common streaming source
  • Writing to Delta Lake, console, files, and Kafka
  • Exactly-once semantics — guaranteeing no duplicate processing

Introduction to Spark Streaming

Spark Structured Streaming is the modern streaming API in Apache Spark that lets you process continuous data streams — events arriving every second — using the same DataFrame API you already know from batch processing. Instead of waiting for all data to arrive before processing, streaming processes data as it arrives, enabling real-time pipelines that can update dashboards, detect fraud, or trigger alerts within seconds.

The core abstraction is the "unbounded table" — Spark treats the incoming stream as a table that continuously grows. You write a query just like a batch query (filter, aggregation, join), and Spark automatically processes new data in micro-batches or continuously. This unified programming model means you can reuse 90% of your batch Spark code for streaming.

In production, Spark Structured Streaming is most commonly used with Apache Kafka as the source (millions of events per second) and Delta Lake or a database as the sink. Common use cases include: real-time fraud detection (flag suspicious transactions as they happen), IoT pipeline (process sensor data from thousands of devices), clickstream analytics (track user behaviour in near real-time), and change data capture (replicate database changes to a data lake continuously).

Video Tutorials

Handpicked free YouTube videos to accelerate your understanding

🎧 Playing in English

Spark Structured Streaming — Full Tutorial

Databricks 40 min 🇬🇧 English

Build real-time streaming pipelines with PySpark Structured Streaming — from socket source to Kafka and Delta Lake in production.

🎧 Playing in English

Apache Kafka + PySpark Streaming End-to-End

Data Engineering Simplified 35 min 🇬🇧 English

Connect Apache Kafka to PySpark. Process real-time events with windowing, watermarks, and write results to a Delta Lake table.

Real-time word count from a TCP socket stream

Copy the code below and paste it into your Python environment or our free online compiler.

python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, window, current_timestamp

# 1. Create SparkSession
spark = SparkSession.builder \
    .appName("StreamingWordCount") \
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

# 2. Read a streaming source (socket for testing)
#    In production, replace with: spark.readStream.format("kafka")...
lines = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# 3. Add timestamp for windowing
lines_with_time = lines.withColumn("timestamp", current_timestamp())

# 4. Split lines into words
words = lines_with_time.select(
    explode(split(lines_with_time.value, " ")).alias("word"),
    "timestamp"
)

# 5. Windowed aggregation: count each word in 10-second windows
windowed_counts = words \
    .withWatermark("timestamp", "20 seconds") \
    .groupBy(
        window("timestamp", "10 seconds", "5 seconds"),  # 10s window, 5s slide
        "word"
    ) \
    .count() \
    .orderBy("window")

# 6. Write output to console (for testing)
query = windowed_counts.writeStream \
    .outputMode("update")  \
    .format("console") \
    .option("truncate", False) \
    .trigger(processingTime="5 seconds") \
    .start()

# 7. Wait for termination
query.awaitTermination()

# In a separate terminal, run: nc -lk 9999
# Then type words and press Enter to simulate a stream
Want to run this code in your browser — no setup needed? Open Free Compiler →

Key Concepts Explained

Master these terms and you'll understand 80% of the conversations in this field.

readStream

The entry point for Spark Structured Streaming. Returns an unbounded DataFrame that continuously receives new rows. Similar to spark.read but for live streams.

writeStream

Defines the output sink and starts the streaming query. You specify format (console, kafka, delta), outputMode, and trigger interval.

Output Mode

Append: only write new rows (no aggregation). Update: write only updated rows (for aggregations). Complete: write the entire aggregated result each trigger (for small aggregates).

Trigger

Controls when Spark processes the next micro-batch. processingTime="5 seconds" = process every 5 seconds. availableNow = process all available data once then stop (for incremental batch).

Watermark

Tells Spark how long to wait for late data. withWatermark("timestamp", "10 minutes") means Spark waits 10 minutes for late events before finalising window results.

Window

Aggregates over a time range. A 5-minute tumbling window counts events in non-overlapping 5-minute buckets. A sliding window overlaps: a 5-min window sliding every 1 min gives per-minute rolling counts.

Kafka Source/Sink

The most common production streaming source. Kafka topics provide durable, replayable event streams. Spark reads each message as a row with "key", "value", "timestamp", "offset" columns.

Checkpoint

Spark saves the streaming state (offsets, aggregation state) to a checkpoint directory. If the job restarts, it resumes from where it left off — enabling exactly-once guarantees.

Your Spark Streaming Learning Path

Follow these steps in order — each one builds on the last. Designed for complete beginners.

  1. 1

    Spark Batch Basics

    Be comfortable with PySpark DataFrames, SparkSession, and common operations like filter, groupBy, and join before moving to streaming.

  2. 2

    First Streaming Job

    Set up a local socket stream. Run the word count example. Understand readStream, writeStream, and output modes.

  3. 3

    File Source Streaming

    Point readStream at a directory. Spark auto-processes new files as they arrive. Good for incremental landing zones on S3.

  4. 4

    Windowing & Watermarks

    Build a windowed aggregation. Add a watermark to handle late data. Understand the trade-off between latency and correctness.

  5. 5

    Kafka Integration

    Set up a local Kafka broker (Docker). Write a producer. Connect Spark to the Kafka topic and process the stream.

  6. 6

    Delta Lake Streaming Sink

    Write your stream output to a Delta table. Understand how Delta's transaction log enables exactly-once writes.

  7. 7

    Production Deployment

    Run streaming jobs on Databricks or AWS EMR. Configure auto-scaling, checkpointing, and monitoring with Spark UI.

Ready to master Spark Streaming?

Explore our free tutorials, hands-on code examples, and interview questions. No sign-up. No paywalls. Forever free.