What is PySpark?

1.1 The Big Picture

What problem does PySpark solve?

Imagine you have a massive dataset—say 500 GB of user clickstream data—but your laptop only has 16 GB of RAM. How can you process it? That's exactly the problem PySpark solves.

PySpark is the Python API for Apache Spark, a distributed computing engine. Instead of processing all data on one machine, Spark divides the work across multiple machines in a cluster, with each machine handling a portion of the data simultaneously.

Here's a simple analogy: If you need to sort 1 million books, you have two options:

  • Sort them alone (one person, one machine) — slow and memory-limited
  • Hire 100 people, give each person 10,000 books, have them sort their pile in parallel, then merge the results — fast and scalable

This is distributed computing in a nutshell.
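The book-sorting analogy can be sketched in plain Python. This is a toy model, not PySpark: it uses a thread pool as the "workers", sorts each pile independently, then merges the pre-sorted piles, which is exactly the split/parallel-work/merge pattern described above.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge
import random

random.seed(42)
books = random.sample(range(1_000_000), 10_000)  # 10,000 "books" to sort

num_workers = 4
# Give each worker an equal slice of the pile.
chunks = [books[i::num_workers] for i in range(num_workers)]

# Step 1: each worker sorts its own pile in parallel.
with ThreadPoolExecutor(max_workers=num_workers) as pool:
    sorted_piles = list(pool.map(sorted, chunks))

# Step 2: merge the pre-sorted piles into one ordered result.
result = list(merge(*sorted_piles))
print(result[:3])
```

Spark does the same thing at cluster scale: partitions instead of piles, executors instead of threads, and a shuffle/merge step to combine partial results.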

1.2 Pandas vs. PySpark

Pandas works great for small to medium datasets that fit in your machine's RAM. But when your data exceeds your available memory, you need PySpark. Here's the comparison:

| Feature | Pandas | PySpark |
|---|---|---|
| Data size | Up to ~10 GB (single machine) | Petabytes (across a cluster) |
| Processing | Eager (runs immediately) | Lazy (builds a plan first) |
| Execution | Single machine | Distributed across multiple machines |
| Use case | Data analysis, small ETL | Big data ETL, ML at scale |
| Learning curve | Easier for beginners | Requires understanding distributed concepts |
| Speed (small data) | Faster (no overhead) | Slower (cluster overhead) |
| Speed (big data) | Crashes or very slow | Scales linearly with cluster size |
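The "eager vs. lazy" row is the difference you feel first in practice. Here is the same filter-and-count in both APIs; the Pandas half runs as written, while the PySpark half is shown as a sketch (it assumes an existing SparkSession named `spark`):

```python
import pandas as pd

# Pandas: each line executes immediately and holds its result in RAM.
pdf = pd.DataFrame({"age": [22, 35, 41], "city": ["NY", "LA", "NY"]})
adults = pdf[pdf["age"] > 30]           # runs now
counts = adults.groupby("city").size()  # runs now
print(counts.to_dict())

# PySpark equivalent (sketch -- assumes `spark` is a live SparkSession):
# from pyspark.sql.functions import col
# sdf = spark.createDataFrame(pdf)
# result = sdf.filter(col("age") > 30).groupBy("city").count()  # nothing runs yet
# result.show()                                                 # execution starts here
```

Note how similar the two APIs look; the difference is *when* the work happens, not how you express it.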

1.3 Spark Architecture (Simplified)

Understanding Spark's architecture helps you write better, more efficient code. Here are the key components:

Driver Program

  • This is your main program—the brain that coordinates everything
  • When you write spark.read.csv(...), the Driver creates an execution plan

Cluster Manager

  • Acts like HR for your cluster
  • Manages resources (CPU, memory) across all machines
  • Options: YARN, Kubernetes, Mesos, or Standalone

Executors

  • The workers in your cluster
  • Each executor runs on a separate machine and processes a chunk of your data
  • They perform the actual computation

Tasks

  • The smallest unit of work
  • Each executor runs multiple tasks in parallel
  • Example: If you have 100 data partitions and 10 executors, each executor handles ~10 tasks

How it works together:

1. You write: df.filter(col('age') > 30).groupBy('city').count()

2. Driver creates an execution plan (called a DAG)

3. Cluster Manager allocates executors

4. Data splits into partitions across executors

5. Each executor processes its partitions in parallel

6. Results are collected back to the Driver
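Steps 4–6 can be modeled in a few lines of plain Python. This is a toy simulation of the execution flow, not real Spark: two list slices stand in for partitions, a thread pool stands in for two executors, and the final sum plays the role of collecting results back to the Driver.

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

# Toy records: (name, age, city)
rows = [("Ann", 34, "NY"), ("Bob", 28, "LA"), ("Cat", 45, "NY"), ("Dan", 31, "LA")]

# Step 4: data splits into partitions (here: 2 partitions for 2 "executors").
partitions = [rows[0::2], rows[1::2]]

def process_partition(part):
    # Step 5: each "executor" filters and partially aggregates its own partition.
    return Counter(city for _, age, city in part if age > 30)

with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(process_partition, partitions))

# Step 6: partial results are combined back at the "Driver".
total = sum(partials, Counter())
print(dict(total))
```

Real Spark adds what this sketch omits: the Catalyst-optimized plan from step 2 and a shuffle between the filter and the final aggregation.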

1.4 Lazy Evaluation (The Key Concept)

This is the most important concept in PySpark. Unlike Pandas, PySpark does NOT execute your code line by line. Instead, it builds a plan (called a DAG — Directed Acyclic Graph) and only executes when you explicitly ask for results.

Transformations (Lazy Operations):

  • Define what to do but do NOT execute immediately
  • Examples: filter(), select(), groupBy(), join(), withColumn()
  • These just add steps to the plan

Actions (Trigger Execution):

  • Force Spark to execute the entire plan and return results
  • Examples: show(), count(), collect(), write()
  • These kick off actual computation
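To make the transformation/action split concrete, here is a minimal toy model of lazy evaluation in plain Python (not real PySpark): transformations only append steps to a recorded plan, and nothing touches the data until an action runs.

```python
class LazyPipeline:
    """Toy model of Spark's lazy evaluation -- not the real PySpark API."""

    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []  # the "DAG": a list of pending steps

    def filter(self, pred):
        # Transformation: record the step, do NOT run it.
        return LazyPipeline(self._data, self._plan + [("filter", pred)])

    def collect(self):
        # Action: only now does the whole recorded plan execute.
        result = self._data
        for op, fn in self._plan:
            if op == "filter":
                result = [x for x in result if fn(x)]
        return result

ages = LazyPipeline([22, 30, 35, 41])
over_30 = ages.filter(lambda a: a > 30)  # instant: just extends the plan
print(over_30._plan)                     # one pending step, no work done yet
print(over_30.collect())                 # [35, 41] -- execution happens here
```

Because the real Spark planner sees all the recorded steps at once, it can rearrange and combine them before doing any work; that is the optimization described below.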

Example:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# -------------------------------
# 1. Create Spark Session
# -------------------------------
spark = SparkSession.builder \
    .appName("PySpark DataFrame Example") \
    .getOrCreate()

# -------------------------------
# 2. Create DataFrame (instead of reading a CSV)
# -------------------------------
data = [
    ("Alice", 30, "HR"),
    ("Bob", 22, "IT"),
    ("Charlie", 28, "IT"),
    ("David", 35, "Finance"),
    ("Eve", 24, "HR"),
    ("Frank", 40, "Finance"),
]
columns = ["name", "age", "department"]
df = spark.createDataFrame(data, columns)

print("=== Original Data ===")
df.show()  # action: triggers execution

# -------------------------------
# 3. Filter Data (transformation -- lazy)
# -------------------------------
filtered = df.filter(col("age") > 25)
print("=== Filtered Data (age > 25) ===")
filtered.show()

# -------------------------------
# 4. Group By Department (transformation -- lazy)
# -------------------------------
grouped = filtered.groupBy("department").count()

# -------------------------------
# 5. Trigger Execution (action)
# -------------------------------
print("=== Grouped Data ===")
grouped.show()

# -------------------------------
# 6. Stop Spark Session
# -------------------------------
spark.stop()
```

Why does this matter?

Lazy evaluation lets Spark optimize your entire pipeline before running it. Spark's Catalyst Optimizer might:

  • Reorder operations for efficiency
  • Skip unnecessary columns from being read
  • Push filters down to read less data overall

This optimization is why PySpark can handle massive datasets efficiently.

Best Practice: Transformations build a plan. Actions execute it. This lazy approach lets Spark optimize intelligently before any computation starts.

Written by Vishal Taneja