Optimize the 100M-Row Join
PYSPARK coding challenge · Difficulty: hard · +200 XP
DataFrames
----------
orders (100M rows):
+------------+---------+
| Column | Type |
+------------+---------+
| order_id | BIGINT | | customer_id| INT | | amount | DOUBLE |
+------------+---------+
customers (1,000 rows — small):
+-------------+---------+
| Column | Type |
+-------------+---------+
| customer_id | INT | | region | VARCHAR |
+-------------+---------+
Problem
-------
A production join of orders (100M rows)
with customers (1,000 rows) is running for
40+ minutes causing pipeline SLA breaches.
Optimize it to run in under 2 minutes.
Example Input
-------------
orders (sample):
+----------+-------------+--------+
| order_id | customer_id | amount |
+----------+-------------+--------+
| 1 | 101 | 250.00 | | 2 | 102 | 300.00 | | 3 | 101 | 150.00 |
+----------+-------------+--------+
customers (sample):
+-------------+--------+
| customer_id | region |
+-------------+--------+
| 101 | APAC | | 102 | US |
+-------------+--------+
Expected Output
+----------+--------+--------+
| order_id | region | amount |
+----------+--------+--------+
| 1 | APAC | 250.00 | | 2 | US | 300.00 | | 3 | APAC | 150.00 |
+----------+--------+--------+
Key Insight
-----------
When one DataFrame is small (<10 MB),
use broadcast join to avoid shuffle:
from pyspark.sql.functions import broadcast
result = orders.join(
broadcast(customers), "customer_id"
)
Constraints
-----------
• Must use broadcast join
• No sort-merge join allowed
• Result rows = total orders count