Optimize the 100M-Row Join

PYSPARK coding challenge · Difficulty: hard · +200 XP

DataFrames

----------

orders (100M rows):

+------------+---------+

| Column     | Type    |

+------------+---------+

| order_id   | BIGINT  |
| customer_id| INT     |
| amount     | DOUBLE  |

+------------+---------+

customers (1,000 rows — small):

+-------------+---------+

| Column      | Type    |

+-------------+---------+

| customer_id | INT     |
| region      | VARCHAR |

+-------------+---------+

Problem

-------

A production join of orders (100M rows)

with customers (1,000 rows) is running for

40+ minutes causing pipeline SLA breaches.

Optimize it to run in under 2 minutes.

Example Input

-------------

orders (sample):

+----------+-------------+--------+

| order_id | customer_id | amount |

+----------+-------------+--------+

|    1     |    101      | 250.00 |
|    2     |    102      | 300.00 |
|    3     |    101      | 150.00 |

+----------+-------------+--------+

customers (sample):

+-------------+--------+

| customer_id | region |

+-------------+--------+

|     101     | APAC   |
|     102     | US     |

+-------------+--------+

Expected Output

+----------+--------+--------+

| order_id | region | amount |

+----------+--------+--------+

|    1     | APAC   | 250.00 |
|    2     | US     | 300.00 |
|    3     | APAC   | 150.00 |

+----------+--------+--------+

Key Insight

-----------

When one DataFrame is small (<10 MB),

use broadcast join to avoid shuffle:

from pyspark.sql.functions import broadcast

result = orders.join(

broadcast(customers), "customer_id"

)

Constraints

-----------

• Must use broadcast join

• No sort-merge join allowed

• Result rows = total orders count