Optimize Small DataFrame Join with Broadcast
PYSPARK coding challenge · Difficulty: easy · +50 XP
DataFrames
----------
transactions (large):
+----------------+---------+
| Column | Type |
+----------------+---------+
| transaction_id | INT | | product_id | INT | | quantity | INT | | price | DOUBLE |
+----------------+---------+
products (small, ~100 rows):
+------------+---------+
| Column | Type |
+------------+---------+
| product_id | INT | | name | VARCHAR | | category | VARCHAR |
+------------+---------+
Problem
-------
Join transactions with the small products
lookup table.
Use broadcast join to avoid shuffle of
the large transactions DataFrame.
Return: transaction_id, product_name,
category, total_value
(quantity × price)
Order: transaction_id ASC
Example Input
-------------
transactions:
+----+------------+----------+-------+
| id | product_id | quantity | price |
+----+------------+----------+-------+
| 1 | 101 | 3 | 25.00 | | 2 | 102 | 1 | 99.99 | | 3 | 101 | 2 | 25.00 |
+----+------------+----------+-------+
products:
+------------+----------+----------+
| product_id | name | category |
+------------+----------+----------+
| 101 | Widget A | Gadgets | | 102 | Pro Pen | Office |
+------------+----------+----------+
Expected Output
+----+----------+----------+-------+
| id | name | category | total |
+----+----------+----------+-------+
| 1 | Widget A | Gadgets | 75.00 | | 2 | Pro Pen | Office | 99.99 | | 3 | Widget A | Gadgets | 50.00 |
+----+----------+----------+-------+
Hint
----
from pyspark.sql.functions import broadcast
df = transactions.join(
broadcast(products), "product_id"
)