Optimize Small DataFrame Join with Broadcast

PYSPARK coding challenge · Difficulty: easy · +50 XP

DataFrames

----------

transactions (large):

+----------------+---------+

| Column         | Type    |

+----------------+---------+

| transaction_id | INT     |
| product_id     | INT     |
| quantity       | INT     |
| price          | DOUBLE  |

+----------------+---------+

products (small, ~100 rows):

+------------+---------+

| Column     | Type    |

+------------+---------+

| product_id | INT     |
| name       | VARCHAR |
| category   | VARCHAR |

+------------+---------+

Problem

-------

Join transactions with the small products

lookup table.

Use broadcast join to avoid shuffle of

the large transactions DataFrame.

Return: transaction_id, product_name,

category, total_value

(quantity × price)

Order: transaction_id ASC

Example Input

-------------

transactions:

+----+------------+----------+-------+

| id | product_id | quantity | price |

+----+------------+----------+-------+

|  1 |    101     |    3     | 25.00 |
|  2 |    102     |    1     | 99.99 |
|  3 |    101     |    2     | 25.00 |

+----+------------+----------+-------+

products:

+------------+----------+----------+

| product_id | name     | category |

+------------+----------+----------+

|    101     | Widget A | Gadgets  |
|    102     | Pro Pen  | Office   |

+------------+----------+----------+

Expected Output

+----+----------+----------+-------+

| id | name     | category | total |

+----+----------+----------+-------+

|  1 | Widget A | Gadgets  | 75.00 |
|  2 | Pro Pen  | Office   | 99.99 |
|  3 | Widget A | Gadgets  | 50.00 |

+----+----------+----------+-------+

Hint

----

from pyspark.sql.functions import broadcast

df = transactions.join(

broadcast(products), "product_id"

)

Solve this challenge on PySpark.in