Fix the Broken Pipeline
PySpark coding challenge · Difficulty: medium · +75 XP
Table: orders
+-------------+---------+
| Column      | Type    |
+-------------+---------+
| order_id    | INT     |
| customer_id | INT     |
| amount      | DOUBLE  |
+-------------+---------+
Table: customers
+-------------+---------+
| Column      | Type    |
+-------------+---------+
| customer_id | INT     |
| name        | VARCHAR |
+-------------+---------+
Problem
-------
Your ETL pipeline joins orders with customers. A bug produces duplicate rows:
every order with customer_id=1 appears twice, inflating revenue on dashboards.
Fix the join so each order appears exactly once.
Example Input
-------------
orders (5 rows):
+----------+-------------+--------+
| order_id | customer_id | amount |
+----------+-------------+--------+
| 1        | 1           | 250.0  |
| 2        | 1           | 300.0  |
| 3        | 2           | 150.0  |
| 4        | 3           | 450.0  |
| 5        | 1           | 100.0  |
+----------+-------------+--------+
customers (3 rows):
+-------------+-------+
| customer_id | name  |
+-------------+-------+
| 1           | Alice |
| 2           | Bob   |
| 3           | Carol |
+-------------+-------+
Expected Output (5 rows, no duplicates)
+----------+-------+--------+
| order_id | name  | amount |
+----------+-------+--------+
| 1        | Alice | 250.0  |
| 2        | Alice | 300.0  |
| 3        | Bob   | 150.0  |
| 4        | Carol | 450.0  |
| 5        | Alice | 100.0  |
+----------+-------+--------+
Constraints
-----------
• Use a single join — no cross join
• Result must have exactly 5 rows
• Order by order_id ASC