Fix the Broken Pipeline

PYSPARK coding challenge · Difficulty: medium · +75 XP

Table: orders

+-------------+---------+
| Column      | Type    |
+-------------+---------+
| order_id    | INT     |
| customer_id | INT     |
| amount      | DOUBLE  |
+-------------+---------+

Table: customers

+-------------+---------+
| Column      | Type    |
+-------------+---------+
| customer_id | INT     |
| name        | VARCHAR |
+-------------+---------+

Problem
-------

Your ETL pipeline joins orders with customers. A bug produces
duplicate rows: every order with customer_id=1 appears twice,
inflating revenue on dashboards. Fix the join so each order
appears exactly once.

Example Input
-------------

orders (5 rows):

+----------+-------------+--------+
| order_id | customer_id | amount |
+----------+-------------+--------+
|    1     |      1      | 250.0  |
|    2     |      1      | 300.0  |
|    3     |      2      | 150.0  |
|    4     |      3      | 450.0  |
|    5     |      1      | 100.0  |
+----------+-------------+--------+

customers (3 rows):

+-------------+-------+
| customer_id | name  |
+-------------+-------+
|      1      | Alice |
|      2      | Bob   |
|      3      | Carol |
+-------------+-------+

Expected Output (5 rows, no duplicates)
---------------------------------------

+----------+-------+--------+
| order_id | name  | amount |
+----------+-------+--------+
|    1     | Alice | 250.0  |
|    2     | Alice | 300.0  |
|    3     | Bob   | 150.0  |
|    4     | Carol | 450.0  |
|    5     | Alice | 100.0  |
+----------+-------+--------+

Constraints
-----------

• Use a single join — no cross join
• Result must have exactly 5 rows
• Order by order_id ASC
