Optimize Average Rating Calculation for Products
PYSPARK coding challenge · Difficulty: medium · +100 XP
DataFrame: reviews
+-------------+---------+
| Column | Type |
+-------------+---------+
| review_id | INT | | product_id | INT | | rating | INT | | review_date | DATE |
+-------------+---------+
rating is between 1 and 5 (inclusive).
Problem
-------
For each product, compute:
• avg_rating → average of all ratings
(rounded to 2 decimals)
• review_count → total number of reviews
• latest_review→ most recent review_date
Only include products with at least
3 reviews.
Return: product_id, avg_rating,
review_count, latest_review
Order: avg_rating DESC, product_id ASC
Example Input
-------------
+-----------+------------+--------+------------+
| review_id | product_id | rating | date |
+-----------+------------+--------+------------+
| 1 | 201 | 5 | 2024-01-10 | | 2 | 201 | 4 | 2024-02-15 | | 3 | 201 | 3 | 2024-03-20 | | 4 | 202 | 5 | 2024-01-05 | | 5 | 202 | 5 | 2024-02-10 | | 6 | 202 | 4 | 2024-03-15 | | 7 | 203 | 2 | 2024-01-20 | | 8 | 203 | 3 | 2024-02-25 |
+-----------+------------+--------+------------+
Expected Output
+------------+------------+--------------+---------------+
| product_id | avg_rating | review_count | latest_review |
+------------+------------+--------------+---------------+
| 202 | 4.67 | 3 | 2024-03-15 | | 201 | 4.00 | 3 | 2024-03-20 |
+------------+------------+--------------+---------------+
Explanation
-----------
Product 203: only 2 reviews → excluded
Product 202: avg = (5+5+4)/3 = 4.67
Product 201: avg = (5+4+3)/3 = 4.00