Optimize Average Rating Calculation for Products

PYSPARK coding challenge · Difficulty: medium · +100 XP

DataFrame: reviews

+-------------+---------+

| Column      | Type    |

+-------------+---------+

| review_id   | INT     |
| product_id  | INT     |
| rating      | INT     |
| review_date | DATE    |

+-------------+---------+

rating is between 1 and 5 (inclusive).

Problem

-------

For each product, compute:

• avg_rating → average of all ratings

(rounded to 2 decimals)

• review_count → total number of reviews

• latest_review→ most recent review_date

Only include products with at least

3 reviews.

Return: product_id, avg_rating,

review_count, latest_review

Order: avg_rating DESC, product_id ASC

Example Input

-------------

+-----------+------------+--------+------------+

| review_id | product_id | rating | date       |

+-----------+------------+--------+------------+

|     1     |    201     |   5    | 2024-01-10 |
|     2     |    201     |   4    | 2024-02-15 |
|     3     |    201     |   3    | 2024-03-20 |
|     4     |    202     |   5    | 2024-01-05 |
|     5     |    202     |   5    | 2024-02-10 |
|     6     |    202     |   4    | 2024-03-15 |
|     7     |    203     |   2    | 2024-01-20 |
|     8     |    203     |   3    | 2024-02-25 |

+-----------+------------+--------+------------+

Expected Output

+------------+------------+--------------+---------------+

| product_id | avg_rating | review_count | latest_review |

+------------+------------+--------------+---------------+

|    202     |    4.67    |      3       |  2024-03-15   |
|    201     |    4.00    |      3       |  2024-03-20   |

+------------+------------+--------------+---------------+

Explanation

-----------

Product 203: only 2 reviews → excluded

Product 202: avg = (5+5+4)/3 = 4.67

Product 201: avg = (5+4+3)/3 = 4.00

Solve this challenge on PySpark.in