Deduplicate and Aggregate User Actions with Latest Session
PYSPARK coding challenge · Difficulty: hard · +200 XP
DataFrame: user_actions
+-------------+-----------+
| Column | Type |
+-------------+-----------+
| user_id | INT | | session_id | STRING | | action_time | TIMESTAMP | | action_type | STRING |
+-------------+-----------+
Problem
-------
The logging system inserts DUPLICATE rows
for the same (user_id, session_id) pair.
Step 1 — Deduplicate:
Keep only the row with the LATEST
action_time for each (user_id, session_id)
Step 2 — Aggregate per user_id:
• latest_session_time → MAX(action_time)
• unique_session_count → COUNT of
distinct session_ids
Return: user_id, latest_session_time,
unique_session_count
Order: user_id ASC
Example Input (8 rows, 3 duplicates)
+---------+---------+------------------+------+
| user_id | sess_id | action_time | type |
+---------+---------+------------------+------+
| 101 | S1 | 2024-01-01 10:00 | click| | 101 | S1 | 2024-01-01 10:05 | scroll← dup | 101 | S2 | 2024-01-01 11:00 | click| | 102 | S3 | 2024-01-02 09:00 | click| | 102 | S3 | 2024-01-02 09:30 | scroll← dup | 103 | S4 | 2024-01-03 08:00 | click| | 103 | S5 | 2024-01-03 09:00 | scroll| | 103 | S5 | 2024-01-03 09:15 | click← dup
+---------+---------+------------------+------+
After Dedup → 5 rows:
S1 keeps 10:05 row, S3 keeps 09:30,
S5 keeps 09:15
Expected Output
+---------+---------------------+----------------------+
| user_id | latest_session_time | unique_session_count |
+---------+---------------------+----------------------+
| 101 | 2024-01-01 11:00:00 | 2 | | 102 | 2024-01-02 09:30:00 | 1 | | 103 | 2024-01-03 09:15:00 | 2 |
+---------+---------------------+----------------------+
Hint
----
w = Window.partitionBy('user_id','session_id')
.orderBy(col('action_time').desc())