Deduplicate and Aggregate User Actions with Latest Session

PYSPARK coding challenge · Difficulty: hard · +200 XP

DataFrame: user_actions

+-------------+-----------+

| Column      | Type      |

+-------------+-----------+

| user_id     | INT       |
| session_id  | STRING    |
| action_time | TIMESTAMP |
| action_type | STRING    |

+-------------+-----------+

Problem

-------

The logging system inserts DUPLICATE rows

for the same (user_id, session_id) pair.

Step 1 — Deduplicate:

Keep only the row with the LATEST

action_time for each (user_id, session_id)

Step 2 — Aggregate per user_id:

• latest_session_time → MAX(action_time)

• unique_session_count → COUNT of

distinct session_ids

Return: user_id, latest_session_time,

unique_session_count

Order: user_id ASC

Example Input (8 rows, 3 duplicates)

+---------+---------+------------------+------+

| user_id | sess_id | action_time      | type |

+---------+---------+------------------+------+

|   101   |   S1    | 2024-01-01 10:00 | click|
|   101   |   S1    | 2024-01-01 10:05 | scroll← dup
|   101   |   S2    | 2024-01-01 11:00 | click|
|   102   |   S3    | 2024-01-02 09:00 | click|
|   102   |   S3    | 2024-01-02 09:30 | scroll← dup
|   103   |   S4    | 2024-01-03 08:00 | click|
|   103   |   S5    | 2024-01-03 09:00 | scroll|
|   103   |   S5    | 2024-01-03 09:15 | click← dup

+---------+---------+------------------+------+

After Dedup → 5 rows:

S1 keeps 10:05 row, S3 keeps 09:30,

S5 keeps 09:15

Expected Output

+---------+---------------------+----------------------+

| user_id | latest_session_time | unique_session_count |

+---------+---------------------+----------------------+

|   101   | 2024-01-01 11:00:00 |          2           |
|   102   | 2024-01-02 09:30:00 |          1           |
|   103   | 2024-01-03 09:15:00 |          2           |

+---------+---------------------+----------------------+

Hint

----

w = Window.partitionBy('user_id','session_id')

.orderBy(col('action_time').desc())

Solve this challenge on PySpark.in