Parse Apache Logs with Regex
PySpark coding challenge · Difficulty: medium · +120 XP
DataFrame: raw_logs
+--------+---------+
| Column | Type    |
+--------+---------+
| line   | STRING  |
+--------+---------+
Problem
-------
Parse raw Apache access log lines and
extract structured fields using regex.
Log format:
IP - - [DATE] "METHOD /path HTTP/1.1" STATUS_CODE BYTES
Extract these columns:
ip, method, path, status, bytes
Then return only rows where:
status >= 400 (errors/failures)
Order: status ASC, ip ASC
Example Input
-------------
line (3 rows):
+------------------------------------------------------------+
| 192.168.1.1 - - [01/Jan/2024] "GET /api HTTP/1.1" 200 1234 |
| 10.0.0.1 - - [01/Jan/2024] "POST /login HTTP/1.1" 404 567  |
| 172.16.0.5 - - [02/Jan/2024] "GET /x HTTP/1.1" 500 89      |
+------------------------------------------------------------+
Expected Output (status >= 400 only)
------------------------------------
+------------+--------+--------+--------+-------+
| ip         | method | path   | status | bytes |
+------------+--------+--------+--------+-------+
| 10.0.0.1   | POST   | /login | 404    | 567   |
| 172.16.0.5 | GET    | /x     | 500    | 89    |
+------------+--------+--------+--------+-------+
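The expected rows can be cross-checked without Spark. The following is a minimal plain-Python sketch, not the reference solution: it applies the pattern from the Hint section to the three sample lines (as reconstructed from the example input), keeps status >= 400, and sorts by status, then ip. The names `LOG_RE`, `SAMPLE_LINES`, and `error_rows` are illustrative only. Converting status and bytes to integers and comparing ip as a plain string are assumptions, since the card does not state output types.

```python
import re

# The hint pattern, joined onto a single line.
# Groups in order: ip, method, path, status, bytes.
LOG_RE = re.compile(
    r'(\d+\.\d+\.\d+\.\d+).*?"(\w+)\s+(\S+)\s+HTTP.*?"\s+(\d+)\s+(\d+)'
)

SAMPLE_LINES = [
    '192.168.1.1 - - [01/Jan/2024] "GET /api HTTP/1.1" 200 1234',
    '10.0.0.1 - - [01/Jan/2024] "POST /login HTTP/1.1" 404 567',
    '172.16.0.5 - - [02/Jan/2024] "GET /x HTTP/1.1" 500 89',
]


def error_rows(lines):
    """Parse each line, keep status >= 400, sort by (status, ip)."""
    rows = []
    for line in lines:
        match = LOG_RE.search(line)
        if match is None:
            continue  # skip lines that do not follow the log format
        ip, method, path, status, size = match.groups()
        # Assumption: status and bytes are treated as integers.
        if int(status) >= 400:
            rows.append((ip, method, path, int(status), int(size)))
    # Assumption: ip is ordered as a plain string, not numerically by octet.
    return sorted(rows, key=lambda row: (row[3], row[0]))


for row in error_rows(SAMPLE_LINES):
    print(row)
```

The 200 response is filtered out, leaving the 404 row followed by the 500 row.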
Hint
----
regex pattern:
r'(\d+\.\d+\.\d+\.\d+).*?"(\w+)\s+(\S+)\s+HTTP.*?"\s+(\d+)\s+(\d+)'
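The pattern has five capture groups, which line up in order with ip, method, path, status, and bytes. One possible way to apply it in PySpark is `regexp_extract`, called once per group. The sketch below is a hedged outline under stated assumptions rather than the official solution: `PATTERN`, `GROUPS`, and `parse_error_rows` are illustrative names, the cast of status and bytes to INT is an assumption (the card does not give output types), and every input line is assumed to match the format.

```python
# Hint pattern on one line; groups: 1=ip, 2=method, 3=path, 4=status, 5=bytes.
PATTERN = r'(\d+\.\d+\.\d+\.\d+).*?"(\w+)\s+(\S+)\s+HTTP.*?"\s+(\d+)\s+(\d+)'

# Capture-group index for each output column, in output order.
GROUPS = {"ip": 1, "method": 2, "path": 3, "status": 4, "bytes": 5}


def parse_error_rows(raw_logs):
    """Return ip, method, path, status, bytes for rows with status >= 400.

    `raw_logs` is expected to be a DataFrame with one STRING column, `line`.
    """
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark.sql import functions as F

    # One regexp_extract call per capture group, aliased to the column name.
    parsed = raw_logs.select(
        *[
            F.regexp_extract("line", PATTERN, idx).alias(name)
            for name, idx in GROUPS.items()
        ]
    )
    return (
        parsed
        # Assumption: numeric fields are cast to INT before comparing.
        # Lines that do not match yield empty strings and are not handled here.
        .withColumn("status", F.col("status").cast("int"))
        .withColumn("bytes", F.col("bytes").cast("int"))
        .filter(F.col("status") >= 400)
        .orderBy(F.col("status").asc(), F.col("ip").asc())
    )
```

Calling `parse_error_rows(raw_logs).show()` on the sample input would be expected to list the 404 row before the 500 row. Casting before the filter keeps the comparison numeric rather than lexicographic.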