Parse Apache Logs with Regex

PYSPARK coding challenge · Difficulty: medium · +120 XP

DataFrame: raw_logs

+--------+---------+
| Column | Type    |
+--------+---------+
| line   | STRING  |
+--------+---------+

Problem
-------

Parse raw Apache access log lines and extract structured fields using regex.

Log format:

IP - - [DATE] "METHOD /path HTTP/1.1" STATUS_CODE BYTES

Extract these columns:

ip, method, path, status, bytes

Then return only rows where:

status >= 400 (errors/failures)

Order: status ASC, ip ASC

Example Input
-------------

line (3 rows):

+------------------------------------------------------------+
| 192.168.1.1 - - [01/Jan/2024] "GET /api HTTP/1.1" 200 1234 |
| 10.0.0.1 - - [01/Jan/2024] "POST /login HTTP/1.1" 404 567  |
| 172.16.0.5 - - [02/Jan/2024] "GET /x HTTP/1.1" 500 89      |
+------------------------------------------------------------+

Expected Output (status >= 400 only)

+------------+--------+--------+--------+-------+
| ip         | method | path   | status | bytes |
+------------+--------+--------+--------+-------+
| 10.0.0.1   | POST   | /login |    404 |   567 |
| 172.16.0.5 | GET    | /x     |    500 |    89 |
+------------+--------+--------+--------+-------+

Hint
----

regex pattern:

r'(\d+\.\d+\.\d+\.\d+).*?"(\w+)\s+(\S+)\s+HTTP.*?"\s+(\d+)\s+(\d+)'
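
To see what each capture group picks up, the pattern can be tried outside Spark first. The snippet below is a minimal sketch using Python's standard re module on the three example lines; the names parse_line and sample_lines are illustrative only and not part of the challenge.

```python
import re

# The hint pattern on one line. Groups: ip, method, path, status, bytes.
LOG_PATTERN = re.compile(
    r'(\d+\.\d+\.\d+\.\d+).*?"(\w+)\s+(\S+)\s+HTTP.*?"\s+(\d+)\s+(\d+)'
)

# The three rows from the example input.
sample_lines = [
    '192.168.1.1 - - [01/Jan/2024] "GET /api HTTP/1.1" 200 1234',
    '10.0.0.1 - - [01/Jan/2024] "POST /login HTTP/1.1" 404 567',
    '172.16.0.5 - - [02/Jan/2024] "GET /x HTTP/1.1" 500 89',
]


def parse_line(line):
    """Return (ip, method, path, status, bytes), or None if there is no match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    ip, method, path, status, size = match.groups()
    return ip, method, path, int(status), int(size)


parsed = [parse_line(line) for line in sample_lines]
for row in parsed:
    print(row)
```

Note that the last group requires digits, so a line whose byte count is logged as "-" (which Apache can emit for responses without a body) would not match this pattern and would need separate handling.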
