Log Analysis Dashboard
Parse raw Apache web server logs, transform them into a structured Spark DataFrame, and extract actionable insights — all in one runnable script.
Web servers generate thousands of log entries every minute. This project teaches you how to use PySpark's distributed engine to process those logs efficiently — a skill that scales from a few KB of sample data to terabytes in production.
You will learn how to: parse unstructured text with regex, build a typed Spark DataFrame, and run aggregations to answer real operational questions like "which endpoints are slowest?" or "is our error rate acceptable?".
Each line follows the Common Log Format used by Apache and Nginx:

IP - - [timestamp] "METHOD endpoint HTTP/1.1" status_code bytes

Sample records from the embedded dataset:
Use a regex pattern to extract IP address, timestamp, HTTP method, endpoint, status code, and bytes from each raw log line into structured columns.
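As a minimal sketch of that parsing step, a pattern along these lines matches the format shown above. The exact regex, the helper name, and the sample line are assumptions for illustration; the script's actual pattern may differ.

```python
import re

# Assumed pattern for the log format shown above; adjust to your data.
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ '        # IP address (the two remote-user fields are skipped)
    r'\[([^\]]+)\] '          # timestamp inside [...]
    r'"(\S+) (\S+) \S+" '     # HTTP method, endpoint, protocol
    r'(\d{3}) (\d+|-)$'       # status code and bytes ("-" when no body was sent)
)

def parse_line(line):
    """Return an (ip, timestamp, method, endpoint, status, bytes) tuple, or None."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    ip, ts, method, endpoint, status, size = m.groups()
    return (ip, ts, method, endpoint, int(status), 0 if size == "-" else int(size))

# Illustrative record (not taken from the embedded dataset):
sample = '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
print(parse_line(sample))
```

Returning None for malformed lines lets you filter them out before building the DataFrame instead of crashing mid-job.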
Convert the parsed RDD into a typed Spark DataFrame with an explicit schema. This enables SQL-style querying and distributed processing.
Group by status_code and count occurrences to see how many requests returned each status: 200 OK, 404 Not Found, 500 Internal Server Error, and so on.
Filter rows where status_code >= 400, count them, and calculate the error percentage. High error rates indicate server or client problems.
Identify the IP addresses making the most requests. This can reveal heavy users, bots, or potential DDoS sources.
Rank your URLs by visit count to understand which pages or API routes get the most traffic.
Sum the bytes column across all requests to measure total bandwidth consumed — useful for capacity planning.