Log Analysis Dashboard
Parse raw Apache web server logs, transform them into a structured Spark DataFrame, and extract actionable insights — all in one runnable script.
Web servers generate thousands of log entries every minute. This project teaches you how to use PySpark's distributed engine to process those logs efficiently — a skill that scales from a few KB of sample data to terabytes in production.
You will learn how to: parse unstructured text with regex, build a typed Spark DataFrame, and run aggregations to answer real operational questions like "which endpoints are slowest?" or "is our error rate acceptable?".
Each line follows the Common Log Format used by Apache and Nginx:

IP - - [timestamp] "METHOD endpoint HTTP/1.1" status_code bytes

Sample records from the embedded dataset:
Use a regex pattern to extract IP address, timestamp, HTTP method, endpoint, status code, and bytes from each raw log line into structured columns.
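As a minimal sketch of that parsing step, a pattern along these lines matches the format shown above. The exact regex, the helper name, and the sample line are assumptions for illustration; the script's actual pattern may differ.

```python
import re

# Assumed pattern for the log format shown above; adjust to your data.
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ '        # IP address (the two remote-user fields are skipped)
    r'\[([^\]]+)\] '          # timestamp inside [...]
    r'"(\S+) (\S+) \S+" '     # HTTP method, endpoint, protocol
    r'(\d{3}) (\d+|-)$'       # status code and bytes ("-" when no body was sent)
)

def parse_line(line):
    """Return an (ip, timestamp, method, endpoint, status, bytes) tuple, or None."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    ip, ts, method, endpoint, status, size = m.groups()
    return (ip, ts, method, endpoint, int(status), 0 if size == "-" else int(size))

# Illustrative record (not taken from the embedded dataset):
sample = '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
print(parse_line(sample))
```

Returning None for malformed lines lets you filter them out before building the DataFrame instead of crashing mid-job.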
Convert the parsed RDD into a typed Spark DataFrame with an explicit schema. This enables SQL-style querying and distributed processing.
Group by status_code and count occurrences to see how many requests returned each status: 200 OK, 404 Not Found, 500 Internal Server Error, and so on.
Filter rows where status_code >= 400, count them, and calculate the error percentage. High error rates indicate server or client problems.
Identify the IP addresses making the most requests. This can reveal heavy users, bots, or potential DDoS sources.
Rank your URLs by visit count to understand which pages or API routes get the most traffic.
Sum the bytes column across all requests to measure total bandwidth consumed — useful for capacity planning.