Modern Data Lake

Delta Lake

Add ACID transactions, time travel, and schema enforcement to your data lake — built on top of Parquet and Spark.

Level: Beginner friendly. Format: Self-paced. Prerequisites: Basic PySpark / DataFrame knowledge.

What You'll Learn

  • Why traditional data lakes lack reliability and what Delta Lake fixes
  • How to create, read, and write Delta tables with PySpark
  • ACID transactions — how Delta prevents data corruption
  • Time travel — query historical versions of your data
  • Schema enforcement and schema evolution
  • MERGE (upsert) — updating records in a data lake at scale
  • Delta Lake optimisations: Z-ordering, OPTIMIZE, and VACUUM
  • Delta Live Tables (DLT) — declarative data pipelines on Databricks

Introduction to Delta Lake

Delta Lake is an open-source storage layer that brings database-like reliability to data lakes. A traditional data lake (plain Parquet files on S3 or HDFS) has a major gap: it does not support ACID transactions. If a Spark job fails halfway through a write, readers can see a half-written dataset. If two jobs write to the same location at the same time, their output can conflict or overwrite each other. Delta Lake addresses both problems by keeping a transaction log alongside the Parquet files.

Delta Lake's three headline features are: ACID transactions (a write either commits in full or leaves the table unchanged, like a database), Time Travel (query the table as it existed at an earlier version number or timestamp, for as long as the underlying files have not been removed by VACUUM), and Schema Enforcement (Delta rejects data that doesn't match the table schema, preventing silent data quality issues). Together these have made it a widely used table format for production data lakes.

Delta Lake is the foundation of the Databricks Lakehouse architecture and works with Apache Spark 3.x through the separate delta-spark package (it is not bundled with Spark itself). It stores data as Parquet files plus a _delta_log/ directory containing JSON commit files, with periodic Parquet checkpoints of the log. The MERGE operation (upsert) is one of its most powerful features: you can update millions of records in a data lake with a single statement, much like a database UPDATE.

Video Tutorials

Handpicked free YouTube videos to accelerate your understanding

Delta Lake Tutorial for Beginners

Databricks 45 min 🇬🇧 English

Official Databricks intro to Delta Lake — ACID transactions, time travel, schema enforcement, and the Lakehouse architecture explained.

Delta Lake MERGE, Time Travel & OPTIMIZE

Data with Zach 30 min 🇬🇧 English

Hands-on Delta Lake operations: upserts with MERGE INTO, querying historical data with time travel, and OPTIMIZE with Z-ORDER.

Delta Lake basics — write, time travel, and MERGE

Copy the code below into a Python environment that has PySpark and delta-spark installed (Spark also needs a Java runtime).

python
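# Install: pip install delta-spark pyspark

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable

# 1. Create SparkSession with Delta Lake support
builder = SparkSession.builder \
    .appName("DeltaLakeDemo") \
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")

# configure_spark_with_delta_pip adds the Delta Lake JARs that match the
# pip-installed delta-spark version; without it Spark cannot find the Delta classes
spark = configure_spark_with_delta_pip(builder).getOrCreate()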

# 2. Create a Delta table from a DataFrame
data = [(1, "Alice", 50000), (2, "Bob", 60000), (3, "Carol", 55000)]
df = spark.createDataFrame(data, ["id", "name", "salary"])
df.write.format("delta").mode("overwrite").save("/tmp/employees")

# 3. Read the Delta table
delta_df = spark.read.format("delta").load("/tmp/employees")
delta_df.show()

# 4. Update a record (ACID transaction!)
dt = DeltaTable.forPath(spark, "/tmp/employees")
dt.update(
    condition="id = 2",
    set={"salary": "70000"}
)

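# 5. TIME TRAVEL: read the original version (before the update)
# Version 0 is the first write only if /tmp/employees did not already hold
# a Delta table; otherwise check dt.history() for the version you want.
original = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("/tmp/employees")
original.show()   # Bob still shows salary = 60000!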

# 6. MERGE (upsert) — update existing rows, insert new ones
new_data = [(2, "Bob", 75000), (4, "Dave", 65000)]
new_df = spark.createDataFrame(new_data, ["id", "name", "salary"])

dt.alias("existing").merge(
    new_df.alias("updates"),
    "existing.id = updates.id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

# 7. View Delta table history
dt.history().select("version", "timestamp", "operation").show()

# 8. Optimise (compact small files + Z-order for faster queries)
spark.sql("OPTIMIZE delta.`/tmp/employees` ZORDER BY (id)")

Key Concepts Explained

Master these terms and you'll understand 80% of the conversations in this field.

ACID Transactions

Atomicity (all-or-nothing), Consistency, Isolation, Durability. Delta Lake records every write as a commit in a transaction log. If a Spark job fails mid-write, no commit is recorded, so the partial files are never visible and readers always see a consistent state.
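To make the all-or-nothing idea concrete, here is a toy sketch in plain Python (no Spark needed). It assumes a commit is a single numbered file that must be created exclusively; the `try_commit` helper and the payloads are invented for illustration, and this is not Delta Lake's actual implementation.

```python
import os
import tempfile


def try_commit(log_dir: str, version: int, payload: str) -> bool:
    """Try to write commit file <version>.json; return False if it already exists.

    Mode "x" creates the file only if it is absent, so two writers racing
    for the same version number cannot both succeed.
    """
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        with open(path, "x", encoding="utf-8") as fh:
            fh.write(payload)
        return True
    except FileExistsError:
        return False


log_dir = tempfile.mkdtemp()
first = try_commit(log_dir, 0, '{"writer": "job-A"}')
second = try_commit(log_dir, 0, '{"writer": "job-B"}')  # same version: rejected
retry = try_commit(log_dir, 1, '{"writer": "job-B"}')   # retried as the next version
print(first, second, retry)  # True False True
```

The losing writer does not corrupt anything: it simply learns that version 0 is taken and can retry against the newer table state.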

Transaction Log (_delta_log)

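A directory of JSON commit files, one per table version, recording every change (add file, remove file, update metadata) made to the table. Delta also writes periodic Parquet checkpoints so readers do not have to replay the whole log. The log is the source of truth for time travel and ACID guarantees.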
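The sketch below shows, under simplifying assumptions, how replaying add and remove actions yields the set of data files that make up the current table. The `live_files` helper is hypothetical, the sample actions carry only a `path` field, and checkpoints are ignored, so treat it as an illustration rather than a full log reader.

```python
import json
import os
import tempfile


def live_files(log_dir: str) -> set:
    """Replay add/remove actions from every JSON commit, oldest first."""
    files = set()
    commits = sorted(f for f in os.listdir(log_dir) if f.endswith(".json"))
    for name in commits:
        with open(os.path.join(log_dir, name), encoding="utf-8") as fh:
            for line in fh:  # one JSON action per line
                if not line.strip():
                    continue
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return files


# Build a tiny fake log: version 0 adds a file, version 1 rewrites it.
log_dir = tempfile.mkdtemp()
sample_commits = {
    0: [{"add": {"path": "part-0000.parquet"}}],
    1: [{"remove": {"path": "part-0000.parquet"}},
        {"add": {"path": "part-0001.parquet"}}],
}
for version, actions in sample_commits.items():
    with open(os.path.join(log_dir, f"{version:020d}.json"), "w", encoding="utf-8") as fh:
        fh.write("\n".join(json.dumps(a) for a in actions))

current = live_files(log_dir)
print(current)  # {'part-0001.parquet'}
```

Note that part-0000.parquet is only dropped from the log's view here. In a real table the file itself stays on disk until VACUUM removes it, which is what makes time travel possible.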

Time Travel

Query the table as it was at a previous version or timestamp: .option("versionAsOf", 5) or .option("timestampAsOf", "2024-01-01"). Used for auditing, debugging, and reproducibility.
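A small wrapper can make the two time travel options harder to misuse. This is a sketch: `time_travel_options` and `read_as_of` are hypothetical helper names, and `spark` is assumed to be a Delta-enabled SparkSession created elsewhere.

```python
from typing import Optional


def time_travel_options(version: Optional[int] = None,
                        timestamp: Optional[str] = None) -> dict:
    """Build the reader option for exactly one of version or timestamp."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return {"versionAsOf": str(version)}
    return {"timestampAsOf": timestamp}


def read_as_of(spark, path: str, **selector):
    """Load a Delta table as of a version or timestamp (needs a live session)."""
    options = time_travel_options(**selector)
    return spark.read.format("delta").options(**options).load(path)


print(time_travel_options(version=5))               # {'versionAsOf': '5'}
print(time_travel_options(timestamp="2024-01-01"))  # {'timestampAsOf': '2024-01-01'}
```

With a session available, `read_as_of(spark, "/tmp/employees", version=0).show()` would display the table as first written.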

Schema Enforcement

Delta Lake rejects writes that don't match the table's schema. If your table has column "salary" as INT and you try to write a string, the write fails — preventing silent corruption.
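The check can be pictured as a comparison between the table schema and the schema of the incoming write. The following is a simplified model in plain Python: the `schema_violations` helper and the type names are illustrative, and Delta's real rules cover more cases than this.

```python
def schema_violations(table_schema: dict, incoming_schema: dict) -> list:
    """Return the reasons an incoming write would be rejected (toy rules)."""
    problems = []
    for column, dtype in incoming_schema.items():
        if column not in table_schema:
            problems.append(f"unexpected column: {column}")
        elif table_schema[column] != dtype:
            problems.append(
                f"type mismatch on {column}: "
                f"table has {table_schema[column]}, write has {dtype}"
            )
    return problems


table = {"id": "bigint", "name": "string", "salary": "bigint"}
good = {"id": "bigint", "name": "string", "salary": "bigint"}
bad = {"id": "bigint", "name": "string", "salary": "string", "bonus": "bigint"}

print(schema_violations(table, good))  # []
for problem in schema_violations(table, bad):
    print(problem)
```

An empty list means the write would be accepted; any entry means the whole write is refused rather than partially applied.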

Schema Evolution

When you need to add new columns, use .option("mergeSchema", "true") to automatically extend the table schema. Existing rows get NULL for the new columns.
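As a rough model of what mergeSchema does, the sketch below appends new columns to a schema and shows older rows reading back with None for them. `evolved_schema` and `read_old_row` are hypothetical plain-Python helpers that simplify the real behaviour (for example, any type conflict is treated as an error). `append_with_schema_evolution` shows the PySpark call from the text and assumes `df` is a Spark DataFrame.

```python
def evolved_schema(table_schema: dict, incoming_schema: dict) -> dict:
    """Toy model of mergeSchema: keep existing columns, append new ones."""
    merged = dict(table_schema)
    for column, dtype in incoming_schema.items():
        if column in merged and merged[column] != dtype:
            raise ValueError(f"cannot change type of {column}")
        merged.setdefault(column, dtype)
    return merged


def read_old_row(row: dict, schema: dict) -> dict:
    """Rows written before the change have no value for the new columns."""
    return {column: row.get(column) for column in schema}


def append_with_schema_evolution(df, path: str) -> None:
    """The PySpark write from the text; not called here (needs Spark)."""
    df.write.format("delta").mode("append").option("mergeSchema", "true").save(path)


table = {"id": "bigint", "name": "string", "salary": "bigint"}
incoming = {"id": "bigint", "name": "string", "salary": "bigint", "department": "string"}

schema = evolved_schema(table, incoming)
old_row = read_old_row({"id": 1, "name": "Alice", "salary": 50000}, schema)
print(list(schema))           # ['id', 'name', 'salary', 'department']
print(old_row["department"])  # None
```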

MERGE (Upsert)

A single SQL/API operation that updates matching rows and inserts new ones. Replaces the brittle "read-modify-write" pattern. Critical for streaming updates to a data lake.
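For tables keyed on more than one column, the join condition can be generated rather than hand-written. This is a minimal sketch: `merge_condition` and `upsert` are hypothetical helpers, and `delta_table` is assumed to be a `DeltaTable` and `updates_df` a DataFrame with the same columns as the target.

```python
def merge_condition(keys, target: str = "existing", source: str = "updates") -> str:
    """Build the ON clause for a MERGE from a list of key columns."""
    if not keys:
        raise ValueError("at least one key column is required")
    return " AND ".join(f"{target}.{key} = {source}.{key}" for key in keys)


def upsert(delta_table, updates_df, keys) -> None:
    """Update matching rows and insert new ones (not called here; needs Spark)."""
    (delta_table.alias("existing")
        .merge(updates_df.alias("updates"), merge_condition(keys))
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())


print(merge_condition(["id"]))
# existing.id = updates.id
print(merge_condition(["country", "id"]))
# existing.country = updates.country AND existing.id = updates.id
```

Choosing keys that identify a row uniquely matters: if several source rows match the same target row, the merge can fail because the outcome would be ambiguous.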

Z-Ordering

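A data skipping optimisation that co-locates related data in the same Parquet files based on one or more columns. Queries that filter on those columns can then skip files whose min/max statistics rule them out. The speed-up depends on the data and the query; it tends to be largest for selective filters on large tables.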
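The effect of co-locating data can be shown with a toy model of file-level min/max statistics. The sketch sorts on a single column, which is simpler than a real Z-order curve over several columns, and every name in it is invented for illustration.

```python
import random


def chunk(values, size):
    """Split values into consecutive 'files' of the given size."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def file_stats(files):
    """Per-file (min, max) statistics, as a reader would use for skipping."""
    return [(min(f), max(f)) for f in files]


def files_to_scan(stats, value):
    """Count files whose [min, max] range could contain the value."""
    return sum(1 for low, high in stats if low <= value <= high)


random.seed(0)
ids = list(range(1000))
random.shuffle(ids)

unclustered = file_stats(chunk(ids, 100))        # rows land in arrival order
clustered = file_stats(chunk(sorted(ids), 100))  # rows co-located by id

print(files_to_scan(clustered, 42))    # 1
print(files_to_scan(unclustered, 42))  # usually most of the 10 files
```

With clustered data only one file can contain id 42, so the other nine are never opened. In the shuffled layout nearly every file's range spans the whole id space, so few or none can be skipped.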

VACUUM

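Deletes Parquet files that are no longer referenced by the current table version and are older than the retention period. Run periodically to reclaim storage: VACUUM delta.`/path/to/table` RETAIN 168 HOURS. Once files are vacuumed, time travel to the versions that needed them no longer works.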
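Because VACUUM is destructive, it can help to build the statement through a small guard. This is a sketch: `vacuum_sql` is a hypothetical helper, and it assumes the default 7-day (168-hour) retention as the lower bound, similar in spirit to the safety check Delta applies by default.

```python
DEFAULT_RETENTION_HOURS = 168  # 7 days


def vacuum_sql(path: str, retain_hours: int = DEFAULT_RETENTION_HOURS,
               dry_run: bool = False) -> str:
    """Build a VACUUM statement, refusing retention below the default."""
    if retain_hours < DEFAULT_RETENTION_HOURS:
        raise ValueError(
            f"retention of {retain_hours}h is below {DEFAULT_RETENTION_HOURS}h; "
            "shorter retention can break time travel and running readers"
        )
    statement = f"VACUUM delta.`{path}` RETAIN {retain_hours} HOURS"
    if dry_run:
        statement += " DRY RUN"  # list the files that would be deleted
    return statement


print(vacuum_sql("/tmp/employees"))
# VACUUM delta.`/tmp/employees` RETAIN 168 HOURS
print(vacuum_sql("/tmp/employees", dry_run=True))
# VACUUM delta.`/tmp/employees` RETAIN 168 HOURS DRY RUN
```

With a Delta-enabled session, the result can be passed to `spark.sql(...)`. Running the DRY RUN form first is a cheap way to see what would be removed.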

Your Delta Lake Learning Path

Follow these steps in order — each one builds on the last. Designed for complete beginners.

  1. Parquet & Data Lakes

    Understand Parquet file format, why data lakes use object storage (S3/ADLS), and the problems that arise without ACID transactions.

  2. PySpark DataFrames

    Be comfortable reading and writing data with Spark. Understand partitioning and the write modes (overwrite, append, ignore).

  3. Delta Lake Basics

    Create your first Delta table. Read, write, and update. Inspect the _delta_log directory to understand the transaction log.

  4. Time Travel & History

    Practice querying past versions with versionAsOf and timestampAsOf. Use dt.history() to audit changes.

  5. MERGE Operations

    Build an incremental pipeline that merges daily updates into a Delta table. Handle updates, inserts, and deletes in one operation.

  6. Performance Optimisation

    Run OPTIMIZE and ZORDER. Understand liquid clustering (Delta 3.x). Use VACUUM to manage storage.

  7. Streaming + Delta

    Use Spark Structured Streaming to write a real-time stream into a Delta table. Enable Change Data Feed (CDF) to track row-level changes.
