
Road to Delta Lake

Presentation Content (to follow along)

Preface

[Image: not-everything-is-big-data.png]

We’re here to teach you big data skills, but in reality...

Single-Node vs. Cluster: not everything is Big Data!

When it comes to Batch vs Streaming, streaming isn’t always the solution (Bonus)!

History of Big Data - Revisited

Databases

  • single-node servers, vertical scaling only
  • mostly used for operational data and transactions

On-Prem Data Warehouses

  • horizontal scaling possible
  • utilized massively parallel processing (MPP) to run big queries
  • SQL-only interface, low interoperability

On-Prem Data Lakes

  • Hadoop ecosystem
  • Data Processing via MapReduce, Hive, Spark, etc.
  • Difficult to enforce data governance and data integrity

Some Problems with Big Data Workloads

Lack of ACID Transactions

  • No ACID transaction guarantees on plain data lakes
  • Failed or concurrent jobs can leave corrupt/incomplete data behind
  • Limited editability and painful schema evolution
  • Hard to satisfy GDPR deletion/correction requests
  • No data versioning or auditability (very important for getting Machine Learning models approved by regulators)

Some More Problems with Big Data Workloads

Batch vs Streaming conundrum

  • Lambda Architecture: having to maintain separate code bases and data stores for batch and for streaming
  • Kappa Architecture: replay everything through the streaming pipeline...but the retention period is limited, and the stream is a pain to manage, query, and edit
    • Moral of the story: streaming/Kappa architecture isn’t a silver bullet

Modern Solutions

  • This presentation is actually from early/mid 2019 (ignore the upload date on YouTube); Delta Lake has gotten even better since then!
  • Executive Summary (see the code sketch below):
    • Delta Lake is just Apache Parquet with a transaction log on top!
    • Keep using your preferred storage technology (e.g. S3, Azure Data Lake Storage)
    • Allows fine-grained updates and deletes on big data (GDPR)
    • Allows data versioning, time travel, rollbacks, and audit history
    • Optimizes the layout of your underlying Parquet files (via OPTIMIZE) for query performance
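
To make the summary concrete, here is a minimal PySpark sketch (not from the original presentation): it assumes the delta-spark package is on the Spark classpath, and the path /tmp/events, the app name, and the toy user_id/name data are all placeholders.

```python
# A minimal sketch, assuming delta-spark is installed; "/tmp/events" and
# the sample data are placeholders, not from the original presentation.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-demo")
    # Standard configs that enable Delta Lake's SQL support.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Writing as "delta" just produces Parquet files plus a _delta_log/
# directory holding the JSON transaction log.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["user_id", "name"])
df.write.format("delta").save("/tmp/events")

# Fine-grained delete, e.g. a GDPR erasure request for a single user.
spark.sql("DELETE FROM delta.`/tmp/events` WHERE user_id = 1")

# Time travel: read the table as it looked before the delete.
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events").show()

# Audit history of every commit on the table.
spark.sql("DESCRIBE HISTORY delta.`/tmp/events`").show()

# Compact small files for query performance (open-source Delta Lake
# supports OPTIMIZE from version 1.2 onward).
spark.sql("OPTIMIZE delta.`/tmp/events`")
```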

Delta Lake Key Features

Link

  • ACID Transactions

    • Ensures data integrity, using the transaction log
  • Updates, Deletes, Upserts

    • Perform Data Manipulation Language (DML) commands such as UPDATE, DELETE, and MERGE (upsert); see the sketch after this list
    • Important for GDPR compliance and for improving/maintaining the data quality of existing tables
  • Advanced Metadata Handling

    • Enhanced capturing and utilization of metadata
    • Improves query performance compared to standard Parquet
  • Schema Enforcement & Evolution

    • Rejects writes whose schema doesn’t match the table’s by default; the schema can be evolved explicitly when intended (also covered in the sketch after this list)
  • Demo video
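
Picking up the DML and schema bullets above, here is a hedged sketch of an upsert via MERGE and of schema enforcement/evolution; it reuses the hypothetical /tmp/events table and spark session from the earlier sketch.

```python
# A sketch of an upsert (MERGE) and schema evolution, reusing the
# hypothetical /tmp/events table and `spark` session from above.
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/tmp/events")
updates = spark.createDataFrame([(2, "bobby"), (3, "carol")],
                                ["user_id", "name"])

# Upsert: update matching rows, insert the rest, all in one atomic commit.
(target.alias("t")
 .merge(updates.alias("u"), "t.user_id = u.user_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Schema enforcement: appending a DataFrame with an extra column fails by
# default; schema evolution must be requested explicitly via mergeSchema.
extra = spark.createDataFrame([(4, "dave", "US")],
                              ["user_id", "name", "country"])
(extra.write.format("delta").mode("append")
 .option("mergeSchema", "true").save("/tmp/events"))
```

Without the mergeSchema option, the final append would be rejected with a schema-mismatch error; that rejection is the "enforcement" half of the feature.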

Extra Reading (Bonus)