Road to Delta Lake
Presentation Content (to follow along)
Preface
We’re here to teach you big data skills, but in reality...
Single-Node vs. Cluster: not everything is Big Data!
Batch vs. Streaming: streaming isn’t always the solution (bonus)!
History of Big Data - Revisited
Databases
- single node servers, vertical scaling only
- mostly used for operational data, transactions
On-Prem Data Warehouses
- horizontal scaling possible
- utilized massively parallel processing (MPP) to run big queries
- SQL-only interface, low interoperability
On-Prem Data Lakes
- Hadoop ecosystem
- Data Processing via MapReduce, Hive, Spark, etc.
- Difficult for data governance and data integrity
Some Problems with Big Data Workloads
ACID Transactions
- No ACID transaction guarantees
- Corrupt/incomplete data
- Editability, schema evolution
- GDPR requests
- Data versioning and auditability (very important for getting Machine Learning models approved by regulators)
Some More Problems with Big Data Workloads
Batch vs Streaming conundrum
- Lambda Architecture: Having to maintain different code bases and data stores for batch and for streaming
- Kappa Architecture: route everything through the streaming pipeline...but the retention period is limited, and the data is a pain to manage, query, and edit
- Moral of the story: streaming/Kappa architecture isn’t a silver bullet
Modern Solutions
- This presentation is actually from early/mid 2019 (ignore the upload date on YouTube)
- Delta Lake’s gotten even better since then!
- Executive Summary:
- Delta Lake is just Apache Parquet with a transaction log on top!
- Keep using your preferred storage technology (e.g. S3, Azure Data Lake Storage)
- Allows fine-grained updates and deletes on big data (important for GDPR)
- Allows data versioning, time travel, rollbacks, audit history
- Optimizes the layout of your underlying Parquet files (via OPTIMIZE) for query performance
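The "Parquet plus a transaction log" idea can be sketched in plain Python. This is a toy model, not the real Delta Lake protocol: immutable "data files" plus an append-only log of add/remove actions, where replaying the log up to a given version gives you time travel. All class and file names here are illustrative.

```python
import json

class ToyDeltaTable:
    """Toy stand-in for a Delta-style table: immutable data files
    plus an append-only transaction log (simplified, not the real protocol)."""

    def __init__(self):
        self.files = {}  # path -> list of rows (stands in for Parquet files)
        self.log = []    # one JSON-encoded commit per version

    def commit(self, adds, removes=()):
        """Publish a new version: stage files first, then append a single
        commit entry, so readers see all of the change or none of it."""
        for path, rows in adds.items():
            self.files[path] = rows
        actions = [{"add": p} for p in adds] + [{"remove": p} for p in removes]
        self.log.append(json.dumps(actions))
        return len(self.log) - 1  # version number

    def snapshot(self, version=None):
        """Replay the log up to `version` to find live files (time travel)."""
        if version is None:
            version = len(self.log) - 1
        live = set()
        for entry in self.log[: version + 1]:
            for action in json.loads(entry):
                if "add" in action:
                    live.add(action["add"])
                else:
                    live.discard(action["remove"])
        return [row for path in sorted(live) for row in self.files[path]]

t = ToyDeltaTable()
t.commit({"part-0.parquet": [{"id": 1}, {"id": 2}]})                   # version 0
t.commit({"part-1.parquet": [{"id": 3}]})                              # version 1
t.commit({"part-2.parquet": [{"id": 2}]}, removes=["part-0.parquet"])  # "delete" id=1
```

Note how a delete never mutates a data file: it rewrites the affected file and records an add/remove pair in the log, which is also what makes rollbacks and audit history cheap.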
Delta Lake Key Features
ACID Transactions
- Ensures data integrity using the transaction log
Updates, Deletes, Upserts
- Perform Data Manipulation Language (DML) commands such as updates/deletes/upserts
- Important for GDPR and for improving/maintaining the data quality of existing tables
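In Delta Lake itself an upsert is a single SQL `MERGE INTO ... WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT` statement. The semantics can be sketched in plain Python over lists of dicts (a toy illustration, not Delta's implementation):

```python
def merge_upsert(target, updates, key="id"):
    """Toy MERGE/upsert: rows whose key matches an existing row are
    updated in place; unmatched rows are inserted as new rows."""
    by_key = {row[key]: dict(row) for row in target}
    for row in updates:
        # merge update columns over the existing row, or insert if new
        by_key[row[key]] = {**by_key.get(row[key], {}), **row}
    return [by_key[k] for k in sorted(by_key)]

table = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
changes = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
merged = merge_upsert(table, changes)  # id=2 updated, id=3 inserted
```

A GDPR erasure request is the same mechanism in reverse: a `DELETE FROM table WHERE user_id = ...` that rewrites only the affected files.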
Advanced Metadata handling
- Enhanced capturing and utilization of metadata
- Improves query performance compared to standard Parquet
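One concrete way that metadata speeds up queries is data skipping: Delta Lake records per-file column statistics (such as min/max values) in the transaction log, so the engine can prune files before reading any Parquet data. A minimal sketch of the pruning rule, with made-up file names and stats:

```python
def files_to_scan(file_stats, lo, hi):
    """Return only the files whose [min, max] value range can overlap
    the query range [lo, hi]; all other files are skipped entirely."""
    return [f for f, (fmin, fmax) in file_stats.items()
            if not (fmax < lo or fmin > hi)]

# Hypothetical per-file min/max stats for one column:
stats = {"part-0": (1, 100), "part-1": (101, 200), "part-2": (201, 300)}
needed = files_to_scan(stats, 150, 160)  # only part-1 must be read
```

Plain Parquet stores similar statistics inside each file's footer, but the reader must still open every file to see them; keeping the stats centrally in the log is what makes the pruning cheap.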
Schema Enforcement & Evolution
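Schema enforcement means writes that don't match the table's schema are rejected rather than silently corrupting it; schema evolution (e.g. Delta's `mergeSchema` option) lets you opt in to intentional changes such as new columns. A toy sketch of that behavior, using column names only (an illustration, not Delta's actual validation logic):

```python
def write_with_schema(table_schema, rows, merge_schema=False):
    """Toy schema enforcement: reject rows with columns outside the table
    schema. With merge_schema=True, new columns are instead absorbed into
    the schema (evolution). Returns the resulting schema, sorted."""
    schema = set(table_schema)
    for row in rows:
        extra = set(row) - schema
        if extra:
            if not merge_schema:
                raise ValueError(f"schema mismatch: unexpected columns {sorted(extra)}")
            schema |= extra  # evolve: accept and record the new columns
    return sorted(schema)

# Matching write succeeds; a new "email" column is rejected by default
# but accepted when evolution is explicitly requested:
write_with_schema(["id", "name"], [{"id": 1, "name": "a"}])
evolved = write_with_schema(["id", "name"],
                            [{"id": 2, "name": "b", "email": "b@x.com"}],
                            merge_schema=True)
```

The key design point is that evolution is opt-in per write, so accidental schema drift still fails loudly.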