Road to Delta Lake
Presentation Content (to follow along)
Preface
We’re here to teach you big data skills, but in reality...
Single-Node vs. Cluster: not everything is Big Data!
Batch vs. Streaming: streaming isn’t always the solution (bonus)!
History of Big Data - Revisited
Databases
- single node servers, vertical scaling only
- mostly used for operational data, transactions
On-Prem Data Warehouses
- horizontal scaling possible
- utilized massively parallel processing (MPP) to run big queries
- SQL-only interface, low interoperability
On-Prem Data Lakes
- Hadoop ecosystem
- Data Processing via MapReduce, Hive, Spark, etc.
- Difficult for data governance and data integrity
Some Problems with Big Data Workloads
ACID Transactions
- No ACID transaction guarantees
- Corrupt/incomplete data
- Editability, schema evolution
- GDPR requests
- Data versioning and auditability (very important for getting Machine Learning models approved by regulators)
Some More Problems with Big Data Workloads
Batch vs Streaming conundrum
- Lambda Architecture: Having to maintain different code bases and data stores for batch and for streaming
- Kappa Architecture: route everything through the streaming pipeline...but the retention period is limited, and the data is a pain to manage, query, and edit
- Moral of the story: streaming/Kappa architecture isn’t a silver bullet
Modern Solutions
- This presentation is actually from early/mid 2019 (ignore the upload date on YouTube)
- Delta Lake’s gotten even better since then!
- Executive Summary:
- Delta Lake is just Apache Parquet with a transaction log on top!
- Keep using your preferred storage technology (e.g. S3, Azure Data Lake Storage)
- Allows fine-grained updates and deletes on big data (important for GDPR)
- Allows data versioning, time travel, rollbacks, audit history
- Optimizes the layout of your underlying Parquet files (via OPTIMIZE) for query performance
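The "Parquet plus a transaction log" idea can be sketched in plain Python. This is a toy model, not the real Delta Lake protocol: immutable "data files" plus an append-only log of add/remove actions, where replaying the log up to a given version gives you time travel. All class and file names here are illustrative.

```python
import json

class ToyDeltaTable:
    """Toy stand-in for a Delta-style table: immutable data files
    plus an append-only transaction log (simplified, not the real protocol)."""

    def __init__(self):
        self.files = {}  # path -> list of rows (stands in for Parquet files)
        self.log = []    # one JSON-encoded commit per version

    def commit(self, adds, removes=()):
        """Publish a new version: stage files first, then append a single
        commit entry, so readers see all of the change or none of it."""
        for path, rows in adds.items():
            self.files[path] = rows
        actions = [{"add": p} for p in adds] + [{"remove": p} for p in removes]
        self.log.append(json.dumps(actions))
        return len(self.log) - 1  # version number

    def snapshot(self, version=None):
        """Replay the log up to `version` to find live files (time travel)."""
        if version is None:
            version = len(self.log) - 1
        live = set()
        for entry in self.log[: version + 1]:
            for action in json.loads(entry):
                if "add" in action:
                    live.add(action["add"])
                else:
                    live.discard(action["remove"])
        return [row for path in sorted(live) for row in self.files[path]]

t = ToyDeltaTable()
t.commit({"part-0.parquet": [{"id": 1}, {"id": 2}]})                   # version 0
t.commit({"part-1.parquet": [{"id": 3}]})                              # version 1
t.commit({"part-2.parquet": [{"id": 2}]}, removes=["part-0.parquet"])  # "delete" id=1
```

Note how a delete never mutates a data file: it rewrites the affected file and records an add/remove pair in the log, which is also what makes rollbacks and audit history cheap.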
Delta Lake Key Features
ACID Transactions
- Ensures data integrity using the transaction log
Updates, Deletes, Upserts
- Perform Data Manipulation Language (DML) commands such as updates/deletes/upserts
- Important for GDPR and for improving/maintaining the data quality of existing tables
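In Delta Lake itself an upsert is a single SQL `MERGE INTO ... WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT` statement. The semantics can be sketched in plain Python over lists of dicts (a toy illustration, not Delta's implementation):

```python
def merge_upsert(target, updates, key="id"):
    """Toy MERGE/upsert: rows whose key matches an existing row are
    updated in place; unmatched rows are inserted as new rows."""
    by_key = {row[key]: dict(row) for row in target}
    for row in updates:
        # merge update columns over the existing row, or insert if new
        by_key[row[key]] = {**by_key.get(row[key], {}), **row}
    return [by_key[k] for k in sorted(by_key)]

table = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
changes = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
merged = merge_upsert(table, changes)  # id=2 updated, id=3 inserted
```

A GDPR erasure request is the same mechanism in reverse: a `DELETE FROM table WHERE user_id = ...` that rewrites only the affected files.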
Advanced Metadata handling
- Enhanced capturing and utilization of metadata
- Improves query performance compared to standard Parquet
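One concrete way that metadata speeds up queries is data skipping: Delta Lake records per-file column statistics (such as min/max values) in the transaction log, so the engine can prune files before reading any Parquet data. A minimal sketch of the pruning rule, with made-up file names and stats:

```python
def files_to_scan(file_stats, lo, hi):
    """Return only the files whose [min, max] value range can overlap
    the query range [lo, hi]; all other files are skipped entirely."""
    return [f for f, (fmin, fmax) in file_stats.items()
            if not (fmax < lo or fmin > hi)]

# Hypothetical per-file min/max stats for one column:
stats = {"part-0": (1, 100), "part-1": (101, 200), "part-2": (201, 300)}
needed = files_to_scan(stats, 150, 160)  # only part-1 must be read
```

Plain Parquet stores similar statistics inside each file's footer, but the reader must still open every file to see them; keeping the stats centrally in the log is what makes the pruning cheap.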
Schema Enforcement & Evolution
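Schema enforcement means writes that don't match the table's schema are rejected rather than silently corrupting it; schema evolution (e.g. Delta's `mergeSchema` option) lets you opt in to intentional changes such as new columns. A toy sketch of that behavior, using column names only (an illustration, not Delta's actual validation logic):

```python
def write_with_schema(table_schema, rows, merge_schema=False):
    """Toy schema enforcement: reject rows with columns outside the table
    schema. With merge_schema=True, new columns are instead absorbed into
    the schema (evolution). Returns the resulting schema, sorted."""
    schema = set(table_schema)
    for row in rows:
        extra = set(row) - schema
        if extra:
            if not merge_schema:
                raise ValueError(f"schema mismatch: unexpected columns {sorted(extra)}")
            schema |= extra  # evolve: accept and record the new columns
    return sorted(schema)

# Matching write succeeds; a new "email" column is rejected by default
# but accepted when evolution is explicitly requested:
write_with_schema(["id", "name"], [{"id": 1, "name": "a"}])
evolved = write_with_schema(["id", "name"],
                            [{"id": 2, "name": "b", "email": "b@x.com"}],
                            merge_schema=True)
```

The key design point is that evolution is opt-in per write, so accidental schema drift still fails loudly.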