Version: 2.0

🕑Estimated time for completion

This section takes about 15 minutes to complete.

Paradigm Shifts

History of Big Data - Revisited

Databases
- Single-node servers, vertical scaling only
- Mostly used for operational data and transactions

On-Prem Data Warehouses
- Horizontal scaling possible
- Used massively parallel processing (MPP) to run large queries
- SQL-only interface, low interoperability

On-Prem Data Lakes
- Hadoop ecosystem
- Data processing via MapReduce, Hive, Spark, etc.
- Data governance and data integrity are difficult to enforce
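
To make the MapReduce model named above concrete, here is a minimal single-process sketch of its three steps (map, shuffle, reduce) applied to a word count. It is only an illustration in plain Python, not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(record: str) -> Iterator[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group emitted values by key, as a framework would between the two phases."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(records: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory dataset."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "data lake", "big data lake"])
print(counts)  # {'big': 2, 'data': 3, 'lake': 2}
```

Because each record is mapped independently and each key is reduced independently, both steps can be spread across machines, which is what let the Hadoop ecosystem scale horizontally.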

Pros:

Cons

Cloud Data Lakes (e.g. ADLS)

Pros

Cons

- Can be susceptible to poor data quality/integrity
- Not as fast as a database/data warehouse for interactive needs (e.g. dashboards with several drilldowns or complex queries)

...enter the Data Lakehouse

Databricks

Webpage - Demo Hub
Photon engine: a native query engine written in C++ that takes advantage of modern hardware
Delta Lake: a scalable, efficient open format on the lake, with ACID guarantees and data reliability
Databricks SQL:
- Uses Delta Lake for data integrity
- Uses Photon for fast, interactive queries
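
One way to picture how ACID guarantees can sit on top of plain files in a lake is an ordered commit log with optimistic concurrency, which is the general idea behind Delta Lake's transaction log. The sketch below is a toy under stated assumptions and not the Delta Lake implementation or API: the class `ToyTableLog`, the `_toy_log` directory, and the file names are all hypothetical. Each commit is a numbered JSON file, and exclusive file creation ensures only one writer can claim a given version.

```python
import json
import os
import tempfile


class ToyTableLog:
    """Toy ordered commit log (hypothetical; not the real Delta Lake format or API)."""

    def __init__(self, table_dir: str) -> None:
        self.log_dir = os.path.join(table_dir, "_toy_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def latest_version(self) -> int:
        """Return the highest committed version, or -1 for an empty table."""
        versions = [
            int(name.split(".")[0])
            for name in os.listdir(self.log_dir)
            if name.endswith(".json")
        ]
        return max(versions, default=-1)

    def commit(self, read_version: int, added_files: list[str]) -> int:
        """Commit as read_version + 1; raise FileExistsError if another writer won."""
        new_version = read_version + 1
        path = os.path.join(self.log_dir, f"{new_version:020d}.json")
        # Mode "x" creates the file exclusively, so two writers cannot both
        # claim the same version. A real system must also make sure readers
        # never observe a partially written commit; that is omitted here.
        with open(path, "x", encoding="utf-8") as fh:
            json.dump({"add": added_files}, fh)
        return new_version

    def snapshot(self) -> list[str]:
        """Replay the commits in version order to list the table's data files."""
        files: list[str] = []
        for name in sorted(os.listdir(self.log_dir)):
            if name.endswith(".json"):
                with open(os.path.join(self.log_dir, name), encoding="utf-8") as fh:
                    files.extend(json.load(fh)["add"])
        return files


with tempfile.TemporaryDirectory() as table_dir:
    log = ToyTableLog(table_dir)
    seen = log.latest_version()  # both writers read the empty table (-1)
    log.commit(seen, ["part-0000.parquet"])
    try:
        log.commit(seen, ["part-0001.parquet"])  # stale writer loses the race
    except FileExistsError:
        # Re-read the latest version and retry on top of it.
        log.commit(log.latest_version(), ["part-0001.parquet"])
    print(log.snapshot())  # ['part-0000.parquet', 'part-0001.parquet']
```

Readers only ever see whole commits in a fixed order, so a failed or conflicting write never leaves the table half-updated, which is the property plain data lakes lacked.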

Dremio

Webpage
Uses Apache Arrow for scalable data transfer and serialization
Informally, think of it as a next-gen Presto/Athena (a SQL engine for querying the lake directly)

The Cloud Providers
- AWS: Glue Ecosystem, Athena, Lake Formation, etc.
- Azure: Azure Databricks, Azure Synapse Ecosystem, etc.

Big Data technologies and architectures such as the Lakehouse are rapidly maturing to support the needs of Data Mesh.