Data Milky Way: A Brief History (Part 2) - Evolution
1980s: Databases & Data Warehouses
From the 1980s up until the mid-to-late 2000s, businesses stored both OLTP and OLAP data in relational databases (RDBMS)
e.g. Oracle, Microsoft SQL Server, MySQL, PostgreSQL, etc.
OLAP-oriented databases for storing large amounts of data were called Data Warehouses
Relational databases were notoriously difficult to distribute/scale out
As more data came in, businesses often opted to scale up 💰 (scale up vs scale out)
- Greater investment risk, difficult to plan capacity, expensive
- Restrictive limits to CPU & memory for a single server
- Nowadays, modern Massively Parallel Processing (MPP) warehouses can provide scale out capabilities
However, Data Warehouses still require careful upfront planning
- Schema and layout need to be decided upfront
- Query patterns/use-cases need to be anticipated in advance
- Storage and compute are typically coupled, which is often a weakness (modern warehouses such as Google BigQuery and Snowflake, which separate the two, are exceptions)
Data Warehouses
Pros
- Good for Business Intelligence (BI) applications (structured data, so long as it isn't too massive)
Cons
- Limited support for advanced analytics & machine learning workloads
- Limited support for non-tabular, unstructured data (e.g. free text, images)
- Proprietary systems with only a SQL interface
Because of the limitations around flexibility and scaling, a new technology emerged in the early 2000s.
Early 2000s: Hadoop Hype Train
Arrival of On-Prem Data Lakes: in 2004, Google published the MapReduce whitepaper, inspiring the Apache Hadoop project
- enabled (on-prem) distributed data processing on commodity (cheap) hardware (see the word-count sketch after this list for a feel of the programming model)
- businesses began throwing data into their Hadoop clusters!
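To get a feel for the MapReduce programming model that analysts were suddenly asked to learn, here is a minimal word-count sketch, simulated locally in plain Python rather than on a real Hadoop cluster; the sample input lines are made up for illustration.

```python
# Minimal sketch of the MapReduce programming model (word count),
# simulated locally in Python; a real Hadoop job would run the map and
# reduce functions across a cluster (e.g. via Hadoop Streaming).
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.strip().split():
            yield (word.lower(), 1)

def reducer(pairs):
    # Shuffle/sort, then reduce phase: sum the counts for each word.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog the fox"]  # made-up input
    for word, count in reducer(mapper(sample)):
        print(f"{word}\t{count}")
```

The same aggregation is a one-line GROUP BY in SQL, which is exactly the gap the excerpts below describe.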
The Hadoop Disillusionment
Several years later, people became disillusioned about Hadoop:
Working with distributed computing comes with a steep learning curve, especially for business analysts: instead of writing SQL, they'd have to learn Java libraries and frameworks built around MapReduce.
Excerpt:
Hadoop is great if you're a data scientist who knows how to code in MapReduce or Pig, Johnson says, but as you go higher up the stack, the abstraction layers have mostly failed to deliver on the promise of enabling business analysts to get at the data.
In addition to the learning curve, the performance was surprisingly slow for BI users. They had expected performance gains compared to the warehouse, but in fact queries often ran a lot slower on Hadoop than on traditional data warehouses.
Excerpt:
"At the Hive layer, it's kind of OK. But people think they're going to use Hadoop for data warehouse...are pretty surprised that this hot new technology is 10x slower than what they're using before," Johnson says.
This disillusionment led to the sunsetting of Hadoop projects, including at Google, the inventors of the entire ecosystem (as noted by Google's Senior Vice President of Technical Infrastructure).
While data locality + coupling storage and compute in Hadoop clusters was a decent idea for data throughput…
- Businesses were forced to increase both CPU & disk when they often needed to scale only one or the other
- You’d have to pay for more CPU just to store inactive, rarely-utilized data, what a waste!
- Storing and replicating data on HDFS (Hadoop Distributed File System) was expensive and difficult to maintain
- Query performance was lackluster and other beneficial properties of RDBMS were gone
Latency
Relying only on disk and network I/O (i.e. MapReduce) is a latency nightmare. Let's get a feel for it relative to "humanized" time.
Optional: Latency in Big Data (demonstrates the massive impact of using memory vs disk vs network in Big Data)
- Gives you a feel for the latency nightmare when only using disk and network (i.e. MapReduce)
- Hopefully demonstrates the orders of magnitude in performance gain with Spark!
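To make the "humanized" comparison concrete, here's a small sketch that rescales a few commonly cited ballpark latency figures (actual numbers vary by hardware and source) so that a ~100 ns main-memory reference corresponds to roughly one second of human time:

```python
# Rescale rough, commonly cited latency figures to "humanized" time,
# where a ~100 ns main-memory reference becomes ~1 second.
# The numbers are ballpark figures for illustration, not benchmarks.
latencies_ns = {
    "Main memory reference": 100,
    "Read 1 MB sequentially from memory": 250_000,
    "Round trip within the same datacenter": 500_000,
    "Read 1 MB sequentially from disk": 20_000_000,
    "Network round trip, US coast to coast": 150_000_000,
}

SCALE = 1 / 100  # 100 ns of real time -> 1 "human" second

def humanize(seconds: float) -> str:
    # Render a duration in the largest sensible unit.
    for unit, size in [("days", 86_400), ("hours", 3_600), ("minutes", 60)]:
        if seconds >= size:
            return f"{seconds / size:,.1f} {unit}"
    return f"{seconds:,.1f} seconds"

for name, ns in latencies_ns.items():
    print(f"{name:40s} {humanize(ns * SCALE)}")
```

On this scale, serving intermediate results from memory costs seconds to minutes, while every trip to disk or across the network costs hours to days, which is why disk-only MapReduce pipelines felt so sluggish.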
Cloud Revolution
Cloud Revolution: Why object storage wins over Hadoop-based storage
- To scale cost-effectively, we need to really separate compute and storage
- e.g. simply provision more CPU-intensive clusters only when needed, while leaving storage the same
- As analytics and AI began to involve images, audio, unstructured data:
- Cloud Data Lakes (often based on object storage) became the ideal storage solution
Time for unified analytics/query engines such as Spark and Presto to shine 💫
- Spark & Presto Both engines excel when running analytical queries against data stored on Object Storage (e.g. Amazon S3, Azure Blob Storage) Both engines take advantage of both memory and disk (unlike Hadoop MapReduce which read/writes data via disk only)
Spark (much more on this later) is extremely popular for programmatic (Python/Scala/Java/R) use-cases but can also support SQL queries.
Presto is a popular choice for ad-hoc interactive SQL queries (fun fact: AWS Athena is a serverless offering based on Presto)
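As a concrete illustration, here's a minimal PySpark sketch that queries Parquet files sitting directly in object storage with SQL; the bucket/path and column names are hypothetical, and the S3 connector plus credentials are assumed to already be configured on the cluster.

```python
# Minimal PySpark sketch: run SQL directly against Parquet files in
# object storage (e.g. Amazon S3), with no database load step required.
# The bucket, path, and column names are hypothetical; the hadoop-aws
# connector and credentials are assumed to be configured separately.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-adhoc-query").getOrCreate()

events = spark.read.parquet("s3a://my-data-lake/events/")  # hypothetical path
events.createOrReplaceTempView("events")

daily_counts = spark.sql("""
    SELECT event_date, event_type, COUNT(*) AS n_events
    FROM events
    GROUP BY event_date, event_type
    ORDER BY event_date
""")
daily_counts.show()
```

A Presto or Athena user would express essentially the same query in SQL against the same files, typically via an external table definition rather than a SparkSession.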
Unified Analytics Engines
Common Misconceptions
- “Spark & Presto are NoSQL databases/data stores”
- They’re not. But they can read/write from them 😀
- “Spark or Presto can replace all my databases”
- They can’t. They’re analytics/query engines, not transactional data stores
- “Spark is an ‘In-Memory’ technology”
- PostgreSQL/MySQL also cache data in RAM to speed up queries...but would you call them ‘In-Memory’ technologies? 😅 (see the sketch after this list)
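To make that last point concrete, here's a small PySpark sketch (with a hypothetical DataFrame) showing that Spark is memory-hungry but not purely "in-memory": persisted data can spill to local disk when it doesn't fit in RAM.

```python
# Sketch: Spark caches aggressively in memory, but it is not a purely
# "in-memory" engine; persisted partitions can spill to local disk,
# and shuffle data is written to disk as well.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("not-just-in-memory").getOrCreate()

df = spark.range(10_000_000)  # hypothetical "large" DataFrame

# MEMORY_AND_DISK: partitions that don't fit in memory are written to
# local disk instead of failing (a spill-friendly level that is also
# the usual default for DataFrame .cache()).
df.persist(StorageLevel.MEMORY_AND_DISK)

# Any action materializes (and caches) the data.
print(df.groupBy((df.id % 10).alias("bucket")).count().collect())
```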
Many companies (especially tech giants) often have both Spark and Presto/Athena in their stack
Mission Accomplished?
While being able to directly query your cloud object storage (e.g. S3, Azure Data Lake Storage) with big data engines such as Spark and Presto moved the field in a great direction, some missing pieces still remained
2000s - 2010s: Traditional Data Lakes
Pros
- Extremely cheap and scalable
- Open, arbitrary data formats and big ecosystem
- Supports ML
Cons
- Some BI workloads still not snappy enough
- Complex data quality problems
- No data management layer
- Difficult for GDPR compliance
Can we get the best of both worlds?
One approach: Data Lakehouse (we’ll revisit this - keep it in mind!)
If you’ve ever heard about Delta Lake, Dremio, or Databricks Photon, stay tuned for the next lessons! We’ll also cover how the Lakehouse can serve as the technological foundation to support philosophies such as Data Mesh
- Data Mesh = Organizational & Operating Principles
- Data Lakehouse = Storage, Compute, Data Management, Governance layers