🕑 Estimated time for completion

This section takes about 2h20m to complete.

Data Milky Way: A Brief History (Part 2) - Evolution

1980s: Databases & Data Warehouses​

From the 1980s until the mid-to-late 2000s, businesses stored both OLTP and OLAP data in relational databases (RDBMS)

  • e.g. Oracle, Microsoft SQL Server, MySQL, PostgreSQL, etc.

  • OLAP-oriented databases for storing large amounts of data were called Data Warehouses

  • Relational databases were notoriously difficult to distribute/scale out

  • As more data came in, businesses often opted to scale up 💰 (scale up vs scale out)

    • Greater investment risk, difficult to plan capacity, expensive
    • Restrictive limits to CPU & memory for a single server
    • Nowadays, modern Massively Parallel Processing (MPP) warehouses can provide scale out capabilities

However, Data Warehouses still require careful upfront planning

  • Schema and layout need to be decided beforehand
  • Query patterns and use-cases need to be anticipated
  • Separation of storage and compute is often a weak point (except perhaps in modern warehouses such as Google BigQuery and Snowflake)

scaling.png

Data Warehouses​

data-warehouses.png

Pros

  • Good for Business Intelligence (BI) applications (structured data, so long as it isn't too massive)

Cons

  • Limited support for advanced analytics & machine learning workloads
  • Limited support for non-tabular, unstructured data (e.g. free text, images)
  • Proprietary systems with only a SQL interface

Because of the limitations around flexibility and scaling, a new technology emerged in the early 2000s.

Early 2000s: Hadoop Hype Train​

Arrival of On-Prem Data Lakes: in 2004, Google published the MapReduce whitepaper, inspiring the Apache Hadoop project (a minimal sketch of the MapReduce programming model follows the list below)

  • enabled (on-prem) distributed data processing on commodity (cheap) hardware
  • businesses began throwing data into their Hadoop clusters!
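As a rough, single-process sketch (plain Python, no Hadoop required), the classic word count illustrates the map, shuffle, and reduce phases described in the whitepaper. In a real Hadoop job these phases run distributed across a cluster, with the shuffle moving data over disk and network.

```python
from collections import defaultdict

# Map phase: emit (word, 1) for every word in every input line.
def map_phase(lines):
    for line in lines:
        for word in line.lower().split():
            yield word, 1

# Shuffle phase: group all intermediate values by key.
# In Hadoop, this is the step that moves data between nodes (disk + network).
def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate the values for each key.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

if __name__ == "__main__":
    documents = ["the quick brown fox", "the lazy dog", "the fox"]
    counts = reduce_phase(shuffle_phase(map_phase(documents)))
    print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```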

joy-of-hadoop.png

hadoop-network.png

Bonus content: What you need to know about Hadoop (Very detailed and not required for the remainder of the course!)

The Hadoop Disillusionment​

Several years later, people became disillusioned about Hadoop:

hadoop-has-failed-us.png

Working with distributed computing comes with a steep learning curve, especially for business analysts: instead of working with SQL, they'd have to learn Java libraries and frameworks related to MapReduce.

Excerpt:


Hadoop is great if you're a data scientist who knows how to code in MapReduce or Pig, Johnson says, but as you go higher up the stack, the abstraction layers have mostly failed to deliver on the promise of enabling business analysts to get at the data.

In addition to the learning curve, performance was surprisingly poor for BI users. They expected performance gains compared to the warehouse, but in practice queries often ran much slower on Hadoop than on traditional data warehouses.

Excerpt:


"At the Hive layer, it's kind of OK. But people think they're going to use Hadoop for data warehouse...are pretty surprised that this hot new technology is 10x slower than what they're using before," Johnson says.

This disillusionment led to the sunsetting of Hadoop projects, including at Google (who were the inventors of the entire ecosystem).

urs-hoelzle.png

Senior Vice President of Technical Infrastructure, Google

While data locality + coupling storage and compute in Hadoop clusters was a decent idea for data throughput…

  • Businesses were forced to increase both CPU & disk when they often needed to scale just one or the other
  • You’d have to pay for more CPU just to store inactive, rarely-used data. What a waste!
  • Storing and replicating data on HDFS (Hadoop Distributed File System) was expensive and difficult to maintain
  • Query performance was lackluster and other beneficial properties of RDBMS were gone

Latency​

Relying only on disk and network (i.e. MapReduce) is a latency nightmare. Let's get a feel for it by rescaling each latency to "humanized" time.

| Type | Time | Time (humanized form) |
| --- | --- | --- |
| 1 CPU Cycle | 0.3 ns | 1 s |
| Level 1 Cache Access | 0.9 ns | 3 s |
| Level 2 Cache Access | 2.8 ns | 9 s |
| Level 3 Cache Access | 12.9 ns | 43 s |
| Main Memory Access | 120 ns | 6 min |
| Solid-state disk I/O | 50 - 150 µs | 2 - 6 days |
| Rotational disk I/O | 1 - 10 ms | 1 - 12 months |
| Internet: SF to NYC | 40 ms | 4 years |
| Internet: SF to Australia | 183 ms | 19 years |
| OS Virtualization reboot | 4 s | 423 years |
| SCSI command time-out | 30 s | 3,000 years |
| Hardware Virtualization reboot | 40 s | 4,000 years |
| Physical System reboot | 5 min | 32 millennia |
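The "humanized" column is obtained by rescaling every latency so that one CPU cycle (0.3 ns) corresponds to one second. A few lines of Python (using illustrative midpoints for the ranged entries) reproduce the conversion:

```python
# Rescale real latencies so that one CPU cycle (0.3 ns) maps to one second.
SCALE = 1.0 / 0.3e-9  # ≈ 3.3 billion

latencies_seconds = {
    "1 CPU cycle": 0.3e-9,
    "Main memory access": 120e-9,
    "Solid-state disk I/O": 100e-6,      # midpoint of 50 - 150 µs
    "Rotational disk I/O": 5e-3,         # midpoint of 1 - 10 ms
    "Internet: SF to Australia": 183e-3,
}

for name, seconds in latencies_seconds.items():
    humanized = seconds * SCALE          # seconds on the "humanized" scale
    days = humanized / 86_400
    print(f"{name:>26}: {humanized:,.0f} humanized seconds (~{days:,.1f} days)")
```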

Latency matters in Hadoop​

Hadoop is all about fault tolerance and the simple API of map and reduce steps. But all that fault tolerance comes at a steep cost:

  1. To recover from potential failures, Hadoop MapReduce writes intermediate data between Map and Reduce steps to the disk for every job.
  2. The MapReduce implementation in Hadoop, by design, shuffles the data between map and reduce steps, which is expensive.

Looking at the table above, we know that disk writes and network communication are slow and expensive, and naturally increase latency. Because of this inherent behavior of MapReduce (constantly transmitting and persisting data), attention must be paid to how much data gets transmitted in order to manage both the speed and the cost of queries.

Spark is designed to reduce latency

Spark provides fault tolerance like Hadoop, but with significantly reduced latency, in the following ways:

  1. It keeps all data immutable and in memory, avoiding intermediate disk writes.
  2. It provides Scala-like chains of operations (called transformations) and keeps track of them in a lightweight client-side process called the Spark Driver.
  3. In the event of a node failure, it finds a new node, copies the partition data to that node, and simply replays the tracked operations from step 2.

Additionally, Spark provides rich APIs in Scala, Python, Java, and R. Lastly, unlike Hadoop, it makes the data shuffle optional: data is only shuffled when certain operations (formally called wide transformations) such as joins or groupBy require it, as the sketch below illustrates.
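As a minimal sketch of these ideas (assuming a local pyspark installation; the dataset and column names are made up), the job below chains a narrow and a wide transformation. Nothing runs until the final action, and only the groupBy forces a shuffle:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("latency-demo").getOrCreate()

# A small in-memory DataFrame standing in for a real dataset.
df = spark.createDataFrame(
    [("page_view", 3), ("click", 1), ("page_view", 7), ("click", 2)],
    ["event_type", "duration"],
)

# Narrow transformation: each output partition depends on a single input
# partition, so no shuffle is needed.
filtered = df.filter(F.col("duration") > 1)

# Wide transformation: groupBy requires a shuffle, because rows with the
# same key must be brought together across partitions.
aggregated = filtered.groupBy("event_type").agg(F.sum("duration").alias("total"))

# Nothing has executed yet -- the Spark Driver has only recorded the lineage
# of transformations. The action below triggers the (mostly in-memory) work.
aggregated.show()

spark.stop()
```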

Optional content: a 24-minute video version of Latency in Big Data (demonstrating the massive impact of using memory vs. disk vs. network in Big Data).

Cloud Revolution​

Cloud Revolution: Why object storage wins over Hadoop-based storage

  • To scale cost-effectively, we need to really separate compute and storage
    • e.g. simply provision more CPU-intensive clusters only when needed, while leaving storage the same
  • As analytics and AI began to involve images, audio, unstructured data:
    • Cloud Data Lakes (often based on object storage) became the ideal storage solution

Time for unified analytics/query engines such as Spark and Presto to shine 💫

  • Spark & Presto
    • Both engines excel when running analytical queries against data stored on Object Storage (e.g. Amazon S3, Azure Blob Storage)
    • Both engines take advantage of both memory and disk (unlike Hadoop MapReduce, which reads/writes data via disk only)

Spark (much more on this later) is extremely popular for programmatic (Python/Scala/Java/R) use-cases but can also support SQL queries.
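For instance, a hypothetical PySpark job could query Parquet files sitting directly in object storage. The bucket, path, and column names below are invented, and the S3 connector and credentials would need to be configured separately:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("object-storage-demo").getOrCreate()

# Read Parquet files directly from object storage (hypothetical bucket/path).
events = spark.read.parquet("s3a://example-data-lake/events/")

# Programmatic (DataFrame) API ...
daily_counts = (
    events
    .withColumn("day", F.to_date("event_time"))
    .groupBy("day")
    .count()
)
daily_counts.show()

# ... and the same engine also answers plain SQL.
events.createOrReplaceTempView("events")
spark.sql("SELECT count(*) AS n FROM events").show()
```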

Presto is a popular choice for ad-hoc interactive SQL queries (e.g. AWS Athena is a serverless offering based on Presto).

Many companies (especially tech giants) often have both Spark and Presto/Athena in their stack:

presto-spark-netflix-platform.png

Unified Analytics Engines​

Both Spark and Presto can be considered Unified Analytics Engines, meaning they can cover most of your data needs, from ingestion from different sources, through transformation, to visualisation. What's special is that this multi-purpose approach lets you adjust the technology to your needs (and the arrangement of your data), and not vice versa.

spark.png presto-cluster.png

Mission Accomplished?​

While being able to directly query your cloud object storage (e.g. S3, Azure Data Lake Storage) with big data engines such as Spark and Presto moved the field in a great direction, some missing pieces still remained. Let's have a look at the setup of traditional data lakes versus more versatile solutions.

2000s - 2010s: Traditional Data Lakes​

traditional-data-lakes.png

Pros

  • Extremely cheap and scalable
  • Open, arbitrary data formats and big ecosystem
  • Supports ML

Cons

  • Some BI workloads still not snappy enough
  • Complex data quality problems
  • No data management layer
  • Difficult for GDPR compliance

Can we get the best of both worlds?​

The goal in this space is to get rid of the shortcomings listed above. We want to keep using cheap, scalable computing resources and a wide ecosystem while being able to tackle more advanced problems, opening up data work to the entire company while maintaining appropriate data security and access policies.

One approach is the so-called Data Lakehouse which we’ll revisit in the second week of the course.

data-lakehouse.png

Other challengers in this area are Delta Lake, Dremio, and Databricks Photon. Near the end of this course, we’ll also cover how the Lakehouse can serve as the technological foundation for philosophies such as Data Mesh.