Data Milky Way: A Brief History (Part 2) - Evolution
1980s: Databases & Data Warehouses
From the 1980s up until the mid-to-late 2000s, businesses stored both OLTP and OLAP data in relational databases (RDBMS)
e.g. Oracle, Microsoft SQL Server, MySQL, PostgreSQL, etc.
OLAP-oriented databases for storing large amounts of data were called Data Warehouses
Relational databases were notoriously difficult to distribute/scale out
As more data came in, businesses often opted to scale up 💰 (scale up vs scale out)
- Greater investment risk, difficult to plan capacity, expensive
- Restrictive limits to CPU & memory for a single server
- Nowadays, modern Massively Parallel Processing (MPP) warehouses can provide scale out capabilities
However, Data Warehouses still require careful upfront planning
- Schema and layout need to be decided upfront
- Query patterns/use-cases need to be anticipated in advance
- Storage and compute are typically coupled, which is often a weakness (modern warehouses such as Google BigQuery and Snowflake, which separate the two, are exceptions)
Data Warehouses
Pros
- Good for Business Intelligence (BI) applications (structured data, so long as it isn't too massive)
Cons
- Limited support for advanced analytics & machine learning workloads
- Limited support for non-tabular, unstructured data (e.g. free text, images)
- Proprietary systems with only a SQL interface
Because of the limitations around flexibility and scaling, a new technology emerged in the early 2000s.
Early 2000s: Hadoop Hype Train
Arrival of On-Prem Data Lakes: in 2004, Google published the MapReduce whitepaper, inspiring the Apache Hadoop project
- enabled (on-prem) distributed data processing on commodity (cheap) hardware (see the word-count sketch after this list for a feel of the programming model)
- businesses began throwing data into their Hadoop clusters!
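To get a feel for the MapReduce programming model that analysts were suddenly asked to learn, here is a minimal word-count sketch, simulated locally in plain Python rather than on a real Hadoop cluster; the sample input lines are made up for illustration.

```python
# Minimal sketch of the MapReduce programming model (word count),
# simulated locally in Python; a real Hadoop job would run the map and
# reduce functions across a cluster (e.g. via Hadoop Streaming).
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.strip().split():
            yield (word.lower(), 1)

def reducer(pairs):
    # Shuffle/sort, then reduce phase: sum the counts for each word.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog the fox"]  # made-up input
    for word, count in reducer(mapper(sample)):
        print(f"{word}\t{count}")
```

The same aggregation is a one-line GROUP BY in SQL, which is exactly the gap the excerpts below describe.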
The Hadoop Disillusionment
Several years later, people became disillusioned about Hadoop:
Working with distributed computing comes with a steep learning curve, especially for business analysts: instead of writing SQL, they'd have to learn Java libraries and frameworks built around MapReduce.
Excerpt:
Hadoop is great if you're a data scientist who knows how to code in MapReduce or Pig, Johnson says, but as you go higher up the stack, the abstraction layers have mostly failed to deliver on the promise of enabling business analysts to get at the data.
In addition to the learning curve, the performance was surprisingly slow for BI users. They had expected performance gains compared to the warehouse, but in fact queries often ran a lot slower on Hadoop than on traditional data warehouses.
Excerpt:
"At the Hive layer, it's kind of OK. But people think they're going to use Hadoop for data warehouse...are pretty surprised that this hot new technology is 10x slower than what they're using before," Johnson says.
This disillusionment led to the sunsetting of Hadoop projects, including at Google, the inventors of the entire ecosystem (as noted by Google's Senior Vice President of Technical Infrastructure).
While data locality + coupling storage and compute in Hadoop clusters was a decent idea for data throughput…
- Businesses were forced to increase both CPU & disk when they often needed to scale only one or the other
- You’d have to pay for more CPU just to store inactive, rarely-utilized data, what a waste!
- Storing and replicating data on HDFS (Hadoop Distributed File System) was expensive and difficult to maintain
- Query performance was lackluster and other beneficial properties of RDBMS were gone
Latency
Relying only on disk and network I/O (i.e. MapReduce) is a latency nightmare. Let's get a feel for it relative to "humanized" time.
Optional: Latency in Big Data (demonstrates the massive impact of using memory vs disk vs network in Big Data)
- Gives you a feel for the latency nightmare when only using disk and network (i.e. MapReduce)
- Hopefully demonstrates the orders of magnitude in performance gain with Spark!
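To make the "humanized" comparison concrete, here's a small sketch that rescales a few commonly cited ballpark latency figures (actual numbers vary by hardware and source) so that a ~100 ns main-memory reference corresponds to roughly one second of human time:

```python
# Rescale rough, commonly cited latency figures to "humanized" time,
# where a ~100 ns main-memory reference becomes ~1 second.
# The numbers are ballpark figures for illustration, not benchmarks.
latencies_ns = {
    "Main memory reference": 100,
    "Read 1 MB sequentially from memory": 250_000,
    "Round trip within the same datacenter": 500_000,
    "Read 1 MB sequentially from disk": 20_000_000,
    "Network round trip, US coast to coast": 150_000_000,
}

SCALE = 1 / 100  # 100 ns of real time -> 1 "human" second

def humanize(seconds: float) -> str:
    # Render a duration in the largest sensible unit.
    for unit, size in [("days", 86_400), ("hours", 3_600), ("minutes", 60)]:
        if seconds >= size:
            return f"{seconds / size:,.1f} {unit}"
    return f"{seconds:,.1f} seconds"

for name, ns in latencies_ns.items():
    print(f"{name:40s} {humanize(ns * SCALE)}")
```

On this scale, serving intermediate results from memory costs seconds to minutes, while every trip to disk or across the network costs hours to days, which is why disk-only MapReduce pipelines felt so sluggish.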
Cloud Revolution
Cloud Revolution: Why object storage wins over Hadoop-based storage
- To scale cost-effectively, we need to really separate compute and storage
- e.g. simply provision more CPU-intensive clusters only when needed, while leaving storage the same
- As analytics and AI began to involve images, audio, unstructured data:
- Cloud Data Lakes (often based on object storage) became the ideal storage solution
Time for unified analytics/query engines such as Spark and Presto to shine 💫
- Spark & Presto Both engines excel when running analytical queries against data stored on Object Storage (e.g. Amazon S3, Azure Blob Storage) Both engines take advantage of both memory and disk (unlike Hadoop MapReduce which read/writes data via disk only)
Spark (much more on this later) is extremely popular for programmatic (Python/Scala/Java/R) use-cases but can also support SQL queries.
Presto is a popular choice for ad-hoc interactive SQL queries (fun fact: AWS Athena is a serverless offering based on Presto)
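As a concrete illustration, here's a minimal PySpark sketch that queries Parquet files sitting directly in object storage with SQL; the bucket/path and column names are hypothetical, and the S3 connector plus credentials are assumed to already be configured on the cluster.

```python
# Minimal PySpark sketch: run SQL directly against Parquet files in
# object storage (e.g. Amazon S3), with no database load step required.
# The bucket, path, and column names are hypothetical; the hadoop-aws
# connector and credentials are assumed to be configured separately.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-adhoc-query").getOrCreate()

events = spark.read.parquet("s3a://my-data-lake/events/")  # hypothetical path
events.createOrReplaceTempView("events")

daily_counts = spark.sql("""
    SELECT event_date, event_type, COUNT(*) AS n_events
    FROM events
    GROUP BY event_date, event_type
    ORDER BY event_date
""")
daily_counts.show()
```

A Presto or Athena user would express essentially the same query in SQL against the same files, typically via an external table definition rather than a SparkSession.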
Unified Analytics Engines
Common Misconceptions
- “Spark & Presto are NoSQL databases/data stores”
- They’re not. But they can read/write from them 😀
- “Spark or Presto can replace all my databases”
- They can’t. They’re analytics/query engines, not transactional data stores
- “Spark is an ‘In-Memory’ technology”
- PostgreSQL/MySQL also cache data in RAM to speed up queries...but would you call them ‘In-Memory’ technologies? 😅 (see the sketch after this list)
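To make that last point concrete, here's a small PySpark sketch (with a hypothetical DataFrame) showing that Spark is memory-hungry but not purely "in-memory": persisted data can spill to local disk when it doesn't fit in RAM.

```python
# Sketch: Spark caches aggressively in memory, but it is not a purely
# "in-memory" engine; persisted partitions can spill to local disk,
# and shuffle data is written to disk as well.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("not-just-in-memory").getOrCreate()

df = spark.range(10_000_000)  # hypothetical "large" DataFrame

# MEMORY_AND_DISK: partitions that don't fit in memory are written to
# local disk instead of failing (a spill-friendly level that is also
# the usual default for DataFrame .cache()).
df.persist(StorageLevel.MEMORY_AND_DISK)

# Any action materializes (and caches) the data.
print(df.groupBy((df.id % 10).alias("bucket")).count().collect())
```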
Many companies (especially tech giants) often have both Spark and Presto/Athena in their stack
Mission Accomplished?
While being able to directly query your cloud object storage (e.g. S3, Azure Data Lake Storage) with big data engines such as Spark and Presto moved the field in a great direction, some missing pieces still remained
2000s - 2010s: Traditional Data Lakes
Pros
- Extremely cheap and scalable
- Open, arbitrary data formats and big ecosystem
- Supports ML
Cons
- Some BI workloads still not snappy enough
- Complex data quality problems
- No data management layer
- Difficult for GDPR compliance
Can we get the best of both worlds?
One approach: Data Lakehouse (we’ll revisit this - keep it in mind!)
If you’ve ever heard about Delta Lake, Dremio, or Databricks Photon, stay tuned for the next lessons! We’ll also cover how the Lakehouse can serve as the technological foundation to support philosophies such as Data Mesh
- Data Mesh = Organizational & Operating Principles
- Data Lakehouse = Storage, Compute, Data Management, Governance layers