This section takes about 10 minutes to complete.
Quiz
What are some key differences between OLAP and OLTP systems?
OLTP
Frequent updates, transactional behaviour/requirements
Query priorities:
- low-latency, up-to-date, consistent data for standard business operations
OLAP
Oriented towards analysis, modelling, reporting, business intelligence
Query priorities:
- asking big questions against big data, supporting flexible queries/analyses
- efficiently processing millions of rows and returning thousands/millions of rows
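As a rough illustration of the two query styles, here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its columns, and the sample rows are hypothetical and exist only for this example; a real OLAP workload would run on a dedicated analytical engine over far more data, not on SQLite.

```python
import sqlite3

# Hypothetical in-memory table used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "alice", "EU", 120.0),
        (2, "bob", "US", 80.0),
        (3, "carol", "EU", 200.0),
        (4, "dave", "US", 50.0),
    ],
)

# OLTP-style: a small transactional update, then a low-latency lookup of one row by key.
with conn:  # commits on success, rolls back on error
    conn.execute("UPDATE orders SET amount = amount + 10 WHERE id = ?", (2,))
oltp_row = conn.execute(
    "SELECT customer, amount FROM orders WHERE id = ?", (2,)
).fetchone()

# OLAP-style: scan the whole table and aggregate it for reporting.
olap_rows = conn.execute(
    "SELECT region, COUNT(*), SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(oltp_row)
print(olap_rows)
```

The OLTP query touches a single row through the primary key, while the OLAP query reads every row to produce a summary.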
Why did people start moving from traditional databases and data warehouses to HDFS for Big Data Analytics in the mid-2000s?
Traditional relational databases and data warehouses relied on vertical scaling, which required expensive, specialized hardware and made capacity hard to plan for. HDFS and MapReduce allowed for horizontal scalability on cheap commodity hardware.
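The idea behind horizontal scaling with MapReduce can be sketched in a few lines of plain Python. This is a toy, single-process word count, not Hadoop itself: the partitions list stands in for data blocks spread across machines, and the function names map_partition and reduce_counts are made up for this example.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count words in one partition, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


# Each inner list stands in for a block of data stored on a different machine.
partitions = [
    ["big data big questions"],
    ["data lakes store data"],
    ["big lakes"],
]

# In a real cluster the map calls would run in parallel, one per node.
partial_counts = [map_partition(p) for p in partitions]
total = reduce(reduce_counts, partial_counts, Counter())

print(total)
```

Because each partition is processed independently, adding more data means adding more partitions and more cheap machines rather than buying a bigger server.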
And subsequently, why did people move from HDFS to Object Storage for Big Data Analytics?
Cloud-based Object Storage is highly scalable and cheap, far more so than HDFS (which these days is mostly used in on-prem environments). It is also well suited to storing all sorts of file formats (unstructured text, images, video, etc.). Most modern cloud-based Data Lakes are built on Object Storage technologies. Modern query engines such as Apache Spark, Presto, Dremio, etc. can efficiently scan data lying in object stores. To summarize: object storage + (on-demand) query engines = fully decoupled storage and compute (great for scalability, elasticity, cost).
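To make the decoupling concrete, here is a minimal sketch in which a plain Python dict stands in for an object store (keys map to object contents) and a stateless function stands in for an on-demand query engine. The keys, the CSV layout, and the function names list_objects and total_by_region are all hypothetical; a real setup would use a cloud SDK and an engine such as Spark.

```python
import csv
import io

# Stand-in for an object store: flat keys mapped to object contents.
# Any kind of file can sit side by side under different prefixes.
object_store = {
    "sales/2024/01.csv": "region,amount\nEU,100\nUS,50\n",
    "sales/2024/02.csv": "region,amount\nEU,70\nUS,30\n",
    "logs/app.txt": "unstructured text lives here too",
}


def list_objects(store, prefix):
    """List keys under a prefix, similar in spirit to listing a bucket."""
    return sorted(key for key in store if key.startswith(prefix))


def total_by_region(store, prefix):
    """Stand-in for a query engine: holds no data itself, it only scans objects."""
    totals = {}
    for key in list_objects(store, prefix):
        reader = csv.DictReader(io.StringIO(store[key]))
        for row in reader:
            totals[row["region"]] = totals.get(row["region"], 0) + int(row["amount"])
    return totals


result = total_by_region(object_store, "sales/")
print(result)
```

The compute function keeps no state between calls, so it can be started only when a query arrives and scaled independently of how much data is stored.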
Name some object storage offerings from the three major cloud providers.
AWS
- Amazon S3 (general purpose)
- Amazon S3 Glacier (for archiving)
Microsoft Azure
- Azure Blob Storage (general purpose)
- Azure Data Lake Storage Gen2 (has a hierarchical file system namespace, great for large structured and semi-structured datasets)
GCP
- Google Cloud Storage or GCS (general purpose)