This section takes about 10 minutes to complete.
Quiz
What is one challenge of working with parquet as compared to csv with respect to debugging?
When first ingesting from a data source, why do you maintain a copy of the raw data as close to the original format as possible?
When should you use ETL vs. ELT, and why?
- ETL is great for workloads that are reused by many consumers, so it makes sense to shape the data consistently sooner rather than later
- ELT is great for consumers who need some flexibility. With respect to speed, ELT can sometimes be a bit slow and wasteful, as you often have to re-process/re-transform the raw data all over again. At the end of the day, it depends on what makes the most sense for your use-case!
What’s the difference between a Join and a Union?
Hopefully the pictures/diagrams in this article provide a clear intuition. Both operations are essential knowledge!
Please avoid these classic mistakes:
- If you’re using JOIN, make sure the two tables don’t have duplicate column names before joining (other than the join keys themselves)
- If you’re using UNION, make sure that the two tables/DataFrames have identical columns and column orderings
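To make the contrast concrete, here's a minimal sketch using SQLite (bundled with Python's standard library). The `customers`/`orders` tables and their columns are invented purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (customer_id INTEGER, name TEXT);
CREATE TABLE orders    (order_id INTEGER, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders    VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# JOIN: combines COLUMNS from two tables by matching rows on a key.
joined = cur.execute("""
    SELECT o.order_id, c.name, o.amount
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    ORDER BY o.order_id
""").fetchall()
print(joined)  # 3 rows, each carrying columns from BOTH tables

# UNION ALL: stacks ROWS from two queries with identical column lists.
cur.executescript("""
CREATE TABLE orders_2023 (order_id INTEGER, amount REAL);
CREATE TABLE orders_2024 (order_id INTEGER, amount REAL);
INSERT INTO orders_2023 VALUES (1, 10.0);
INSERT INTO orders_2024 VALUES (2, 20.0);
""")
unioned = cur.execute("""
    SELECT order_id, amount FROM orders_2023
    UNION ALL
    SELECT order_id, amount FROM orders_2024
""").fetchall()
print(unioned)  # 2 rows stacked vertically, same columns throughout
```

Notice that the JOIN widens the result (columns from both tables) while the UNION lengthens it (rows stacked), which is exactly why identical column lists and orderings matter for the latter.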
What does a GROUP BY aggregation do?
What’s one scenario where using window functions is advantageous over GROUP BY aggregations?
Have a look at this example
Basically, Window functions allow you to maintain all of your original rows (without having to collapse/summarize them per group)
Of course, there are times when you’d want to aggregate instead of window as well, depends on the query/business question!
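As a minimal sketch of that intuition, here's the same aggregation done both ways in SQLite (whose window-function support ships with Python's standard library); the `sales` table is made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE sales (region TEXT, amount REAL);
INSERT INTO sales VALUES ('east', 10.0), ('east', 30.0), ('west', 5.0);
""")

# GROUP BY collapses each group down to a single summary row...
grouped = cur.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales GROUP BY region ORDER BY region
""").fetchall()
print(grouped)   # 2 rows: one per region, original rows are gone

# ...whereas a window function keeps every original row and simply
# attaches the per-group aggregate alongside it.
windowed = cur.execute("""
    SELECT region, amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales ORDER BY region, amount
""").fetchall()
print(windowed)  # 3 rows: each keeps its own amount plus the group total
```

The window version is what you reach for when you need row-level detail *and* group-level context in the same result, e.g. "each sale's share of its region's total".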
What kinds of operations often induce shuffling in your Spark job?
“Wide transformations” (a.k.a. wide dependencies), such as joins, aggregations, and window functions. The implication: they can really slow down your Spark job. Concise summary here.
What are some of the most powerful ways to optimise your Spark job?
Joins are often important and inevitable... so how might you optimize a join to minimize shuffling?
What’s a good rule of thumb for partitioning?
A partition should ideally contain somewhere between 256 MB and 1 GB of data. Too many small partitions (each containing only kilobytes) means you have lots of small files - and that’s bad!
For the small-file reason above, you generally shouldn’t partition on high-cardinality columns
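As a back-of-envelope sketch of that rule of thumb (the 512 MB target here is an assumption, picked from the middle of the 256 MB - 1 GB range; the function name is made up):

```python
import math

def suggested_partitions(total_bytes: int,
                         target_bytes: int = 512 * 1024**2) -> int:
    """Rough partition count so each partition lands in the
    ~256 MB - 1 GB sweet spot (512 MB target is an assumption)."""
    return max(1, math.ceil(total_bytes / target_bytes))

# e.g. a 100 GB dataset -> 200 partitions of ~512 MB each
print(suggested_partitions(100 * 1024**3))  # 200
```

Run the same arithmetic on a high-cardinality column to see the problem: partitioning 100 GB by, say, a hypothetical `user_id` with a million distinct values would leave each partition holding only ~100 KB, i.e. exactly the lots-of-tiny-files situation warned about above.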