🕑 Estimated time for completion

This section takes about 10 minutes to complete.

Quiz

What is one challenge of working with Parquet as compared to CSV with respect to debugging?
Parquet is a compressed binary format, so it’s not human-readable - you need software to read/print it, whereas a CSV opens in any text editor.
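For example, here's a minimal sketch of peeking inside a Parquet file with PySpark (the path `data/events.parquet` is hypothetical):

```python
# Parquet can't be read in a text editor, but a few lines of PySpark
# make it inspectable. The path below is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect-parquet").getOrCreate()

df = spark.read.parquet("data/events.parquet")
df.printSchema()  # Parquet files carry their schema with them
df.show(5)        # render the first few rows as a human-readable table
```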
When first ingesting from a data source, why do you maintain a copy of the raw data as close to the original format as possible?
  • You want to maintain your original source of truth in case of bugs/logic changes
  • You don’t want to realize months down the line that you’ve accidentally been overwriting/tampering with the raw data and that it’s no longer recoverable!
  • If (or more like when) you ever want to fix/update your data transformation logic, it’s also good to have an optimized copy of your raw data (e.g. in Parquet vs plain text) so that your re-processing doesn’t take as long!
  • It also helps with debugging to split the pipeline into an ingestion step and separate transformation steps (sketched below).
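A minimal sketch of that ingestion/transformation split in PySpark - the paths and the `status` column are made up for illustration:

```python
# A sketch of the ingestion/transformation split. Paths and the `status`
# column are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ingest-then-transform").getOrCreate()

# Step 1 - ingestion: persist the raw data essentially as-is (just converted
# to Parquet for faster re-reads). "errorifexists" refuses to overwrite an
# existing raw copy, protecting your source of truth.
raw = spark.read.option("header", True).csv("landing/orders.csv")
raw.write.mode("errorifexists").parquet("raw/orders")

# Step 2 - transformation: apply business logic on the raw copy.
# When the logic changes, only this step needs to be re-run.
orders = spark.read.parquet("raw/orders")
orders.filter(F.col("status") != "cancelled") \
      .write.mode("overwrite").parquet("clean/orders")
```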
When should you use ETL vs ELT, why?
  • ETL is great for workloads that are reused by many consumers, so it makes sense to shape the data consistently sooner rather than later
  • ELT is great for consumers who need some flexibility. With respect to speed, ELT can sometimes be a bit slow and wasteful, as you often have to re-process/re-transform the raw data all over again.
  • At the end of the day, it depends on what makes the most sense for your use-case!
What’s the difference between a Join and a Union?

Hopefully the pictures/diagrams in this article provide a clear intuition. Both operations are essential knowledge!

Please avoid these classic mistakes (both are illustrated in the sketch after this list):

  1. If you’re using JOIN, make sure that the two tables don’t have any duplicate column names before joining (other than the join keys themselves)
  2. If you’re using UNION, make sure that the two tables/DataFrames have identical columns and column orderings
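Here's a small PySpark sketch of guarding against both mistakes, using made-up toy tables. Note that `unionByName` matches columns by name rather than by position, which sidesteps the column-ordering pitfall:

```python
# Toy tables illustrating both pitfalls.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-vs-union").getOrCreate()

users  = spark.createDataFrame([(1, "Ada"), (2, "Bob")], ["id", "name"])
orders = spark.createDataFrame([(1, "book"), (2, "pen")], ["id", "item"])

# JOIN: apart from the join key `id`, the two tables share no column
# names, so the result has no ambiguous columns.
users.join(orders, on="id", how="inner").show()

# UNION: unionByName matches columns by name instead of by position,
# so a different column ordering can't silently corrupt the result.
more_users = spark.createDataFrame([("Cleo", 3)], ["name", "id"])
users.unionByName(more_users).show()
```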
What does a GROUP BY aggregation do?
In standard SQL, it aggregates rows that share the same grouping key into a single summary row
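For instance, a minimal PySpark sketch with a toy `sales` table - three input rows collapse into one summary row per country:

```python
# Toy example: three input rows collapse into one summary row per country.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("groupby-demo").getOrCreate()

sales = spark.createDataFrame(
    [("uk", 10), ("uk", 20), ("us", 5)], ["country", "amount"]
)

# Result (row order may vary): uk -> 30, us -> 5
sales.groupBy("country").agg(F.sum("amount").alias("total")).show()
```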
What’s one scenario where Window functions are advantageous over GroupBy aggregations?

Have a look at this example

Basically, Window functions allow you to maintain all of your original rows (without having to collapse/summarize them per group)

Of course, there are times when you’d want to aggregate instead of window as well - it depends on the query/business question!
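Here's a minimal PySpark sketch of the window-function case, using the same toy `sales` shape as the GROUP BY example above - every original row survives, annotated with its group's total:

```python
# Same toy data as the GROUP BY example, but with a window function:
# every original row survives, annotated with its group's total.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("window-demo").getOrCreate()

sales = spark.createDataFrame(
    [("uk", 10), ("uk", 20), ("us", 5)], ["country", "amount"]
)

w = Window.partitionBy("country")

# Useful for questions like "what share of its country's total is each sale?"
sales.withColumn("country_total", F.sum("amount").over(w)) \
     .withColumn("share", F.col("amount") / F.col("country_total")) \
     .show()
```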

What kinds of operations often induce shuffling in your Spark job?

“Wide transformations” (a.k.a. wide dependencies) such as joins, aggregations, and window functions. Implication: shuffles move data between executors over the network, which can really slow down your Spark job.
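You can spot shuffles yourself by looking at the physical plan - wide transformations show up as `Exchange` nodes. A small sketch:

```python
# Wide transformations appear as "Exchange" nodes in the physical plan.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

df = spark.range(1_000_000).withColumn("key", F.col("id") % 10)

# groupBy is wide: rows sharing a key must be moved to the same executor,
# so the printed plan contains an Exchange (i.e. a shuffle).
df.groupBy("key").count().explain()
```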

What are some of the most powerful ways you can optimize your Spark job?
Partition your data smartly such that the most common filters and groupBys in your queries don’t have to scan/shuffle all partitions unnecessarily
Joins are often important and inevitable...so how might you optimize a join to minimize shuffling?
Via a broadcast join (when one side of the join is small enough to broadcast), OR partition your data such that rows with the same join key aren’t spread across too many different partitions (see the sketch below)
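A sketch of the broadcast-join approach in PySpark, assuming the dimension table is small enough to fit in each executor's memory (the table contents are made up):

```python
# Broadcast join: ship the small table to every executor so the big
# table never has to be shuffled across the network for this join.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

big_facts = spark.range(10_000_000).withColumn("country_id", F.col("id") % 2)
small_dim = spark.createDataFrame(
    [(0, "uk"), (1, "us")], ["country_id", "country"]
)

joined = big_facts.join(F.broadcast(small_dim), on="country_id")
joined.explain()  # look for BroadcastHashJoin instead of SortMergeJoin
```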
What’s a good rule of thumb for partitioning?
Partition on columns that you would typically filter/groupBy on

A partition should ideally contain anywhere between 256 MB and 1 GB of data. Too many small partitions (each containing mere kilobytes) means lots of small files - and that’s bad!

For the small-file reason above, you generally shouldn’t partition on high-cardinality columns
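Putting the rule of thumb together, here's a sketch of writing data partitioned on a low-cardinality column that queries frequently filter on (paths and column names are hypothetical):

```python
# Writing data partitioned on a low-cardinality column that queries
# frequently filter on. Paths and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

events = spark.read.parquet("raw/events")

# `event_date` has a bounded number of values (one folder per day);
# partitioning on something like user_id would create millions of tiny files.
events.write.partitionBy("event_date").mode("overwrite").parquet("clean/events")

# Queries filtering on the partition column only scan the matching folders:
spark.read.parquet("clean/events").filter("event_date = '2024-01-01'").count()
```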