Skip to main content
Version: 2.0

SQL: A Quick Review

When working with data, a solid knowledge of SQL is required to help think about your problem in terms of DataFrame transformations.

sql-joins.png

Have a quick scan through the topics that you're not too familiar with:

A Deep Dive on Window Functions

current-row.png

  • What limitations have you faced in your analytics when using GROUP BY in the past?
  • How can you do rankings per each category efficiently?
  • How can you get the entire row of the best-selling product? (imagine you want to see additional attributes of this product)

window-function-usage.png

It is important to make sure you understand everything about window function semantics.

  • partitionBy
    • similar to groupBy (but doesn’t reduce/aggregate information down to a single row)
  • orderBy
    • Just FYI, you can orderBy(F.col(“myColumn”).desc()) for descending order.
  • rowsBetween
    • You should look up the default arguments for different window functions (e.g. lag, lead, first, last, max, min, row_number)
  • rangeBetween Can you explain the difference between rowsBetween and rangeBetween?

These are all also concepts in SQL and not special to Spark.

Exercise: Windows

The next exercises uses Windows for several critical sections and we find that people are more successful in completing the exercises if they are familiarised with Windows. Follow the instructions here for the Windows Walkthrough.