Version: 1.0

Apache Spark Primer

Introduction to Spark

A Must-Read: Slides from Brooke Wenig

Important comments on Frank Kane’s video

  • Scala is faster than Python only if you’re writing a lot of custom UDFs or working with custom data structures in RDDs. If you stick to built-in Spark functions, performance is essentially identical, since both languages drive the same underlying engine.
  • Most modern Spark users are shifting towards a Python codebase to take advantage of modern data science and machine learning tools; see the next slides for empirical evidence 😉

Takeaways

  • What is the difference between Transformations (lazy evaluation) and Actions?
  • What is the difference between Driver and Worker nodes?
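
To make the first takeaway concrete, here is a minimal pure-Python sketch of lazy evaluation. It is only an analogy and does not use Spark: the class `LazyDataset` and its `from_list` helper are hypothetical names invented for this illustration. Its `filter` and `map` methods play the role of transformations (they only describe work), and `collect` plays the role of an action (it triggers the work). A shared `log` list records when each element is actually processed.

```python
from typing import Callable, Iterator, List


class LazyDataset:
    """Toy stand-in for a distributed dataset. Hypothetical; NOT a Spark API."""

    def __init__(self, make_iter: Callable[[], Iterator[int]], log: List[str]) -> None:
        self._make_iter = make_iter  # recipe for producing the data, not the data itself
        self._log = log              # records when work is actually performed

    @classmethod
    def from_list(cls, items: List[int], log: List[str]) -> "LazyDataset":
        return cls(lambda: iter(items), log)

    def filter(self, pred: Callable[[int], bool]) -> "LazyDataset":
        # Transformation: returns a new recipe; nothing runs yet.
        def make_iter() -> Iterator[int]:
            for item in self._make_iter():
                self._log.append(f"filter {item}")
                if pred(item):
                    yield item
        return LazyDataset(make_iter, self._log)

    def map(self, fn: Callable[[int], int]) -> "LazyDataset":
        # Transformation: also lazy, it just extends the recipe.
        def make_iter() -> Iterator[int]:
            for item in self._make_iter():
                self._log.append(f"map {item}")
                yield fn(item)
        return LazyDataset(make_iter, self._log)

    def collect(self) -> List[int]:
        # Action: runs the whole chain and returns concrete results.
        return list(self._make_iter())


log: List[str] = []
numbers = LazyDataset.from_list([1, 2, 3, 4], log)

# Chaining transformations only builds up a plan.
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
work_before_action = len(log)

# The action is what finally executes the plan.
result = evens_squared.collect()
work_after_action = len(log)

print(work_before_action, result, work_after_action)
```

In this sketch no element is touched until `collect` is called. Spark follows the same idea at a much larger scale: transformations build up a plan on the driver, and an action causes that plan to be executed on the workers.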

Spark SQL Programming Guide (Python or Scala recommended)

Are You Still a Python Hater?

[Figures: Spark usage in 2013 (spark-usage-2013.png) and in 2020 (spark-usage-2020.png)]

Data + AI Summit 2020

One of the keynote presentations from the Chief Architect of Databricks

Project Zen: Making Spark Pythonic | Reynold Xin | Keynote Data + AI Summit EU 2020

  • Spark is heading towards more idiomatic Python, including type hints
  • Improving Python debugging is on the Databricks roadmap
  • There’s no denying the rich ecosystem of libraries, especially for advanced analytics & ML

Are you a Pandaphile?

Try Databricks Koalas

  • Koalas is PySpark with Pandas syntax and APIs
  • If you’re interested, try it out and let us know what you think!
  • However, for the rest of this course, we’ll be teaching you the following APIs separately:
    • Spark DataFrames
    • Pandas