Apache Spark Primer
Introduction to Spark
A Must-Read: Slides from Brooke Wenig
Important comments on Frank Kane’s video
- Scala is faster than Python only if you’re writing a lot of custom UDFs or custom data structures with RDDs. If you’re using built-in Spark functions, performance is essentially identical.
- Most modern Spark users are shifting towards a Python codebase to take advantage of modern data science and machine learning tools - see next slides for empirical evidence 😉
Takeaways
- What is the difference between Transformations (lazy evaluation) and Actions?
- What is the difference between Driver and Worker nodes?
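To make the first takeaway concrete, here is a minimal plain-Python analogy, not Spark code. Generators stand in for lazy transformations, and `list(...)` stands in for an action. The helper names `lazy_map` and `lazy_filter` and the `log` list are hypothetical and exist only for this illustration. In Spark itself, `map` and `filter` are transformations, while `collect` and `count` are actions.

```python
# Plain-Python analogy only: generators mimic Spark's lazy transformations.
# lazy_map, lazy_filter and log are hypothetical names, not Spark APIs.

log = []  # records each time a row is actually processed


def lazy_map(func, rows):
    """'Transformation': returns a generator, so no row is processed yet."""
    for row in rows:
        log.append(f"map {row}")
        yield func(row)


def lazy_filter(pred, rows):
    """'Transformation': also lazy, it only describes the work to do."""
    for row in rows:
        if pred(row):
            yield row


data = [1, 2, 3, 4]

# Building the pipeline is like chaining transformations: no work happens.
pipeline = lazy_filter(lambda x: x > 4, lazy_map(lambda x: x * 2, data))
work_before_action = len(log)

# Consuming the pipeline is like calling an action: now every row is processed.
result = list(pipeline)
work_after_action = len(log)

print(work_before_action, work_after_action, result)
```

Nothing is logged until `list(pipeline)` runs, which is the same reason a chain of Spark transformations costs nothing until an action asks for a result.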
Spark SQL Programming Guide (Python or Scala recommended)
Are You Still a Python Hater?
Data + AI Summit 2020
One of the keynote presentations from the Chief Architect of Databricks
Project Zen: Making Spark Pythonic | Reynold Xin | Keynote Data + AI Summit EU 2020
- PySpark is heading towards more idiomatic Python, including type hints
- Improving Python debugging is on the Databricks roadmap
- There’s no denying the rich ecosystem of libraries, especially for advanced analytics & ML
Are you a Pandaphile?
Try Databricks Koalas
- PySpark but using Pandas syntax and APIs
- If you’re interested, try it out and let us know what you think!
- However, for the rest of this course, we’ll be teaching you the following APIs separately:
  - Spark DataFrames
  - Pandas