Version: 1.0

Apache Spark

Introduction to Spark

Reflect

  • Understand the difference between transformations (lazily evaluated) and actions (which trigger the actual computation)
  • Understand the difference between driver and worker nodes
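To make the first point concrete, here is a minimal plain-Python sketch of lazy evaluation. It is only an analogy and does not use Spark: Python's built-in `map` is lazy in a way that resembles a Spark transformation such as `rdd.map(...)`, and `list(...)` plays the role of an action such as `collect()`. The names `calls` and `traced_double` are made up for this illustration.

```python
calls = []  # records every element the function actually processes


def traced_double(x):
    """Double x and log the call so we can see when work really happens."""
    calls.append(x)
    return x * 2


# "Transformation": this only describes the work; nothing is computed yet.
doubled = map(traced_double, [1, 2, 3])
calls_before_action = list(calls)  # snapshot of the log before any action

# "Action": asking for the results forces the pipeline to run.
result = list(doubled)
calls_after_action = list(calls)  # snapshot of the log after the action
```

Before the "action" the call log is still empty; the function runs only when the results are requested. Spark behaves similarly: transformations build up a plan, and the plan is executed when an action is called.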

Spark Cluster Topology

Distributed systems can be hard to grasp at first. Here are the basics you must know:

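  • It is very important to understand the difference between the driver and the worker nodes: the driver runs your main program and schedules the work, while the workers execute tasks on partitions of the data
  • Variables and data in your driver program are not automatically accessible or editable on your worker nodes
  • You’ll need to either define those constants inside your UDF (see the Databricks exercises) or use broadcast variables (read-only data shared with the workers) and accumulators (values that workers can only add to and that the driver reads)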

Figure: Spark cluster topology (spark-cluster-topology.png)
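The broadcast and accumulator variables mentioned above could be used roughly as follows. This is a minimal PySpark sketch, not the approach from the Databricks exercises: the country table and the names `lookup_country` and `run_demo` are hypothetical, and the `local[2]` master is an assumption chosen so the example can run on a single machine with `pyspark` installed.

```python
def lookup_country(code, table):
    """Pure helper (hypothetical): uses only the values passed in."""
    return table.get(code, "unknown")


def run_demo():
    """Resolve country codes on the workers using shared variables."""
    # Imported here so the helper above is usable without Spark installed.
    from pyspark.sql import SparkSession

    # Assumption: a local two-thread "cluster" for illustration only.
    spark = (
        SparkSession.builder.master("local[2]")
        .appName("driver-vs-worker")
        .getOrCreate()
    )
    sc = spark.sparkContext

    # Broadcast: read-only data shipped once to each worker.
    country_names = sc.broadcast({"US": "United States", "DE": "Germany"})
    # Accumulator: workers can only add to it; the driver reads the total.
    unknown_codes = sc.accumulator(0)

    def resolve(code):
        name = lookup_country(code, country_names.value)
        if name == "unknown":
            unknown_codes.add(1)
        return name

    # map() is a transformation; collect() is the action that runs the job.
    names = sc.parallelize(["US", "DE", "XX"]).map(resolve).collect()
    result = (names, unknown_codes.value)
    spark.stop()
    return result
```

Calling `run_demo()` from a script or notebook returns the resolved names together with the number of unknown codes. Note that the accumulator is only meaningful after an action has run, and updates made inside a transformation may be applied more than once if Spark re-executes a task.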