Apache Spark Core Concepts: RDDs, DAGs, Job Execution, and Deployment Modes
RDD Operations and Core Abstractions
Spark applications manipulate data through Resilient Distributed Datasets (RDDs), which serve as the foundational data structure. A typical word count operation demonstrates the transformation pipeline:
val textFile = sparkContext.textFile("hdfs://cluster/data/input.txt")
val words = textFile.flatMap(line => line.split(" "))
val pairs = words.map(word => (word, 1))
val counts = pairs.reduceByKey((x, y) => x + y)
val result = counts.collect()

The collect() action triggers the actual execution, returning all elements of the result RDD to the driver program as an array.
Core Terminology
RDD (Resilient Distributed Dataset): An immutable, partitioned collection of elements that can be operated on in parallel. Each RDD consists of multiple partitions distributed across cluster nodes, enabling parallel processing.
DAG (Directed Acyclic Graph): A directed graph with no directed cycles that represents the logical execution plan. The DAG scheduler builds a graph of RDD dependencies, known as lineage, which tracks all transformations applied to the base data.
Stage: A collection of tasks that share the same shuffle dependencies. The DAG scheduler divides the job into stages based on shuffle boundaries. Tasks within a stage can execute in parallel without network transfers.
Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (such as collect(), count(), or saveAsTextFile()). Each action triggers one job.
Task: The smallest unit of work sent to an executor. Each task processes one partition of an RDD. The number of tasks per stage equals the number of partitions in the final RDD of that stage.
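These terms can be seen together in a short Scala sketch (the partition count, names, and data below are illustrative, not taken from the original example):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("terminology-demo").setMaster("local[*]"))

// An RDD is a partitioned collection: here we request 4 partitions explicitly.
val numbers = sc.parallelize(1 to 1000, numSlices = 4)
println(numbers.getNumPartitions)   // 4, so each stage over this RDD runs 4 tasks

// Transformations only extend the lineage (the DAG); nothing executes yet.
val evens = numbers.filter(_ % 2 == 0).map(_ * 10)
println(evens.toDebugString)        // prints the lineage Spark uses for recovery

// An action triggers a job, which the DAG scheduler splits into stages and tasks.
val total = evens.count()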
Job Submission Lifecycle
When a Spark application submits a job, the following sequence occurs:
- The spark-submit script initializes a SparkContext, which instantiates the DAGScheduler and TaskScheduler.
- The TaskScheduler communicates with the cluster manager (Master) to register the application.
- The Master allocates resources across Worker nodes, launching Executor processes for the application.
- Executors register themselves back with the TaskScheduler on the driver. Once registration completes, the SparkContext initialization finishes.
- Upon encountering an action operation, the driver creates a Job and submits it to the DAGScheduler.
- The DAGScheduler constructs the DAG, identifies shuffle boundaries, and partitions the job into stages. Each stage generates a TaskSet.
- The TaskScheduler dispatches tasks from each TaskSet to available executors.
- Executors receive tasks through their thread pools, deserialize the code, and execute the task logic on assigned partitions.
Deployment Architectures
Spark supports multiple cluster managers:
Local Mode: A single-machine setup where all Spark components run within a single JVM. Ideal for development, testing, and learning. No cluster configuration required.
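For instance, a local-mode context can be created directly in code with a local[*] master URL (the application name below is arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

// local[*] runs the driver and executor logic as threads inside one JVM,
// with as many worker threads as the machine has cores.
val sc = new SparkContext(new SparkConf().setAppName("local-demo").setMaster("local[*]"))
println(sc.parallelize(1 to 100).sum())   // 5050.0
sc.stop()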
Standalone Mode: Spark's built-in cluster manager consisting of Master and Worker nodes. The Master handles resource allocation, while Workers manage executors. Deployment requires SSH access to all nodes and consistent directory structures. Submission example:
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://master-node:7077 \
/path/to/spark-examples_2.12-3.4.0.jar \
  100

YARN Mode: Two variants exist: Client and Cluster. In Client mode, the driver runs on the submitting machine, making it suitable for interactive sessions. In Cluster mode, the driver runs within the YARN ApplicationMaster, allowing the client to disconnect after submission. YARN deployment requires setting the HADOOP_CONF_DIR or YARN_CONF_DIR environment variable.
# YARN Client Mode
./bin/spark-submit \
--class com.example.MySparkApp \
--master yarn \
--executor-memory 2G \
--num-executors 3 \
/path/to/application.jar
# YARN Cluster Mode
./bin/spark-submit \
--class com.example.MySparkApp \
--master yarn \
--deploy-mode cluster \
--executor-memory 2G \
/path/to/application.jar

View application logs using: yarn logs -applicationId application_1234567890_0001
Dependency Types: Narrow vs Wide
Narrow Dependencies: Each parent RDD partition contributes to at most one child partition. Examples include map, filter, and union. These operations enable pipelined execution on a single node without data movement.
Wide Dependencies: Multiple child partitions depend on the same parent partition. Examples include groupByKey, reduceByKey, and join. These require shuffle operations, where data must cross network boundaries between executors.
The distinction impacts both performance and fault tolerance. Narrow dependencies support efficient pipelining and quick recovery—only the lost partition needs recomputation. Wide dependencies introduce network overhead and require all parent partitions to be available before shuffle can proceed.
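The boundary is visible in an RDD's lineage. Reusing the sparkContext from the word count example, a sketch like the following shows that map stays in the same stage while reduceByKey introduces a shuffle; the indentation change in the toDebugString output marks the new stage:

val tokens = sparkContext.textFile("hdfs://cluster/data/input.txt").flatMap(_.split(" "))
val pairs = tokens.map(word => (word, 1))   // narrow: pipelined with flatMap in one stage
val counts = pairs.reduceByKey(_ + _)       // wide: shuffle boundary starts a new stage
println(counts.toDebugString)               // a ShuffledRDD appears where the stage splits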
Performance Advantages Over MapReduce
Spark SQL outperforms traditional Hive on MapReduce for several reasons:
- Reduced Disk I/O: MapReduce writes intermediate results to disk after every map-reduce cycle and materializes them in HDFS between chained jobs. Spark keeps intermediate RDDs in memory when they are cached, dramatically reducing I/O for iterative algorithms (see the caching sketch after this list).
- Stage Consolidation: MapReduce enforces a strict map-shuffle-reduce pattern, so multi-step pipelines must be expressed as chains of separate jobs. Spark's RDD model chains multiple narrow transformations within a single stage, eliminating unnecessary data movement.
- JVM Efficiency: MapReduce launches a new JVM for each task, incurring startup overhead of seconds per task. Spark reuses JVM threads within executors, amortizing startup costs across thousands of tasks.
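The caching sketch mentioned above (file path, record layout, and iteration count are illustrative) shows how an iterative workload reuses an in-memory RDD instead of re-reading its input on every pass:

val ratings = sparkContext.textFile("hdfs://cluster/data/ratings.txt")
  .map(_.split(","))
  .map(fields => (fields(0), fields(1).toDouble))
  .cache()                                   // keep the parsed RDD in executor memory after first use

// Each pass reuses the cached partitions instead of re-reading and re-parsing the file.
for (i <- 1 to 10) {
  val avg = ratings.values.sum() / ratings.count()
  println(s"iteration $i, average rating $avg")
}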
Build Configuration
When packaging Spark applications with Maven, include the assembly plugin to create a fat JAR containing all dependencies:
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</plugin>

Build with: mvn clean package assembly:single