Fading Coder

One Final Commit for the Last Sprint


Apache Spark Core Concepts: RDDs, DAGs, Job Execution, and Deployment Modes


RDD Operations and Core Abstractions

Spark applications manipulate data through Resilient Distributed Datasets (RDDs), which serve as the foundational data structure. A typical word count operation demonstrates the transformation pipeline:

val textFile = sparkContext.textFile("hdfs://cluster/data/input.txt")
val words = textFile.flatMap(line => line.split(" "))
val pairs = words.map(word => (word, 1))
val counts = pairs.reduceByKey((x, y) => x + y)
val result = counts.collect()

The collect() call is an action: only actions trigger execution. It materializes the entire lineage and returns every element of the result RDD to the driver program as an array, so it should be used with care on large datasets.

Core Terminology

RDD (Resilient Distributed Dataset): An immutable, partitioned collection of elements that can be operated on in parallel. Each RDD consists of multiple partitions distributed across cluster nodes, enabling parallel processing.

DAG (Directed Acyclic Graph): A directed graph with no directed cycles that represents the logical execution plan. The DAG scheduler builds a graph of RDD dependencies, known as lineage, which tracks all transformations applied to the base data.
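The lineage can be inspected directly. Assuming the counts RDD from the word-count example above, toDebugString prints the dependency graph the DAG scheduler works from:

```scala
// Print the RDD lineage; indentation marks a shuffle (stage) boundary.
println(counts.toDebugString)
// Output is roughly: ShuffledRDD <- MapPartitionsRDD <- MapPartitionsRDD <- HadoopRDD
```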

Stage: A set of tasks that can execute without crossing a shuffle boundary. The DAG scheduler divides each job into stages at shuffle dependencies; tasks within a stage run the same code in parallel on different partitions without network transfers.

Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (such as collect(), count(), or saveAsTextFile()). Each action triggers one job.

Task: The smallest unit of work sent to an executor. Each task processes one partition of an RDD. The number of tasks per stage equals the number of partitions in the final RDD of that stage.
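Partition count, and therefore task count, can be controlled when an RDD is created or afterwards (the path below is illustrative):

```scala
// Request at least 8 input partitions; the stage reading this file
// runs one task per partition.
val data = sparkContext.textFile("hdfs://cluster/data/input.txt", minPartitions = 8)
println(data.getNumPartitions)

// repartition() inserts a shuffle; the stage that follows it runs 16 tasks.
val wider = data.repartition(16)
```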

Job Submission Lifecycle

When a Spark application submits a job, the following sequence occurs:

  1. The spark-submit script initializes a SparkContext, which instantiates the DAGScheduler and TaskScheduler.
  2. The TaskScheduler communicates with the cluster manager (Master) to register the application.
  3. The Master allocates resources across Worker nodes, launching Executor processes for the application.
  4. Executors register themselves back with the TaskScheduler on the driver. Once registration completes, the SparkContext initialization finishes.
  5. Upon encountering an action operation, the driver creates a Job and submits it to the DAGScheduler.
  6. The DAGScheduler constructs the DAG, identifies shuffle boundaries, and partitions the job into stages. Each stage generates a TaskSet.
  7. The TaskScheduler dispatches tasks from each TaskSet to available executors.
  8. Executors receive tasks through their thread pools, deserialize the code, and execute the task logic on assigned partitions.
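The lifecycle above starts from a driver program such as this minimal sketch (object name and argument paths are illustrative; the master URL is supplied via spark-submit):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountApp") // master set by spark-submit
    val sc = new SparkContext(conf) // steps 1-4: schedulers created, executors registered
    sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(args(1)) // steps 5-8: action triggers job submission
    sc.stop()
  }
}
```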

Deployment Architectures

Spark supports multiple cluster managers:

Local Mode: A single-machine setup where all Spark components run within a single JVM. Ideal for development, testing, and learning. No cluster configuration required.
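In local mode the master can be set directly in code; local[*] uses one worker thread per available core (a sketch for development use):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// local[*]: run driver and executor logic in this JVM, one thread per core;
// local[2] would cap it at two threads.
val conf = new SparkConf().setAppName("dev-test").setMaster("local[*]")
val sc = new SparkContext(conf)
```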

Standalone Mode: Spark's built-in cluster manager consisting of Master and Worker nodes. The Master handles resource allocation, while Workers manage executors. Deployment requires SSH access to all nodes and consistent directory structures. Submission example:

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://master-node:7077 \
  /path/to/spark-examples_2.12-3.4.0.jar \
  100

YARN Mode: Two variants exist—Client and Cluster. In Client mode, the driver runs on the submitting machine, making it suitable for interactive sessions. In Cluster mode, the driver runs within the YARN ApplicationMaster, allowing the client to disconnect after submission. YARN deployment requires setting HADOOP_CONF_DIR or YARN_CONF_DIR environment variables.

# YARN Client Mode (driver runs on the submitting machine)
./bin/spark-submit \
  --class com.example.MySparkApp \
  --master yarn \
  --deploy-mode client \
  --executor-memory 2G \
  --num-executors 3 \
  /path/to/application.jar

# YARN Cluster Mode (driver runs inside the YARN ApplicationMaster)
./bin/spark-submit \
  --class com.example.MySparkApp \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 2G \
  /path/to/application.jar

View application logs using: yarn logs -applicationId application_1234567890_0001

Dependency Types: Narrow vs Wide

Narrow Dependencies: Each parent RDD partition contributes to at most one child partition. Examples include map, filter, and union. These operations enable pipelined execution on a single node without data movement.

Wide Dependencies: Multiple child partitions depend on the same parent partition. Examples include groupByKey, reduceByKey, and join. These require shuffle operations, where data must cross network boundaries between executors.

The distinction impacts both performance and fault tolerance. Narrow dependencies support efficient pipelining and quick recovery—only the lost partition needs recomputation. Wide dependencies introduce network overhead and require all parent partitions to be available before shuffle can proceed.
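The dependency type is visible on the RDD itself. A sketch, assuming an existing SparkContext named sc:

```scala
val base = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)

val mapped = base.mapValues(_ + 1)    // narrow
val reduced = base.reduceByKey(_ + _) // wide

println(mapped.dependencies)  // typically a OneToOneDependency
println(reduced.dependencies) // a ShuffleDependency: a stage boundary sits here
```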

Performance Advantages Over MapReduce

Spark SQL outperforms traditional Hive on MapReduce for several reasons:

  1. Reduced Disk I/O: MapReduce materializes intermediate results to disk between successive jobs. Spark still writes shuffle files to local disk, but it can cache intermediate RDDs in memory across stages and jobs, dramatically reducing I/O for iterative algorithms.
  2. Stage Consolidation: MapReduce enforces a strict map-shuffle-reduce pattern for every stage. Spark's RDD model chains multiple transformations within a single stage, eliminating unnecessary data movement.
  3. JVM Efficiency: MapReduce launches a new JVM for each task, incurring startup overhead of seconds per task. Spark reuses JVM threads within executors, amortizing startup costs across thousands of tasks.
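Point 1 is what explicit caching exploits. A hedged sketch of an iterative loop, where parse, step, and initialModel are hypothetical helpers:

```scala
// cache() keeps the parsed RDD in executor memory after the first action,
// so later iterations skip both the HDFS read and the parsing work.
val ratings = sc.textFile("hdfs://cluster/data/ratings.txt")
  .map(parse) // `parse` is a hypothetical record parser
  .cache()

var model = initialModel // hypothetical starting point
for (_ <- 1 to 10) {
  model = step(model, ratings) // each `step` re-reads `ratings` from memory
}
```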

Build Configuration

When packaging Spark applications with Maven, include the assembly plugin to create a fat JAR containing all dependencies:

<plugin>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
</plugin>

Build with: mvn clean package assembly:single
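Alternatively, binding the plugin's single goal to the package phase lets a plain mvn clean package produce the fat JAR (a common variation, not required):

```xml
<plugin>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```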
