Apache Flink vs Spark Streaming: Core Concepts, Deployment, and Word Count Examples
Apache Spark and Apache Flink are both widely used big data processing frameworks, but they differ significantly in their stream processing architectures. Spark handles streams with Spark Streaming's micro-batch model, while Flink is designed as a true stream processing engine with unified stream and batch processing.
Spark vs. Flink Stream Processing
Spark Core Components
- Spark SQL: Handles structured data with SQL queries and integrates with RDDs and Hive.
- Spark Streaming: Processes real-time data via micro-batches, dividing streams into small time-based chunks for batch computation (a minimal example follows this list).
- MLlib: Provides machine learning algorithms and tools for model training/evaluation.
- GraphX: A graph computing library for analyzing graph data and executing graph algorithms.
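To make the micro-batch model concrete, here is a minimal Spark Streaming word count sketch in Java. It assumes a Spark 2.x spark-streaming dependency on the classpath; the host, port, and 1-second batch interval are illustrative:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class MicroBatchWordCount {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("MicroBatchWordCount");
        // Each micro-batch covers 1 second of input
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("127.0.0.1", 9000);
        JavaPairDStream<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);
        counts.print(); // counts are per batch, not a running total
        jssc.start();
        jssc.awaitTermination();
    }
}

Note the contrast with the Flink streaming example later in this article: here each batch is counted independently, whereas Flink's keyed sum maintains running state across the unbounded stream.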
Flink Core Components
- DataStream API: For real-time stream processing, supporting event time handling, window operations, and state management with high throughput.
- DataSet API: For batch processing, similar to Hadoop MapReduce but with richer operators and optimizations.
- Stateful Stream Processing: Enables maintaining and managing state during processing, crucial for complex business logic.
- Event Time Processing: Handles out-of-order events and ensures accurate window computation results (see the sketch after this list).
- Table API/SQL: Allows SQL-like syntax for querying and analyzing both stream and batch data.
- Ecosystem Integration: Connects to Kafka, Elasticsearch, JDBC, HDFS, and Amazon S3.
- Cluster Deployment: Runs on Kubernetes, YARN, Mesos, or Standalone clusters.
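To make the DataStream API's event-time handling concrete, here is a minimal sketch against the Flink 1.7 API used later in this article. The input format ("<epochMillis> <word>" per line), host/port, 5-second out-of-orderness bound, and 10-second window are all assumptions for illustration:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Use the timestamps carried by the events, not the wall clock
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        DataStream<Tuple2<String, Long>> events = env
                .socketTextStream("127.0.0.1", 9000)
                // Parse "<epochMillis> <word>" into (word, timestamp)
                .map(new MapFunction<String, Tuple2<String, Long>>() {
                    @Override
                    public Tuple2<String, Long> map(String line) {
                        String[] parts = line.split("\\s+");
                        return Tuple2.of(parts[1], Long.parseLong(parts[0]));
                    }
                })
                // Tolerate events arriving up to 5 seconds out of order
                .assignTimestampsAndWatermarks(
                        new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.seconds(5)) {
                            @Override
                            public long extractTimestamp(Tuple2<String, Long> event) {
                                return event.f1;
                            }
                        });

        events
                // Replace the timestamp field with a count of 1
                .map(new MapFunction<Tuple2<String, Long>, Tuple2<String, Long>>() {
                    @Override
                    public Tuple2<String, Long> map(Tuple2<String, Long> event) {
                        return Tuple2.of(event.f0, 1L);
                    }
                })
                .keyBy(0)
                .timeWindow(Time.seconds(10)) // 10-second tumbling event-time windows
                .sum(1)
                .print();

        env.execute("Event-time Windowed Word Count");
    }
}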
Flink Deployment
Standalone Cluster
Standalone is a simple deployment mode for development/testing. Key steps:
- Download Flink from Apache Flink's official website or mirrors like Aliyun.
- Extract the archive and modify conf/flink-conf.yaml (e.g., jobmanager.rpc.address, taskmanager.numberOfTaskSlots); an example follows this list.
- Start the cluster with ./bin/start-cluster.sh.
- Submit jobs using ./bin/flink run -c com.example.MainClass ./path/to/app.jar.
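For reference, a minimal conf/flink-conf.yaml for a small standalone cluster might look like this (the hostname and slot count are illustrative; adjust to your machines):

jobmanager.rpc.address: master-host
jobmanager.rpc.port: 6123
taskmanager.numberOfTaskSlots: 2
parallelism.default: 1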
Docker Deployment
Use Docker Compose for quick setup:
- Create a flink directory.
- Write a docker-compose.yml file to define JobManager and TaskManager services (a minimal example follows this list).
- Start the cluster: docker-compose up -d.
- Access the web UI at http://<server-ip>:8081/.
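A minimal docker-compose.yml along the lines of the official Flink images might look like the following (the image tag matches the 1.7.2 version used later in this article; adjust as needed):

version: "2.1"
services:
  jobmanager:
    image: flink:1.7.2
    command: jobmanager
    ports:
      - "8081:8081"
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager
  taskmanager:
    image: flink:1.7.2
    command: taskmanager
    depends_on:
      - jobmanager
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager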
Flink Quick Start: Word Count
Batch Processing Example
Requirements: Count word occurrences in a file and output to another file.
Dependencies:
<!-- Flink core -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<version>1.7.2</version>
</dependency>
<!-- Flink stream processing (provided scope for cluster execution) -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_2.12</artifactId>
<version>1.7.2</version>
<scope>provided</scope>
</dependency>
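Note: because the streaming artifact uses provided scope, it is assumed to be supplied by the cluster's classpath at runtime. To run the streaming example below directly from an IDE, either drop the provided scope or configure the IDE to put provided dependencies on the runtime classpath.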
Java Code:
package com.example.wordcount.batch;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;
public class BatchWordCounter {
    public static void main(String[] args) throws Exception {
        String inputFile = "D:\\data\\input\\sample.txt";
        String outputFile = "D:\\data\\output\\word_counts.csv";
        // Create the batch execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // Read the input file line by line
        DataSet<String> rawData = env.readTextFile(inputFile);
        // Split each line into (word, 1) pairs
        DataSet<Tuple2<String, Integer>> wordPairs = rawData.flatMap(new WordSplitter());
        // Group by the word (field 0) and sum the counts (field 1)
        DataSet<Tuple2<String, Integer>> counts = wordPairs.groupBy(0).sum(1);
        // Write results as space-separated pairs, one per line, into a single file
        counts.writeAsCsv(outputFile, "\n", " ").setParallelism(1);
        env.execute();
    }

    static class WordSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
            // Split on whitespace and emit each word with an initial count of 1
            String[] words = line.split("\\s+");
            for (String word : words) {
                out.collect(new Tuple2<>(word, 1));
            }
        }
    }
}
Stream Processing Example
Requirements: Use a socket as a real-time data source (via nc), maintain a continuously updated count of each word, and print results as data arrives.
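Before starting the job, open a netcat listener on the chosen port so the socket source has something to connect to (the port must match the one in the code below):

nc -lk 9000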
Java Code:
package com.example.wordcount.stream;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
public class StreamWordCounter {
    public static void main(String[] args) throws Exception {
        String host = "127.0.0.1";
        int port = 9000;
        // Create the streaming execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Read newline-delimited text from the socket source (start nc first, see above)
        DataStreamSource<String> socketStream = env.socketTextStream(host, port, "\n");
        SingleOutputStreamOperator<Tuple2<String, Long>> wordCounts = socketStream
                .flatMap(new StreamWordSplitter()) // split lines into (word, 1) pairs
                .keyBy(0)                          // partition the stream by word
                .sum(1);                           // maintain a running count per word
        wordCounts.print();
        env.execute("Real-time Word Count");
    }

    static class StreamWordSplitter implements FlatMapFunction<String, Tuple2<String, Long>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Long>> out) throws Exception {
            String[] words = line.split("\\s+");
            for (String word : words) {
                out.collect(Tuple2.of(word, 1L));
            }
        }
    }
}
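With the nc session feeding lines, each occurrence of a word produces an updated running total, since sum(1) keeps per-key state. print() prefixes each record with the index of the subtask that produced it, so the console output looks roughly like 3> (hello,1) followed later by 3> (hello,2).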
Flink Architecture
Job Submission Process
- Initialization: Start the Master (JobManager) and TaskManager processes via bin/start-cluster.sh.
- Resource Registration: TaskManagers register their available slots with the ResourceManager.
- Client Submission: User submits the job via the Flink client, which generates a JobGraph (see the CLI example after this list).
- Dispatcher & JobManager: Dispatcher receives the JobGraph and starts a JobManager for the job.
- Resource Allocation: JobManager requests resources from ResourceManager, which allocates idle slots.
- Task Deployment: JobManager deploys tasks to TaskManagers' slots for execution.
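As a concrete illustration of the client step, a job can be submitted in detached mode with an explicit parallelism (-d and -p are standard Flink CLI flags; the class and jar names reuse the placeholders from the standalone section):

./bin/flink run -d -p 2 -c com.example.MainClass ./path/to/app.jar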
Core Components
- Client: Compiles the application, generates the JobGraph, and submits it to the Dispatcher.
- Dispatcher: Accepts job submissions and starts JobManagers for each application.
- JobManager: Coordinates job execution, manages task status, and handles fault recovery.
- ResourceManager: Manages cluster resources (slots) and allocates them to JobManagers.
- TaskManager: Executes actual computation tasks, manages slots, and communicates with other TaskManagers.
Component Stack
- Deployment Layer: Supports Local (single JVM / single node), Cluster (Standalone/YARN/Kubernetes), and Cloud (AWS/GCP/Alibaba Cloud) modes.
- Runtime Layer: Handles distributed execution, fault tolerance (checkpoints), and task scheduling.
- API Layer: Provides DataStream API (stream) and DataSet API (batch) for building applications.
- Higher-Level Tools: CEP (Complex Event Processing), Gelly (Graph Computing), Table API/SQL, and PyFlink (Python support); a small CEP sketch follows this list.
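Of these, CEP is worth a brief illustration. The sketch below assumes the flink-cep dependency for the same 1.7.x version; the "two ERROR lines in direct succession within 10 seconds" pattern and the socket source are invented for illustration:

import java.util.List;
import java.util.Map;
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class ErrorPairAlert {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> logs = env.socketTextStream("127.0.0.1", 9000);

        // Match two ERROR lines in direct succession within 10 seconds
        Pattern<String, String> pattern = Pattern.<String>begin("first")
                .where(new SimpleCondition<String>() {
                    @Override
                    public boolean filter(String line) {
                        return line.startsWith("ERROR");
                    }
                })
                .next("second")
                .where(new SimpleCondition<String>() {
                    @Override
                    public boolean filter(String line) {
                        return line.startsWith("ERROR");
                    }
                })
                .within(Time.seconds(10));

        PatternStream<String> matches = CEP.pattern(logs, pattern);
        matches.select(new PatternSelectFunction<String, String>() {
            @Override
            public String select(Map<String, List<String>> match) {
                return "ALERT: " + match.get("first").get(0) + " then " + match.get("second").get(0);
            }
        }).print();

        env.execute("CEP Error Pair Alert");
    }
}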
Execution Stages
- StreamGraph: Logical representation of the application's dataflow topology (printable as JSON; see the sketch after this list).
- JobGraph: Optimized version of StreamGraph with merged operators and parallelism info.
- ExecutionGraph: Physical execution plan with parallel tasks and execution edges.
- Physical Execution Graph: Actual tasks running on TaskManagers' slots.
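To inspect the first of these stages for a concrete program, the streaming environment can print its logical dataflow as JSON before the job is submitted. A minimal sketch follows (the socket source and map step are placeholders):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ShowPlan {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.socketTextStream("127.0.0.1", 9000)
           .map(String::toUpperCase)
           .print();
        // Dump the logical dataflow (StreamGraph-level view) as JSON;
        // the output can be pasted into Flink's plan visualizer to inspect the topology
        System.out.println(env.getExecutionPlan());
    }
}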