Fading Coder

One Final Commit for the Last Sprint


Apache Flink vs Spark Streaming: Core Concepts, Deployment, and Word Count Examples

Tech · May 12

Apache Spark and Apache Flink are both widely used big data processing frameworks, but they differ significantly in stream processing architectures. Spark uses Spark Streaming for micro-batch stream processing, while Flink is designed as a true stream processing engine with stream-batch unification capabilities.
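The contrast can be caricatured in plain Java, independent of either framework: a micro-batch engine buffers incoming records and processes each buffer as a small batch, while a record-at-a-time engine handles every element the moment it arrives. The classes below are a conceptual sketch only; the names are invented for illustration and correspond to neither framework's actual internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Conceptual sketch: micro-batching buffers records and flushes them as small batches.
class MicroBatcher {
    private final List<String> buffer = new ArrayList<>();
    private final int batchSize;
    private final Consumer<List<String>> batchHandler;

    MicroBatcher(int batchSize, Consumer<List<String>> batchHandler) {
        this.batchSize = batchSize;
        this.batchHandler = batchHandler;
    }

    // Each record is queued; processing happens only once the batch is full,
    // so per-record latency is bounded below by the batch interval/size.
    void onRecord(String record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            batchHandler.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}

// Conceptual sketch: a record-at-a-time engine processes each element on arrival.
class RecordAtATime {
    private final Consumer<String> recordHandler;

    RecordAtATime(Consumer<String> recordHandler) {
        this.recordHandler = recordHandler;
    }

    void onRecord(String record) {
        recordHandler.accept(record); // no buffering: per-record latency
    }
}
```

In real Spark Streaming the "batch size" is a time interval rather than a count, but the latency trade-off is the same: work waits for the batch boundary instead of being processed immediately.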

Spark vs. Flink Stream Processing

Spark Core Components

  1. Spark SQL: Handles structured data with SQL queries and integrates with RDDs and Hive.
  2. Spark Streaming: Processes real-time data via micro-batches, dividing streams into small time-based chunks for batch computation.
  3. MLlib: Provides machine learning algorithms and tools for model training/evaluation.
  4. GraphX: A graph computing library for analyzing graph data and executing graph algorithms.

Flink Core Components

  1. DataStream API: For real-time stream processing, supporting event time handling, window operations, and state management with high throughput.
  2. DataSet API: For batch processing, similar to Hadoop MapReduce but with richer operators and optimizations.
  3. Stateful Stream Processing: Enables maintaining and managing state during processing, crucial for complex business logic.
  4. Event Time Processing: Handles out-of-order events and ensures accurate window computation results.
  5. Table API/SQL: Allows SQL-like syntax for querying and analyzing both stream and batch data.
  6. Ecosystem Integration: Connects to Kafka, Elasticsearch, JDBC, HDFS, and Amazon S3.
  7. Cluster Deployment: Runs on Kubernetes, YARN, Mesos, or Standalone clusters.
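Of these, event time processing is the least intuitive. The core idea behind Flink's bounded-out-of-orderness watermarking can be sketched in plain Java: the watermark trails the highest event timestamp seen so far by a fixed allowed lateness, signalling that no older events are expected. This is a simplified illustration of the concept, not Flink's actual implementation.

```java
// Simplified sketch of bounded-out-of-orderness watermarking.
class BoundedOutOfOrdernessWatermark {
    private final long maxOutOfOrdernessMs; // how late an event may arrive
    private long maxTimestampSeen = Long.MIN_VALUE;

    BoundedOutOfOrdernessWatermark(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    // Called for every event; remembers the largest event timestamp observed.
    // An out-of-order (older) event never moves the watermark backwards.
    void onEvent(long eventTimestampMs) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestampMs);
    }

    // Current watermark: an event-time window [0, 10000) may close
    // once the watermark passes 10000, because nothing older is expected.
    long currentWatermark() {
        return maxTimestampSeen - maxOutOfOrdernessMs;
    }
}
```

With this mechanism, a window fires only when the watermark passes its end, which is how Flink produces correct results even when events arrive out of order.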

Flink Deployment

Standalone Cluster

Standalone is a simple deployment mode for development/testing. Key steps:

  1. Download Flink from Apache Flink's official website or mirrors like Aliyun.
  2. Extract the archive and modify conf/flink-conf.yaml (e.g., jobmanager.rpc.address, taskmanager.numberOfTaskSlots).
  3. Start the cluster with ./bin/start-cluster.sh.
  4. Submit jobs using ./bin/flink run -c com.example.MainClass ./path/to/app.jar.
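The conf/flink-conf.yaml entries mentioned in step 2 might look like the following; the hostname and values are placeholders to adapt to your own cluster:

```yaml
# conf/flink-conf.yaml (excerpt; values are examples)
jobmanager.rpc.address: master-node      # host running the JobManager
jobmanager.rpc.port: 6123
taskmanager.numberOfTaskSlots: 2         # parallel task slots per TaskManager
parallelism.default: 1                   # default job parallelism
```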

Docker Deployment

Use Docker Compose for quick setup:

  1. Create a flink directory.
  2. Write a docker-compose.yml file to define JobManager and TaskManager services.
  3. Start the cluster:
    docker-compose up -d
    
  4. Access the web UI at http://<server-ip>:8081/.
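A minimal docker-compose.yml for step 2 could look like this (image tag chosen to match the 1.7.2 examples below; adjust versions and resources as needed):

```yaml
version: "2.1"
services:
  jobmanager:
    image: flink:1.7.2
    command: jobmanager
    ports:
      - "8081:8081"                       # Flink web UI
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager
  taskmanager:
    image: flink:1.7.2
    command: taskmanager
    depends_on:
      - jobmanager
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager
```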

Flink Quick Start: Word Count

Batch Processing Example

Requirements: Count word occurrences in a file and output to another file.

Dependencies:

<!-- Flink core -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>1.7.2</version>
</dependency>
<!-- Flink stream processing (provided scope for cluster execution) -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.12</artifactId>
    <version>1.7.2</version>
    <scope>provided</scope>
</dependency>

Java Code:

package com.example.wordcount.batch;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class BatchWordCounter {
    public static void main(String[] args) throws Exception {
        String inputFile = "D:\\data\\input\\sample.txt";
        String outputFile = "D:\\data\\output\\word_counts.csv";

        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> rawData = env.readTextFile(inputFile);

        DataSet<Tuple2<String, Integer>> wordPairs = rawData.flatMap(new WordSplitter());
        DataSet<Tuple2<String, Integer>> counts = wordPairs.groupBy(0).sum(1);

        counts.writeAsCsv(outputFile, "\n", " ").setParallelism(1);
        env.execute();
    }

    static class WordSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
            String[] words = line.split("\\s+");
            for (String word : words) {
                out.collect(new Tuple2<>(word, 1));
            }
        }
    }
}

Stream Processing Example

Requirements: Use a socket as a real-time data source (via nc), maintain a running count per word, and print the updated totals as lines arrive.
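Before starting the job, open the socket source with netcat; the port must match the one in the code (on systems without nc, an equivalent such as ncat works the same way):

```shell
# -l: listen, -k: keep the socket open across connections
nc -lk 9000
```

Lines typed into this terminal are then pushed to the running Flink job.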

Java Code:

package com.example.wordcount.stream;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamWordCounter {
    public static void main(String[] args) throws Exception {
        String host = "127.0.0.1";
        int port = 9000;

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> socketStream = env.socketTextStream(host, port, "\n");

        SingleOutputStreamOperator<Tuple2<String, Long>> wordCounts = socketStream
            .flatMap(new StreamWordSplitter())
            .keyBy(0)
            .sum(1);

        wordCounts.print();
        env.execute("Real-time Word Count");
    }

    static class StreamWordSplitter implements FlatMapFunction<String, Tuple2<String, Long>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Long>> out) throws Exception {
            String[] words = line.split("\\s+");
            for (String word : words) {
                out.collect(Tuple2.of(word, 1L));
            }
        }
    }
}

Flink Architecture

Job Submission Process

  1. Initialization: Start Master and TaskManager processes via bin/start-cluster.sh.
  2. Resource Registration: TaskManagers register their available slots with the ResourceManager.
  3. Client Submission: User submits the job via the Flink client, which generates a JobGraph.
  4. Dispatcher & JobManager: Dispatcher receives the JobGraph and starts a JobManager for the job.
  5. Resource Allocation: JobManager requests resources from ResourceManager, which allocates idle slots.
  6. Task Deployment: JobManager deploys tasks to TaskManagers' slots for execution.

Core Components

  1. Client: Compiles the application, generates the JobGraph, and submits it to the Dispatcher.
  2. Dispatcher: Accepts job submissions and starts JobManagers for each application.
  3. JobManager: Coordinates job execution, manages task status, and handles fault recovery.
  4. ResourceManager: Manages cluster resources (slots) and allocates them to JobManagers.
  5. TaskManager: Executes actual computation tasks, manages slots, and communicates with other TaskManagers.

Component Stack

  1. Deployment Layer: Supports Local (SingleJVM/SingleNode), Cluster (Standalone/YARN/Kubernetes), and Cloud (AWS/GCP/AliCloud) modes.
  2. Runtime Layer: Handles distributed execution, fault tolerance (checkpoints), and task scheduling.
  3. API Layer: Provides DataStream API (stream) and DataSet API (batch) for building applications.
  4. Higher-Level Tools: CEP (Complex Event Processing), Gelly (Graph Computing), Table API/SQL, and PyFlink (Python support).

Execution Stages

  • StreamGraph: Logical representation of the application's dataflow topology.
  • JobGraph: Optimized version of StreamGraph with merged operators and parallelism info.
  • ExecutionGraph: Physical execution plan with parallel tasks and execution edges.
  • Physical Execution Graph: Actual tasks running on TaskManagers' slots.
