Flink Java Development Environment Setup
Flink is considered one of the top tools in the big data field and is now an Apache top-level project.
This article walks through setting up a development environment; it is not intended for production use.
I. Flink Overview
Note: The following overview was generated by Microsoft Edge's Copilot and lightly reorganized by the author.
🧬 Origins and Development of Flink
Apache Flink originated from a research project called Stratosphere at the Technical University of Berlin in Germany in 2009, which was a distributed data processing platform research initiative funded by the German Research Foundation. In 2014, the project was donated to the Apache Foundation and renamed Flink, which means "agile and fast" in German.
- March 2014: Became an Apache incubator project
- December 2014: Promoted to an Apache top-level project
- Since 2015: Alibaba widely adopted Flink and developed its internal version Blink
- 2019: Alibaba open-sourced Blink and merged it into the main Flink branch, significantly enhancing Flink's SQL capabilities and performance
🚀 Core Features of Flink
Apache Flink is a unified stream and batch processing engine with the following key capabilities:
- Native Stream Processing (Stream-first): Flink was designed for stream processing and supports low-latency processing of unbounded data streams.
- Unified Runtime (Batch and Stream): Use the same API and execution engine to process both batch and stream data, simplifying development.
- Event Time and Out-of-Order Handling: Supports event-time semantics and watermark mechanisms to handle out-of-order data (see the sketch after this list).
- Stateful Processing: Supports large-scale stateful computation, with state that can be persisted and fault-tolerant.
- Exactly-once Semantics: Achieves end-to-end exactly-once processing using Checkpoint and Savepoint.
- High Scalability and Fault Tolerance: Supports large-scale distributed deployment and automatically recovers failed tasks.
- Multi-language Support: Supports Java, Scala, Python, SQL, and more.
- Rich Connector Ecosystem: Includes built-in connectors for Kafka, MySQL, HDFS, Elasticsearch, Pulsar, and more.
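To make the event-time, watermark, and window features concrete, here is a minimal, hedged sketch (my own illustration, not code from the Flink docs). It counts events per key in 10-second event-time windows while tolerating 5 seconds of out-of-order data; the socket source and the `key,timestampMillis` input format are assumptions for the example:

```java
// Hedged sketch: per-key counts in 10-second event-time windows,
// accepting events that arrive up to 5 seconds late.
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;

public class EventTimeWindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)            // feed with: nc -lk 9999
           // parse "key,timestampMillis" into (key, timestamp)
           .map(line -> Tuple2.of(line.split(",")[0], Long.parseLong(line.split(",")[1])))
           .returns(Types.TUPLE(Types.STRING, Types.LONG))
           // event time + watermarks: tolerate 5 s of out-of-order data
           .assignTimestampsAndWatermarks(
                   WatermarkStrategy
                           .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                           .withTimestampAssigner((event, ts) -> event.f1))
           // replace the timestamp field with a count of 1, then sum per window
           .map(event -> Tuple2.of(event.f0, 1L))
           .returns(Types.TUPLE(Types.STRING, Types.LONG))
           .keyBy(event -> event.f0)
           .window(TumblingEventTimeWindows.of(Duration.ofSeconds(10)))
           .sum(1)
           .print();

        env.execute("event-time window sketch");
    }
}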
🌐 Flink's Ecosystem and Use Cases
Flink has become the de facto standard for real-time data processing globally, widely used in:
- Real-time data warehouses (such as Alibaba's real-time data warehouse)
- Real-time monitoring and alert systems
- Real-time recommendation and personalization services
- Real-time ETL and data integration (such as Flink CDC)
- IoT data stream processing
- Financial risk control and transaction monitoring
In addition, Flink has several sub-projects, such as:
- Flink Table API / SQL: Used for declarative stream and batch development
- Flink CDC: Used for real-time data synchronization
- Apache Paimon (formerly Flink Table Store): A key component for building a streaming data lake
🧭 Current and Future
As of 2025, Flink has reached version 2.0.0, marking a new phase in its architecture.
Flink 2.0 introduced stronger cloud-native capabilities (such as disaggregated state storage), more flexible scheduling mechanisms, and more unified stream-batch semantics, further solidifying its position in the real-time computing domain.
II. Why Choose Flink
The company has undertaken many government and enterprise projects involving IoT data processing, such as collecting data from fire hydrants, watches, and various other sensors.
Although the data volume isn't particularly large, most of these projects require real-time monitoring.
In the past, the approach was usually simple: data was first written to a database and then served to the frontend via WebSocket or a REST API; any additional computation made things even slower.
This approach had significant limitations: it could not handle large data volumes, it could not deliver low latency, and it required a lot of code.
The biggest problem was that in certain scenarios the required timeliness was unsatisfactory or outright hard to meet.
So, is there a tool that can solve these issues?
Could we build one ourselves on a framework like Netty? Possibly, but there is no compelling need: why not use an existing solution?
Flink can solve these issues in most scenarios:
- Supports stream computing and real-time computing
- Supports dynamic scaling
- Supports Java language
- Written in Java, so it runs on a variety of operating systems, including domestic (Chinese) platforms
Additionally, it handles stream and batch processing in a unified way.
Let's compare Flink with other similar frameworks.
📊 Comparison of Apache Flink with Other Major Big Data Processing Frameworks (Multi-dimensional)
| Dimension | Apache Flink | Apache Spark | Apache Storm | Kafka Streams | Google Dataflow |
| --- | --- | --- | --- | --- | --- |
| Processing Model | Native stream processing, supports batch (stream-batch unification) | Primarily batch, supports micro-batch stream processing | Native stream processing | Native stream processing, dependent on Kafka | Stream-batch unification (Beam model) |
| Latency | Millisecond-level low latency | Second-level (micro-batch) | Millisecond-level | Millisecond-level | Millisecond-level |
| Throughput | High, suitable for large-scale data streams | High, suitable for batch computing | Medium | Medium | High |
| State Management | Strong built-in state management (supports RocksDB) | Weaker state management | No built-in state | Lightweight in-memory state support | Supported (Beam State API) |
| Failure Recovery | Exactly-once, based on checkpoints | Exactly-once (Structured Streaming) | At-least-once | At-least-once | Exactly-once |
| Event Time Support | ✅ Strong (watermarks + event-time semantics) | ✅ Supported (complex logic) | ❌ No event time | ❌ No event time | ✅ Native support |
| Window Mechanism | Tumbling, sliding, session, and custom windows | Tumbling and sliding windows | Basic window support | Tumbling and sliding windows | Various window types |
| Language Support | Java, Scala, Python, SQL | Java, Scala, Python, R, SQL | Java, Clojure | Java, Scala | Java, Python, Go |
| Deployment Complexity | Moderate (requires JobManager, TaskManager) | Moderate (requires Spark cluster) | Simple (lightweight) | Embedded in the application | Cloud-managed (GCP) |
| Ecosystem | Flink SQL, CDC, Paimon, StateFun | Spark SQL, MLlib, GraphX | Limited components | Tightly coupled to Kafka | GCP ecosystem integration |
| Typical Scenarios | Real-time data warehouses, financial risk control, IoT, CEP | Offline analysis, batch processing, ML | Log monitoring, low-latency computing | Lightweight stream logic in Kafka applications | Cloud-native ETL pipelines |
✨ Summary Highlights
- Apache Flink: Leading stream processing capabilities, strong state support, and exactly-once semantics; suitable for real-time analysis and complex computations.
- Apache Spark: Benchmark for batch processing, suitable for large-scale offline tasks and big data modeling.
- Apache Storm: Ultra-low latency, but limited functionality, unsuitable for complex data stream scenarios.
- Kafka Streams: Suitable for light-weight stream logic in Kafka consumers, easy to deploy.
- Google Dataflow: Cloud-native, concise, and powerful, suitable for fully managed data pipelines.
III. Setting Up a Simple Development Environment
Note that although version 2.0 is available, the 1.20.x line will continue to be supported for a long time.
Here we set up a relatively simple environment for development.
There are two options to choose from:
- Do not install Flink: even without a Flink server, you can use a local mini cluster for basic development and testing (see the sketch after this list). However, managing and submitting jobs is less convenient.
- Install Flink: lets you experience a production-like environment in terms of performance and concurrency, and makes submitting and managing jobs intuitive.
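For reference, here is a minimal sketch of the first option (my own example, assuming the flink-clients dependency from the pom in section 3.2 is on the classpath). Run from the IDE, getExecutionEnvironment() transparently starts an embedded mini cluster; the same jar submitted with `flink run` uses a real cluster instead:

```java
// Hedged sketch: smoke-test a job on the embedded mini cluster.
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MiniClusterSmokeTest {
    public static void main(String[] args) throws Exception {
        // In the IDE this starts a local mini cluster; on a real cluster
        // the same code runs unchanged.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("flink", "spring", "flink")
           .map(String::toUpperCase)
           .print();

        env.execute("mini-cluster smoke test");
    }
}
```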
I chose to install Flink, even though I hardly used the installed cluster at first.
3.1 Installing Flink on CentOS
Reference: https://nightlies.apache.org/flink/flink-docs-release-2.0/zh/

### 1. Create a Flink User

Create a dedicated flink user and two directories accessible only by that user. You can also install as root.

### 2. Install JDK 17

Set the JAVA_HOME variable. Details are omitted.

### 3. Modify the Configuration

After extracting Flink, the configuration files are in the conf directory; the main file to edit is config.yaml. Usually three places need changes:

- JAVA_HOME: point Flink at your JDK installation. Be careful with the various opts settings; if you only start locally, the defaults are a reasonable reference.
- jobmanager / rest: bind-host controls which client machines can reach the process, and the rest port is the externally visible port for REST clients and the web UI.
- Logging: Flink defaults to log4j on the server side; if you want different logging behavior, adjust the Log4j configuration here.
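For orientation, here is a hedged sketch of what those entries can look like; the key names follow the options shipped in Flink's default config.yaml, and the values are examples only — adjust them to your environment:

```yaml
# Example values only; check against your own config.yaml.
env:
  java:
    home: /usr/lib/jvm/jdk-17    # point at your JDK 17 installation

jobmanager:
  bind-host: 0.0.0.0             # which interfaces the JobManager listens on

rest:
  bind-address: 0.0.0.0          # example: allow non-local REST clients
  port: 8081                     # external port for the web UI / REST API
```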
### 4. Start and Stop

The scripts are in the bin directory. Starting and stopping is simple and requires no parameters. For example, to start: `./start-cluster.sh`. If startup succeeds, you can access the management UI via a browser. Note: this is only for a development environment; it should never run without authentication in production.

### 5. Test
Reference: "Local mode installation" in the Apache Flink documentation.
Here, test by submitting one of the bundled example jobs directly with the flink command:
$ ./bin/flink run examples/streaming/WordCount.jar
3.2 Creating a Spring Boot Project
Use your IDE to create a JDK 17, Spring Boot 3.x project. Details are omitted.
To test Flink's batch and stream processing, some additional infrastructure is needed:
a. A relational database, such as MySQL, PostgreSQL, etc.
b. An MQTT server, or alternatively an AMQP server. I chose to install Apache Artemis, which supports MQTT, AMQP, STOMP, and other protocols.
Artemis has the advantage of being small, easy to install, and well documented; it is written entirely in Java, so it runs in all kinds of environments, naturally including domestic (Chinese) platforms.
The installation steps are omitted here.
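As a quick way to verify the Artemis install, here is a hedged smoke-test sketch using the Eclipse Paho MQTT v3 client (the library spring-integration-mqtt builds on); the broker URL, credentials, and topic are placeholders:

```java
// Hedged smoke test: publish one message to Artemis over MQTT.
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
import org.eclipse.paho.client.mqttv3.MqttMessage;

public class ArtemisMqttSmokeTest {
    public static void main(String[] args) throws Exception {
        // Artemis exposes MQTT on port 1883 by default.
        MqttClient client = new MqttClient("tcp://localhost:1883", MqttClient.generateClientId());

        MqttConnectOptions options = new MqttConnectOptions();
        options.setUserName("artemis");                  // placeholder credentials
        options.setPassword("artemis".toCharArray());
        client.connect(options);

        // publish a made-up sensor reading
        client.publish("sensors/hydrant-001", new MqttMessage("{\"pressure\":0.35}".getBytes()));

        client.disconnect();
        client.close();
    }
}
```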
Here is a reference pom.xml configuration:
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.lzfto.flink</groupId>
<artifactId>learn-flink-spring</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>learn-flink</name>
<properties>
<lzfto.flinktest.version>0.0.1-SNAPSHOT</lzfto.flinktest.version>
<java.version>17</java.version>
<!--Spring boot -->
<spring.boot.version>3.4.7</spring.boot.version>
<flink.version>1.20.0</flink.version>
<flink-connector.jdbc.version>3.3.0-1.20</flink-connector.jdbc.version>
<mysql-version>8.0.33</mysql-version>
<mysql-j-version>9.3.0</mysql-j-version>
<apache-poi-version>5.4.1</apache-poi-version>
<alibaba-fastjson-version>2.0.57</alibaba-fastjson-version>
<!-- mqtt -->
<mqtt-client.version>6.5.0</mqtt-client.version>
<!-- maven etc. -->
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
</properties>
<dependencyManagement>
<dependencies>
<!-- SpringBoot dependencies -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-dependencies</artifactId>
<version>${spring.boot.version}</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
<dependencies>
<!-- Spring Boot Starter Web -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-jdbc</artifactId>
<!--<artifactId>spring-jdbc</artifactId> -->
</dependency>
<!-- websocket -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-websocket</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-thymeleaf</artifactId>
</dependency>
<!-- Original Flink Dependencies -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-core</artifactId>
<version>${flink.version}</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java</artifactId>
<version>${flink.version}</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-api-java</artifactId>
<version>${flink.version}</version>
<!--<scope>provided</scope> -->
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</exclusion>
</exclusions>
</dependency>
<!-- tableapi bridge, mainly for connecting with DataStream -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-api-java-bridge</artifactId>
<version>${flink.version}</version>
<!--<scope>provided</scope> -->
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</exclusion>
</exclusions>
</dependency>
<!-- ide debug -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-planner-loader</artifactId>
<version>${flink.version}</version>
<!--<scope>provided</scope> -->
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-runtime</artifactId>
<version>${flink.version}</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-test-utils</artifactId>
<version>${flink.version}</version>
</dependency>
<!-- Connectors -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-files</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-csv</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-jdbc</artifactId>
<version>${flink-connector.jdbc.version}</version>
</dependency>
<!-- Execution factory definition -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients</artifactId>
<version>${flink.version}</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</exclusion>
</exclusions>
</dependency>
<!-- Database connection -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>${mysql-version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/com.alibaba.fastjson2/fastjson2 -->
<dependency>
<groupId>com.alibaba.fastjson2</groupId>
<artifactId>fastjson2</artifactId>
<version>${alibaba-fastjson-version}</version>
</dependency>
<!-- mqtt -->
<dependency>
<groupId>org.springframework.integration</groupId>
<artifactId>spring-integration-mqtt</artifactId>
</dependency>
<!-- Apache POI for Excel operations -->
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi</artifactId>
<version>${apache-poi-version}</version> <!-- Please check the latest version -->
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
<version>${apache-poi-version}</version> <!-- Please check the latest version -->
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-configuration-processor</artifactId>
<optional>true</optional>
</dependency>
<!-- Test dependencies -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
<exclusions>
<exclusion>
<groupId>org.junit.vintage</groupId>
<artifactId>junit-vintage-engine</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.junit.jupiter</groupId>
<artifactId>junit-jupiter-api</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.junit.jupiter</groupId>
<artifactId>junit-jupiter-engine</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.13.0</version>
<configuration>
<source>${java.version}</source>
<target>${java.version}</target>
<encoding>${project.build.sourceEncoding}</encoding>
<parameters>true</parameters>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.6.0</version>
<executions>
<!-- Run shade goal on package phase -->
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<artifactSet>
<!-- If you want to exclude some packages -->
<excludes>
<!--<exclude>org.apache.flink:force-shading</exclude> -->
<exclude>com.google.code.findbugs:jsr305</exclude>
</excludes>
</artifactSet>
<filters>
<filter>
<!-- Do not copy the signatures in the META-INF folder. Otherwise,
this might cause SecurityExceptions when using the JAR. -->
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>org.lzfto.flink.demo.DemoApplication</mainClass>
</transformer>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
<resources>
<resource>
<directory>src/main/resources/</directory>
<includes>
<include>**/*</include>
</includes>
</resource>
<resource>
<!-- directory indicates the files in this directory -->
<directory>libs</directory>
<!-- targetPath specifies where to package the files, default is under the class directory -->
<targetPath>/BOOT-INF/lib/</targetPath>
<!-- includes all files matching the format -->
<includes>
<include>**/*.jar</include>
</includes>
</resource>
</resources>
<finalName>lzfto-${lzfto.flinktest.version}</finalName>
</build>
</project>
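With this pom in place, a minimal Table API job can serve as a smoke test that the planner and runtime are wired up. The sketch below is my own illustration, not the code from the repository linked next; the device IDs and readings are made-up sample values:

```java
// Hedged smoke test for the Table API dependencies declared above.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.row;

public class TableApiSmokeTest {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // an in-memory table of (deviceId, pressure) rows
        Table readings = tEnv.fromValues(
                row("hydrant-001", 0.35),
                row("hydrant-002", 0.41))
            .as("deviceId", "pressure");

        // keep only readings above the threshold and print them
        readings.filter($("pressure").isGreater(0.4))
                .execute()
                .print();
    }
}
```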
Finally, I wrote some code for testing, which is located at:
https://gitee.com/lu_zhifei/learn-flink/tree/master/java/first-tableapi-spring