Deploying a Fully Distributed Hadoop Cluster
Hadoop supports several operational modes: Local Mode, Pseudo-Distributed Mode, and Fully Distributed Mode.
- Local Mode: Runs in a single JVM on one machine, using the local filesystem instead of HDFS. Useful for trying the bundled examples and for debugging; not used in production.
- Pseudo-Distributed Mode: Also runs on a single machine, but each Hadoop daemon runs as a separate process, simulating a cluster with all Hadoop functionality. Used by some organizations for testing, not production.
- Fully Distributed Mode: Consists of multiple servers forming a true distributed environment. This is the mode used in production.
Local Mode Execution (WordCount Example)
1) Create a directory for input files
Navigate to your Hadoop installation directory and create a folder named input_data.
$ mkdir input_data
2) Create a text file
Move into the new directory and create a file.
$ cd input_data
$ vim sample.txt
Add the following content to sample.txt:
data processing
hadoop mapreduce
big data
big data
Save and exit the editor (:wq in vim).
3) Execute the WordCount program
Return to the Hadoop root directory and run the example JAR. Note that the output directory (result_output) must not already exist; Hadoop refuses to overwrite it.
$ cd /opt/hadoop-3.3.0
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount input_data result_output
4) View the results
$ cat result_output/part-r-00000
The output should be:
big 2
data 2
hadoop 1
mapreduce 1
processing 1
Fully Distributed Mode Deployment
This is the primary setup for development and production. The process involves:
- Preparing multiple client machines (configure firewall, static IP, hostname).
- Installing the Java Development Kit (JDK).
- Setting up environment variables.
- Installing Hadoop.
- Configuring the Hadoop cluster.
- Starting services individually.
- Configuring SSH for password-less access.
- Starting the cluster collectively and testing.
Preparing the Environment
Ensure all machines have the necessary directories and correct permissions.
$ sudo chown hadoopuser:hadoopgroup -R /opt/hadoop
File Distribution with scp and rsync
Using scp (Secure Copy)
Copy the JDK from the primary node (node01) to a worker node (node02).
# From node01
$ scp -r /opt/jdk-11.0.15 hadoopuser@node02:/opt/
Copy the Hadoop installation from node01 to node03.
# From node03
$ scp -r hadoopuser@node01:/opt/hadoop-3.3.0 /opt/
Using rsync for Efficient Synchronization
rsync transfers only the differences between source and destination, so repeated synchronizations are much faster than with scp; the initial full copy takes roughly the same time. The -a flag preserves permissions and timestamps, and -v prints what is transferred.
# Delete a test directory on node02
$ rm -rf /opt/hadoop-3.3.0/test_input/
# Synchronize the Hadoop directory from node01 to node02
$ rsync -av /opt/hadoop-3.3.0/ hadoopuser@node02:/opt/hadoop-3.3.0/
Creating a Cluster Distribution Script
A custom script (cluster_sync) can automate file distribution across all nodes.
1) Create the script
Create a file named cluster_sync in a directory within your PATH, like ~/bin/.
$ mkdir -p ~/bin
$ cd ~/bin
$ vim cluster_sync
Add the following script:
#!/bin/bash

# Check if at least one argument is provided
if [ $# -lt 1 ]; then
    echo "Error: Please specify files or directories to sync."
    exit 1
fi

# Define cluster nodes
NODES=("node01" "node02" "node03")

for NODE in "${NODES[@]}"; do
    echo "======== Syncing to $NODE ========"
    for ITEM in "$@"; do
        if [ -e "$ITEM" ]; then
            PARENT_DIR=$(cd -P "$(dirname "$ITEM")" && pwd)
            ITEM_NAME=$(basename "$ITEM")
            ssh "$NODE" "mkdir -p '$PARENT_DIR'"
            rsync -av "$PARENT_DIR/$ITEM_NAME" "$NODE:$PARENT_DIR/"
        else
            echo "Warning: '$ITEM' does not exist and will be skipped."
        fi
    done
done
2) Make the script executable and test it
$ chmod +x ~/bin/cluster_sync
$ cluster_sync ~/bin/cluster_sync
Configuring SSH Password-less Login
1) Generate SSH keys
On the primary node (node01), generate an RSA key pair. Press Enter for all prompts.
$ ssh-keygen -t rsa
This creates id_rsa (private key) and id_rsa.pub (public key) in ~/.ssh/.
2) Copy the public key to all nodes
Use ssh-copy-id to install the public key on node01, node02, and node03.
$ ssh-copy-id node01
$ ssh-copy-id node02
$ ssh-copy-id node03
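To confirm the keys were installed correctly, a quick loop over the three hostnames should print each remote hostname without any password prompt (a sketch, assuming the node names used in this guide):

```shell
# Verify password-less SSH to every node; BatchMode makes ssh fail
# immediately instead of prompting if key authentication is not set up.
for node in node01 node02 node03; do
  ssh -o BatchMode=yes "$node" hostname || echo "Key login to $node failed"
done
```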
Note: ssh-copy-id grants only one-way access from the node where the keys were generated. Repeat the key generation and distribution for the hadoopuser account on node02 and node03 so that every node can log in to every other without a password, and do the same for the root account on node01 if root needs to run the cluster scripts.
Cluster Configuration Planning
Distribute Hadoop services across nodes to balance load. Avoid placing the NameNode and SecondaryNameNode on the same server.
| Node | HDFS Daemons | YARN Daemons |
|---|---|---|
| node01 | NameNode, DataNode | NodeManager |
| node02 | DataNode | ResourceManager, NodeManager |
| node03 | SecondaryNameNode, DataNode | NodeManager |
Configuring Hadoop Files
Hadoop uses XML configuration files located in $HADOOP_HOME/etc/hadoop/.
1) Core Configuration (core-site.xml)
Define the default filesystem and Hadoop data directory.
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://node01:8020</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop-3.3.0/data</value>
</property>
<property>
<name>hadoop.http.staticuser.user</name>
<value>hadoopuser</value>
</property>
</configuration>
2) HDFS Configuration (hdfs-site.xml)
Configure web UI addresses for the NameNode and SecondaryNameNode.
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>dfs.namenode.http-address</name>
<value>node01:9870</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>node03:9868</value>
</property>
</configuration>
3) YARN Configuration (yarn-site.xml)
Specify the shuffle service and the ResourceManager's host.
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node02</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
4) MapReduce Configuration (mapred-site.xml)
Instruct MapReduce to run on the YARN framework.
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
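Depending on how environment variables reach the YARN containers, MapReduce jobs on Hadoop 3.x can fail with "Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster". If that happens, the Hadoop documentation suggests also declaring HADOOP_MAPRED_HOME inside the `<configuration>` element of mapred-site.xml (path shown for this guide's install location):

```xml
<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=/opt/hadoop-3.3.0</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=/opt/hadoop-3.3.0</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=/opt/hadoop-3.3.0</value>
</property>
```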
5) Distribute Configuration Files
Use the cluster_sync script to send the updated configurations to all nodes.
$ cluster_sync /opt/hadoop-3.3.0/etc/hadoop/
Starting the Cluster
1) Configure Worker Nodes
Edit the workers file on node01 to list all DataNode and NodeManager hosts.
$ vim /opt/hadoop-3.3.0/etc/hadoop/workers
Add:
node01
node02
node03
Ensure no trailing spaces or blank lines. Sync this file.
$ cluster_sync /opt/hadoop-3.3.0/etc
2) Format the HDFS NameNode (First-Time Setup Only)
Warning: Formatting generates a new cluster ID. If you must re-format, first stop all services and delete the data and logs directories on every node; otherwise the DataNodes retain the old cluster ID and will fail to register with the newly formatted NameNode.
$ hdfs namenode -format
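If a re-format is ever needed, the cleanup described above can be scripted. This is a sketch that assumes the node names used in this guide and the data/logs locations under /opt/hadoop-3.3.0:

```shell
# Run only after stopping all services: wipe NameNode/DataNode state and
# logs on every node so no stale cluster ID survives the re-format.
for node in node01 node02 node03; do
  ssh "$node" "rm -rf /opt/hadoop-3.3.0/data /opt/hadoop-3.3.0/logs"
done
hdfs namenode -format
```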
3) Start HDFS and YARN Services
Start HDFS from the NameNode host (node01).
$ start-dfs.sh
Start YARN from the ResourceManager host (node02).
$ start-yarn.sh
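After both scripts finish, running jps on each node should show the daemons from the planning table. A quick check (hostnames as above):

```shell
# Expected Java processes, per the planning table:
#   node01: NameNode, DataNode, NodeManager
#   node02: DataNode, ResourceManager, NodeManager
#   node03: SecondaryNameNode, DataNode, NodeManager
for node in node01 node02 node03; do
  echo "--- $node ---"
  ssh "$node" jps
done
```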
4) Access Web Interfaces
- HDFS NameNode UI: http://node01:9870
- YARN ResourceManager UI: http://node02:8088
Basic Cluster Operations
1) HDFS File Operations
Create a directory and upload a file.
$ hadoop fs -mkdir /user_input
$ hadoop fs -put $HADOOP_HOME/input_data/sample.txt /user_input
2) Running a MapReduce Job
Execute the WordCount example on the cluster.
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount /user_input /job_output
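As with the local run, the output directory (/job_output) must not already exist. Once the job completes, the results can be read back from HDFS:

```shell
# List the job output and print the reducer's result file.
hadoop fs -ls /job_output
hadoop fs -cat /job_output/part-r-00000
```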
Configuring Auxiliary Services
1) Job History Server
Add the following properties to mapred-site.xml on node01 to enable the history server.
<property>
<name>mapreduce.jobhistory.address</name>
<value>node01:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>node01:19888</value>
</property>
Sync the configuration and start the server.
$ cluster_sync $HADOOP_HOME/etc/hadoop/mapred-site.xml
$ mapred --daemon start historyserver
Access the history server UI at http://node01:19888.
2) Log Aggregation
To collect application logs in HDFS, add these properties to yarn-site.xml.
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://node01:19888/jobhistory/logs</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
After syncing the file, restart YARN and the HistoryServer for the changes to take effect.
# On node02
$ stop-yarn.sh
# On node01
$ mapred --daemon stop historyserver
# Then start them again
$ start-yarn.sh
$ mapred --daemon start historyserver
Cluster Management Scripts
1) Unified Cluster Control Script
Create a script cluster_control.sh to start/stop all services.
#!/bin/bash

case $1 in
"start")
    echo "Starting HDFS..."
    ssh node01 "$HADOOP_HOME/sbin/start-dfs.sh"
    echo "Starting YARN..."
    ssh node02 "$HADOOP_HOME/sbin/start-yarn.sh"
    echo "Starting History Server..."
    ssh node01 "$HADOOP_HOME/bin/mapred --daemon start historyserver"
    ;;
"stop")
    echo "Stopping History Server..."
    ssh node01 "$HADOOP_HOME/bin/mapred --daemon stop historyserver"
    echo "Stopping YARN..."
    ssh node02 "$HADOOP_HOME/sbin/stop-yarn.sh"
    echo "Stopping HDFS..."
    ssh node01 "$HADOOP_HOME/sbin/stop-dfs.sh"
    ;;
*)
    echo "Usage: $0 {start|stop}"
    ;;
esac
2) Cluster-Wide Process Check Script
Create check_jps.sh to view Java processes on all nodes.
#!/bin/bash

for host in node01 node02 node03; do
    echo "--- $host ---"
    ssh "$host" jps
done
Make scripts executable and distribute them.
$ chmod +x ~/bin/cluster_control.sh ~/bin/check_jps.sh
$ cluster_sync ~/bin/
Key Network Ports in Hadoop 3.x
| Service | Default Port |
|---|---|
| NameNode RPC | 8020 / 9000 / 9820 |
| NameNode HTTP UI | 9870 |
| YARN ResourceManager UI | 8088 |
| MapReduce JobHistory Server | 19888 |
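If the nodes run a host firewall rather than having it disabled, these ports must be reachable between them. A sketch for firewalld (assuming that is the firewall in use; adjust for your distribution), covering the ports used in this guide:

```shell
# Open the Hadoop ports used above (run on each node as root).
for port in 8020 9870 9868 8088 10020 19888; do
  firewall-cmd --permanent --add-port=${port}/tcp
done
firewall-cmd --reload
```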