Deploying a Fully Distributed Hadoop Cluster
Hadoop supports several operational modes: Local Mode, Pseudo-Distributed Mode, and Fully Distributed Mode.
- Local Mode: Runs in a single JVM on one machine, using the local filesystem instead of HDFS. Useful for trying the bundled examples and for debugging; not used in production.
- Pseudo-Distributed Mode: Also runs on a single machine, but each Hadoop daemon runs as a separate process, simulating a cluster with all Hadoop functionality. Used by some organizations for testing, not production.
- Fully Distributed Mode: Consists of multiple servers forming a true distributed environment. This is the mode used in production.
Local Mode Execution (WordCount Example)
1) Create a directory for input files
Navigate to your Hadoop installation directory and create a folder named input_data.
$ mkdir input_data
2) Create a text file
Move into the new directory and create a file.
$ cd input_data
$ vim sample.txt
Add the following content to sample.txt:
data processing
hadoop mapreduce
big data
big data
Save and exit the editor (:wq in vim).
3) Execute the WordCount program
Return to the Hadoop root directory and run the example JAR. Note that the output directory (result_output) must not already exist; Hadoop refuses to overwrite it.
$ cd /opt/hadoop-3.3.0
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount input_data result_output
4) View the results
$ cat result_output/part-r-00000
The output should be:
big 2
data 2
hadoop 1
mapreduce 1
processing 1
Fully Distributed Mode Deployment
This is the primary setup for development and production. The process involves:
- Preparing multiple client machines (configure firewall, static IP, hostname).
- Installing the Java Development Kit (JDK).
- Setting up environment variables.
- Installing Hadoop.
- Configuring the Hadoop cluster.
- Starting services individually.
- Configuring SSH for password-less access.
- Starting the cluster collectively and testing.
Preparing the Environment
Ensure all machines have the necessary directories and correct permissions.
$ sudo chown hadoopuser:hadoopgroup -R /opt/hadoop
File Distribution with scp and rsync
Using scp (Secure Copy)
Copy the JDK from the primary node (node01) to a worker node (node02).
# From node01
$ scp -r /opt/jdk-11.0.15 hadoopuser@node02:/opt/
Copy the Hadoop installation from node01 to node03.
# From node03
$ scp -r hadoopuser@node01:/opt/hadoop-3.3.0 /opt/
Using rsync for Efficient Synchronization
rsync transfers only the differences between source and destination, so repeated synchronizations are much faster than with scp; the initial full copy takes roughly the same time. The -a flag preserves permissions and timestamps, and -v prints what is transferred.
# Delete a test directory on node02
$ rm -rf /opt/hadoop-3.3.0/test_input/
# Synchronize the Hadoop directory from node01 to node02
$ rsync -av /opt/hadoop-3.3.0/ hadoopuser@node02:/opt/hadoop-3.3.0/
Creating a Cluster Distribution Script
A custom script (cluster_sync) can automate file distribution across all nodes.
1) Create the script
Create a file named cluster_sync in a directory within your PATH, like ~/bin/.
$ mkdir -p ~/bin
$ cd ~/bin
$ vim cluster_sync
Add the following script:
#!/bin/bash

# Check if at least one argument is provided
if [ $# -lt 1 ]; then
    echo "Error: Please specify files or directories to sync."
    exit 1
fi

# Define cluster nodes
NODES=("node01" "node02" "node03")

for NODE in "${NODES[@]}"; do
    echo "======== Syncing to $NODE ========"
    for ITEM in "$@"; do
        if [ -e "$ITEM" ]; then
            PARENT_DIR=$(cd -P "$(dirname "$ITEM")" && pwd)
            ITEM_NAME=$(basename "$ITEM")
            ssh "$NODE" "mkdir -p '$PARENT_DIR'"
            rsync -av "$PARENT_DIR/$ITEM_NAME" "$NODE:$PARENT_DIR/"
        else
            echo "Warning: '$ITEM' does not exist and will be skipped."
        fi
    done
done
2) Make the script executable and test it
$ chmod +x ~/bin/cluster_sync
$ cluster_sync ~/bin/cluster_sync
Configuring SSH Password-less Login
1) Generate SSH keys
On the primary node (node01), generate an RSA key pair. Press Enter for all prompts.
$ ssh-keygen -t rsa
This creates id_rsa (private key) and id_rsa.pub (public key) in ~/.ssh/.
2) Copy the public key to all nodes
Use ssh-copy-id to install the public key on node01, node02, and node03.
$ ssh-copy-id node01
$ ssh-copy-id node02
$ ssh-copy-id node03
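To confirm the keys were installed correctly, a quick loop over the three hostnames should print each remote hostname without any password prompt (a sketch, assuming the node names used in this guide):

```shell
# Verify password-less SSH to every node; BatchMode makes ssh fail
# immediately instead of prompting if key authentication is not set up.
for node in node01 node02 node03; do
  ssh -o BatchMode=yes "$node" hostname || echo "Key login to $node failed"
done
```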
Note: ssh-copy-id grants only one-way access from the node where the keys were generated. Repeat the key generation and distribution for the hadoopuser account on node02 and node03 so that every node can log in to every other without a password, and do the same for the root account on node01 if root needs to run the cluster scripts.
Cluster Configuration Planning
Distribute Hadoop services across nodes to balance load. Avoid placing the NameNode and SecondaryNameNode on the same server.
| Node | HDFS Daemons | YARN Daemons |
|---|---|---|
| node01 | NameNode, DataNode | NodeManager |
| node02 | DataNode | ResourceManager, NodeManager |
| node03 | SecondaryNameNode, DataNode | NodeManager |
Configuring Hadoop Files
Hadoop uses XML configuration files located in $HADOOP_HOME/etc/hadoop/.
1) Core Configuration (core-site.xml)
Define the default filesystem and Hadoop data directory.
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://node01:8020</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop-3.3.0/data</value>
</property>
<property>
<name>hadoop.http.staticuser.user</name>
<value>hadoopuser</value>
</property>
</configuration>
2) HDFS Configuration (hdfs-site.xml)
Configure web UI addresses for the NameNode and SecondaryNameNode.
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>dfs.namenode.http-address</name>
<value>node01:9870</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>node03:9868</value>
</property>
</configuration>
3) YARN Configuration (yarn-site.xml)
Specify the shuffle service and the ResourceManager's host.
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node02</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
4) MapReduce Configuration (mapred-site.xml)
Instruct MapReduce to run on the YARN framework.
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
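Depending on how environment variables reach the YARN containers, MapReduce jobs on Hadoop 3.x can fail with "Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster". If that happens, the Hadoop documentation suggests also declaring HADOOP_MAPRED_HOME inside the `<configuration>` element of mapred-site.xml (path shown for this guide's install location):

```xml
<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=/opt/hadoop-3.3.0</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=/opt/hadoop-3.3.0</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=/opt/hadoop-3.3.0</value>
</property>
```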
5) Distribute Configuration Files
Use the cluster_sync script to send the updated configurations to all nodes.
$ cluster_sync /opt/hadoop-3.3.0/etc/hadoop/
Starting the Cluster
1) Configure Worker Nodes
Edit the workers file on node01 to list all DataNode and NodeManager hosts.
$ vim /opt/hadoop-3.3.0/etc/hadoop/workers
Add:
node01
node02
node03
Ensure no trailing spaces or blank lines. Sync this file.
$ cluster_sync /opt/hadoop-3.3.0/etc
2) Format the HDFS NameNode (First-Time Setup Only)
Warning: Formatting generates a new cluster ID. If you must re-format, first stop all services and delete the data and logs directories on every node; otherwise the DataNodes retain the old cluster ID and will fail to register with the newly formatted NameNode.
$ hdfs namenode -format
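If a re-format is ever needed, the cleanup described above can be scripted. This is a sketch that assumes the node names used in this guide and the data/logs locations under /opt/hadoop-3.3.0:

```shell
# Run only after stopping all services: wipe NameNode/DataNode state and
# logs on every node so no stale cluster ID survives the re-format.
for node in node01 node02 node03; do
  ssh "$node" "rm -rf /opt/hadoop-3.3.0/data /opt/hadoop-3.3.0/logs"
done
hdfs namenode -format
```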
3) Start HDFS and YARN Services
Start HDFS from the NameNode host (node01).
$ start-dfs.sh
Start YARN from the ResourceManager host (node02).
$ start-yarn.sh
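After both scripts finish, running jps on each node should show the daemons from the planning table. A quick check (hostnames as above):

```shell
# Expected Java processes, per the planning table:
#   node01: NameNode, DataNode, NodeManager
#   node02: DataNode, ResourceManager, NodeManager
#   node03: SecondaryNameNode, DataNode, NodeManager
for node in node01 node02 node03; do
  echo "--- $node ---"
  ssh "$node" jps
done
```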
4) Access Web Interfaces
- HDFS NameNode UI: http://node01:9870
- YARN ResourceManager UI: http://node02:8088
Basic Cluster Operations
1) HDFS File Operations
Create a directory and upload a file.
$ hadoop fs -mkdir /user_input
$ hadoop fs -put $HADOOP_HOME/input_data/sample.txt /user_input
2) Running a MapReduce Job
Execute the WordCount example on the cluster.
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount /user_input /job_output
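As with the local run, the output directory (/job_output) must not already exist. Once the job completes, the results can be read back from HDFS:

```shell
# List the job output and print the reducer's result file.
hadoop fs -ls /job_output
hadoop fs -cat /job_output/part-r-00000
```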
Configuring Auxiliary Services
1) Job History Server
Add the following properties to mapred-site.xml on node01 to enable the history server.
<property>
<name>mapreduce.jobhistory.address</name>
<value>node01:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>node01:19888</value>
</property>
Sync the configuration and start the server.
$ cluster_sync $HADOOP_HOME/etc/hadoop/mapred-site.xml
$ mapred --daemon start historyserver
Access the history server UI at http://node01:19888.
2) Log Aggregation
To collect application logs in HDFS, add these properties to yarn-site.xml.
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://node01:19888/jobhistory/logs</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
After syncing the file, restart YARN and the HistoryServer for the changes to take effect.
# On node02
$ stop-yarn.sh
# On node01
$ mapred --daemon stop historyserver
# Then start them again
$ start-yarn.sh
$ mapred --daemon start historyserver
Cluster Management Scripts
1) Unified Cluster Control Script
Create a script cluster_control.sh to start/stop all services.
#!/bin/bash

case $1 in
"start")
    echo "Starting HDFS..."
    ssh node01 "$HADOOP_HOME/sbin/start-dfs.sh"
    echo "Starting YARN..."
    ssh node02 "$HADOOP_HOME/sbin/start-yarn.sh"
    echo "Starting History Server..."
    ssh node01 "$HADOOP_HOME/bin/mapred --daemon start historyserver"
    ;;
"stop")
    echo "Stopping History Server..."
    ssh node01 "$HADOOP_HOME/bin/mapred --daemon stop historyserver"
    echo "Stopping YARN..."
    ssh node02 "$HADOOP_HOME/sbin/stop-yarn.sh"
    echo "Stopping HDFS..."
    ssh node01 "$HADOOP_HOME/sbin/stop-dfs.sh"
    ;;
*)
    echo "Usage: $0 {start|stop}"
    ;;
esac
2) Cluster-Wide Process Check Script
Create check_jps.sh to view Java processes on all nodes.
#!/bin/bash

for host in node01 node02 node03; do
    echo "--- $host ---"
    ssh "$host" jps
done
Make scripts executable and distribute them.
$ chmod +x ~/bin/cluster_control.sh ~/bin/check_jps.sh
$ cluster_sync ~/bin/
Key Network Ports in Hadoop 3.x
| Service | Default Port |
|---|---|
| NameNode RPC | 8020 / 9000 / 9820 |
| NameNode HTTP UI | 9870 |
| YARN ResourceManager UI | 8088 |
| MapReduce JobHistory Server | 19888 |
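If the nodes run a host firewall rather than having it disabled, these ports must be reachable between them. A sketch for firewalld (assuming that is the firewall in use; adjust for your distribution), covering the ports used in this guide:

```shell
# Open the Hadoop ports used above (run on each node as root).
for port in 8020 9870 9868 8088 10020 19888; do
  firewall-cmd --permanent --add-port=${port}/tcp
done
firewall-cmd --reload
```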