Fading Coder

One Final Commit for the Last Sprint

Setting Up Hadoop and Spark Clusters for Distributed Machine Learning

Tech May 19 2

Installing the Java Development Kit

Hadoop and Spark both run on the Java Virtual Machine, so a JDK is a fundamental requirement. On Ubuntu systems, install OpenJDK 8:

sudo apt-get update
sudo apt-get install openjdk-8-jdk

Verify the installation by running:

java -version

Record the Java installation path, typically /usr/lib/jvm/java-8-openjdk-amd64. Hadoop reads this path from the JAVA_HOME setting in $HADOOP_HOME/etc/hadoop/hadoop-env.sh.
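The version check can also be scripted. The sketch below is a minimal example that extracts the major version from the text printed by java -version (which goes to stderr). The function name java_major_version and the sample string are illustrative assumptions, not part of the original setup.

```python
import re


def java_major_version(version_output: str) -> int:
    """Return the major Java version found in `java -version` style output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    first, second = int(match.group(1)), match.group(2)
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_292").
    if first == 1 and second is not None:
        return int(second)
    return first


# Hypothetical sample output; the exact build numbers will differ per machine.
sample = 'openjdk version "1.8.0_292"\nOpenJDK Runtime Environment (build 1.8.0_292-b10)'
print(java_major_version(sample))  # 8
```

On a real machine the input could come from subprocess.run(["java", "-version"], capture_output=True, text=True).stderr.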

Creating a Dedicated System User

Establishing a separate user account for the big data components keeps their files and processes isolated from other accounts on the machine:

sudo useradd -m sparkuser -s /bin/bash
sudo passwd sparkuser

Add the new user to the sudo group, then switch to that account for the remaining steps:

sudo usermod -aG sudo sparkuser
su - sparkuser

Hadoop Cluster Deployment

Obtaining Hadoop Distribution

Retrieve Hadoop from an Apache mirror. This example uses version 3.3.1; if the mirror no longer carries that release, older versions are kept on archive.apache.org:

wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
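sudo tar -xzf hadoop-3.3.1.tar.gz -C /opt/
# /opt is root-owned, so the rename needs sudo; then hand the tree to the current user
sudo mv /opt/hadoop-3.3.1 /opt/hadoop
sudo chown -R "$USER" /opt/hadoop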

Configuring System Environment

Update the user's shell profile to include Hadoop paths. Single quotes keep $PATH and $HADOOP_HOME unexpanded until the profile is sourced:

echo 'export HADOOP_HOME=/opt/hadoop' >> ~/.bashrc
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc
source ~/.bashrc

Setting Up Pseudo-Distributed Mode

Pseudo-distributed mode runs each Hadoop daemon as a separate Java process, simulating a cluster environment on a single machine.

core-site.xml Configuration

Edit $HADOOP_HOME/etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop/data/tmp</value>
    </property>
</configuration>
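Since every Hadoop *-site.xml file uses the same name/value layout, a small script can read the settings back to confirm an edit took effect. The following is a minimal sketch using only the standard library; the helper name read_hadoop_properties is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text: str) -> dict:
    """Collect <property> name/value pairs from a Hadoop configuration document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


# Same content as the core-site.xml shown above.
CORE_SITE = """<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop/data/tmp</value>
    </property>
</configuration>"""

props = read_hadoop_properties(CORE_SITE)
print(props["fs.defaultFS"])  # hdfs://localhost:9000
```

For a file on disk, the text could be read with open(path).read() before passing it to the helper.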

hdfs-site.xml Configuration

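Edit $HADOOP_HOME/etc/hadoop/hdfs-site.xml to set the HDFS replication factor and storage directories. A replication factor of 1 suits a single-node setup, since only one DataNode is available to hold each block: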

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///opt/hadoop/data/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///opt/hadoop/data/datanode</value>
    </property>
</configuration>

mapred-site.xml Configuration

Set up MapReduce framework properties:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>

yarn-site.xml Configuration

Configure YARN resource management:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>

Configuring SSH Accessibility

Enable passwordless SSH access to localhost for Hadoop daemon communication:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost
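If the ssh localhost test still prompts for a password, file permissions are a common place to look, since OpenSSH can refuse key files that other users are able to modify. The sketch below checks that a file carries exactly the 0600 mode set above. The helper name has_mode is an assumption, and the demo runs against a temporary file rather than the real authorized_keys.

```python
import os
import stat
import tempfile


def has_mode(path: str, expected: int = 0o600) -> bool:
    """True if the file's permission bits equal `expected`."""
    return stat.S_IMODE(os.stat(path).st_mode) == expected


# Demonstration on a throwaway file so nothing under ~/.ssh is touched.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
os.chmod(demo_path, 0o600)
print(has_mode(demo_path))  # True
os.remove(demo_path)
```

Pointing the helper at os.path.expanduser("~/.ssh/authorized_keys") would check the real file.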

Initializing and Starting HDFS

Format the NameNode (only once, before the first start, since formatting erases existing HDFS metadata) and start the Hadoop cluster:

hdfs namenode -format
start-dfs.sh
start-yarn.sh

Verify running services:

jps

Expected output includes NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager processes.
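Comparing the jps listing against that expected set by eye is easy to get wrong, so a short script can do it. This is a minimal sketch that assumes each jps line has the form "<pid> <name>"; the sample text and its process IDs are made up for illustration.

```python
EXPECTED_DAEMONS = {
    "NameNode",
    "DataNode",
    "SecondaryNameNode",
    "ResourceManager",
    "NodeManager",
}


def missing_daemons(jps_output: str) -> set:
    """Return the expected daemons that do not appear in the jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # Lines that carry only a pid (no process name) are skipped.
        if len(parts) >= 2:
            running.add(parts[1])
    return EXPECTED_DAEMONS - running


# Hypothetical listing in which the NodeManager failed to start.
sample = """12001 NameNode
12150 DataNode
12390 SecondaryNameNode
12700 ResourceManager
13010 Jps"""
print(sorted(missing_daemons(sample)))  # ['NodeManager']
```

An empty result means all five daemons were found.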

Spark Installation and Configuration

Downloading Spark

Download the Spark package pre-built for Hadoop 3:

wget https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
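sudo tar -xzf spark-3.3.2-bin-hadoop3.tgz -C /opt/
# /opt is root-owned, so the rename needs sudo; then hand the tree to the current user
sudo mv /opt/spark-3.3.2-bin-hadoop3 /opt/spark
sudo chown -R "$USER" /opt/spark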

Configuring Spark Environment

Add Spark variables to the shell profile. Single quotes keep $PATH and $SPARK_HOME unexpanded until the profile is sourced:

echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin' >> ~/.bashrc
source ~/.bashrc

Integrating Spark with YARN

Modify $SPARK_HOME/conf/spark-env.sh:

cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
echo "export HADOOP_HOME=/opt/hadoop" >> $SPARK_HOME/conf/spark-env.sh
echo "export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop" >> $SPARK_HOME/conf/spark-env.sh

Distributed Machine Learning with Spark MLlib

Preparing Training Data

Create a sample dataset for classification:

from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder \
    .appName("DistributedML") \
    .config("spark.master", "yarn") \
    .getOrCreate()

data = [(0, 1.0, 2.0, 0.0), 
        (1, 1.0, 2.0, 1.0),
        (2, 1.5, 1.5, 0.0),
        (3, 5.0, 5.0, 1.0),
        (4, 5.5, 5.2, 1.0)]

columns = ["id", "feature_a", "feature_b", "label"]
df = spark.createDataFrame(data, columns)

Feature Engineering and Model Training

Transform features into vector format and train a Random Forest model:

assembler = VectorAssembler(
    inputCols=["feature_a", "feature_b"],
    outputCol="features"
)

dataset = assembler.transform(df)
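# A fixed seed makes the split reproducible; with only five rows the test split may be empty
training, testing = dataset.randomSplit([0.8, 0.2], seed=42)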

rf = RandomForestClassifier(
    labelCol="label",
    featuresCol="features",
    numTrees=10
)

model = rf.fit(training)
predictions = model.transform(testing)
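With a dataset this small, the 80/20 split deserves a sanity check. randomSplit does not guarantee exact split sizes, so the test set can end up with no rows at all. The sketch below is plain Python arithmetic, not Spark code, and it assumes as a simplification that each row lands in the test split independently with probability equal to the split weight.

```python
def prob_empty_split(n_rows: int, weight: float) -> float:
    """Probability that a split with this weight receives none of the rows,
    assuming each row is assigned independently."""
    return (1.0 - weight) ** n_rows


# Five rows and a 0.2 test weight, as in the example above.
print(round(prob_empty_split(5, 0.2), 5))  # 0.32768
```

Under that assumption, roughly one run in three would leave nothing to evaluate, which is a reason to use a larger dataset for anything beyond a smoke test.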

Evaluating Model Performance

Assess prediction accuracy using built-in evaluators:

evaluator = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="accuracy"
)

accuracy = evaluator.evaluate(predictions)
print(f"Model Accuracy: {accuracy}")

spark.stop()
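For reference, the accuracy metric is the fraction of rows whose predicted label equals the true label. The following plain-Python sketch shows that calculation on made-up values; the function name fraction_correct is an assumption and is not part of the Spark API.

```python
def fraction_correct(labels, predictions) -> float:
    """Share of positions where the prediction matches the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Hypothetical labels and predictions: three of four match.
print(fraction_correct([0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0]))  # 0.75
```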

Execute the distributed machine learning job:

spark-submit --master yarn --deploy-mode client /path/to/your-script.py

Cluster Validation

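Confirm all services are operational by opening their web interfaces in a browser. With Hadoop 3 defaults, the NameNode UI is served at http://localhost:9870 and the YARN ResourceManager UI at http://localhost:8088.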
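A quick scripted check can complement the manual one. The sketch below assumes the Hadoop 3 default web UI ports (9870 for the NameNode, 8088 for the YARN ResourceManager) and a single-node setup on localhost; adjust both if your configuration overrides them. The names port_is_open and WEB_UIS are illustrative.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumed Hadoop 3 default web UI ports; not taken from the configuration above.
WEB_UIS = {"NameNode": 9870, "ResourceManager": 8088}

for name, port in WEB_UIS.items():
    state = "reachable" if port_is_open("localhost", port) else "not reachable"
    print(f"{name} web UI on port {port}: {state}")
```

A reachable port only shows that something is listening there, so it is worth opening the pages as well.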
