Setting Up Hadoop and Spark Clusters for Distributed Machine Learning
Installing the Java Development Kit
Hadoop and Spark both run on the Java Virtual Machine, so a JDK is required. On Ubuntu systems, install OpenJDK 8:
sudo apt-get update
sudo apt-get install openjdk-8-jdk
Verify the installation by running:
java -version
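Record the Java installation path, typically /usr/lib/jvm/java-8-openjdk-amd64. Hadoop expects this path in the JAVA_HOME variable, which can be set in $HADOOP_HOME/etc/hadoop/hadoop-env.sh once Hadoop is installed.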
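The installation path can also be derived from the java binary itself. The following is a minimal Python sketch, assuming java is on the PATH; the helper names java_home_from_binary and detect_java_home are hypothetical and not part of any Hadoop or Spark tooling.

```python
import os
import shutil
from typing import Optional


def java_home_from_binary(java_path: str) -> str:
    """Derive a JDK home directory from the resolved path of the java binary."""
    bin_dir = os.path.dirname(java_path)   # .../bin
    home = os.path.dirname(bin_dir)        # JDK home, or .../jre
    # OpenJDK 8 packages often place the binary under <home>/jre/bin/java.
    if os.path.basename(home) == "jre":
        home = os.path.dirname(home)
    return home


def detect_java_home() -> Optional[str]:
    """Return the JDK home for the java found on PATH, or None if absent."""
    java = shutil.which("java")
    if java is None:
        return None
    # realpath follows the symlink chain that /usr/bin/java usually points through.
    return java_home_from_binary(os.path.realpath(java))


if __name__ == "__main__":
    print(detect_java_home())
```

The printed value is a candidate for JAVA_HOME; compare it against the contents of /usr/lib/jvm before relying on it.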
Creating a Dedicated System User
Running the big data components under a separate user account isolates their files and processes from other users on the machine:
sudo useradd -m sparkuser -s /bin/bash
sudo passwd sparkuser
Add the new user to the sudo group:
sudo usermod -aG sudo sparkuser
Hadoop Cluster Deployment
Obtaining Hadoop Distribution
Retrieve Hadoop from Apache mirrors. This example uses version 3.3.1:
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
sudo tar -xzf hadoop-3.3.1.tar.gz -C /opt/
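sudo mv /opt/hadoop-3.3.1 /opt/hadoop
# /opt is root-owned; give the current user ownership so Hadoop can write its data directories
sudo chown -R "$USER" /opt/hadoop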
Configuring System Environment
Update the user's shell profile to include Hadoop paths:
# Single quotes keep $PATH and $HADOOP_HOME unexpanded until ~/.bashrc is sourced
echo 'export HADOOP_HOME=/opt/hadoop' >> ~/.bashrc
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc
source ~/.bashrc
Setting Up Pseudo-Distributed Mode
Pseudo-distributed mode runs each Hadoop daemon as a separate Java process, simulating a cluster environment on a single machine.
core-site.xml Configuration
Edit $HADOOP_HOME/etc/hadoop/core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop/data/tmp</value>
</property>
</configuration>
hdfs-site.xml Configuration
Set the HDFS replication factor and storage directories in $HADOOP_HOME/etc/hadoop/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///opt/hadoop/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///opt/hadoop/data/datanode</value>
</property>
</configuration>
mapred-site.xml Configuration
Set the MapReduce framework properties in $HADOOP_HOME/etc/hadoop/mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
</configuration>
yarn-site.xml Configuration
Configure YARN resource management in $HADOOP_HOME/etc/hadoop/yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
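All four files share the same structure: a configuration root holding property elements with a name and a value. If the files are generated by a script rather than edited by hand, that structure can be produced from a dictionary. The following is a minimal sketch using only the Python standard library; build_site_xml is a hypothetical helper, not a Hadoop utility, and writing the result to disk is left to the reader.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties: dict) -> str:
    """Render a Hadoop-style *-site.xml document from name/value pairs."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    # ET.indent is only available on Python 3.9+; skip pretty-printing otherwise.
    if hasattr(ET, "indent"):
        ET.indent(root, space="  ")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Values copied from the core-site.xml example above.
    print(build_site_xml({
        "fs.defaultFS": "hdfs://localhost:9000",
        "hadoop.tmp.dir": "/opt/hadoop/data/tmp",
    }))
```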
Configuring SSH Accessibility
The start-dfs.sh and start-yarn.sh scripts use SSH to launch the Hadoop daemons, so enable passwordless SSH access to localhost:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost
Initializing and Starting HDFS
Format the NameNode and start the Hadoop cluster:
hdfs namenode -format
start-dfs.sh
start-yarn.sh
Verify running services:
jps
Expected output includes the NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager processes.
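The check against the expected process list can be automated. The following is a minimal sketch, assuming jps prints one "<pid> <name>" pair per line; missing_daemons is a hypothetical helper.

```python
import shutil
import subprocess

EXPECTED_DAEMONS = {
    "NameNode",
    "DataNode",
    "SecondaryNameNode",
    "ResourceManager",
    "NodeManager",
}


def missing_daemons(jps_output: str) -> set:
    """Return the expected Hadoop daemons that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # Each useful line looks like "<pid> <name>"; skip blank or pid-only lines.
        if len(parts) >= 2:
            running.add(parts[1])
    return EXPECTED_DAEMONS - running


if __name__ == "__main__":
    if shutil.which("jps") is None:
        print("jps not found on PATH")
    else:
        result = subprocess.run(["jps"], capture_output=True, text=True, check=False)
        missing = missing_daemons(result.stdout)
        print("Missing daemons:", sorted(missing) if missing else "none")
```

If a daemon is reported missing, the log files under $HADOOP_HOME/logs are usually the first place to look.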
Spark Installation and Configuration
Downloading Spark
Download the Spark build that is pre-packaged for Hadoop 3:
wget https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
sudo tar -xzf spark-3.3.2-bin-hadoop3.tgz -C /opt/
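sudo mv /opt/spark-3.3.2-bin-hadoop3 /opt/spark
# Give the current user ownership so the configuration files can be edited without sudo
sudo chown -R "$USER" /opt/spark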
Configuring Spark Environment
Add Spark variables to the shell profile:
# Single quotes keep $PATH and $SPARK_HOME unexpanded until ~/.bashrc is sourced
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin' >> ~/.bashrc
source ~/.bashrc
Integrating Spark with YARN
Modify $SPARK_HOME/conf/spark-env.sh:
cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
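# Single quotes write the variable reference literally, so it resolves when spark-env.sh is sourced
echo 'export HADOOP_HOME=/opt/hadoop' >> $SPARK_HOME/conf/spark-env.sh
echo 'export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop' >> $SPARK_HOME/conf/spark-env.sh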
Distributed Machine Learning with Spark MLlib
Preparing Training Data
Create a sample dataset for classification:
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
spark = SparkSession.builder \
    .appName("DistributedML") \
    .config("spark.master", "yarn") \
    .getOrCreate()
data = [(0, 1.0, 2.0, 0.0),
        (1, 1.0, 2.0, 1.0),
        (2, 1.5, 1.5, 0.0),
        (3, 5.0, 5.0, 1.0),
        (4, 5.5, 5.2, 1.0)]
columns = ["id", "feature_a", "feature_b", "label"]
df = spark.createDataFrame(data, columns)
Feature Engineering and Model Training
Transform features into vector format and train a Random Forest model:
assembler = VectorAssembler(
    inputCols=["feature_a", "feature_b"],
    outputCol="features"
)
dataset = assembler.transform(df)
(training, testing) = dataset.randomSplit([0.8, 0.2])
rf = RandomForestClassifier(
    labelCol="label",
    featuresCol="features",
    numTrees=10
)
model = rf.fit(training)
predictions = model.transform(testing)
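With only five rows, the 80/20 split above can leave the test set empty. The following is a rough estimate of how often that happens, under the assumption that each row is assigned to a split independently with probability equal to the split's normalized weight; prob_empty_split is a hypothetical helper written for this illustration.

```python
def prob_empty_split(n_rows: int, split_weight: float) -> float:
    """Probability that a split receives no rows.

    Assumes every row is assigned independently, landing in this split
    with probability `split_weight` (a normalized weight between 0 and 1).
    """
    if not 0.0 <= split_weight <= 1.0:
        raise ValueError("split_weight must be between 0 and 1")
    return (1.0 - split_weight) ** n_rows


# Five rows, 20% test weight: 0.8 ** 5
print(round(prob_empty_split(5, 0.2), 3))  # 0.328
```

Under that assumption, roughly one run in three yields no test rows, so a larger dataset is advisable before reading anything into the evaluation result.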
Evaluating Model Performance
Assess prediction accuracy using built-in evaluators:
evaluator = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="accuracy"
)
accuracy = evaluator.evaluate(predictions)
print(f"Model Accuracy: {accuracy}")
spark.stop()
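The accuracy metric is the fraction of rows whose predicted label equals the true label. The following is a small plain-Python sketch of that calculation, independent of Spark; accuracy_score is a hypothetical helper and the label values are made up for illustration.

```python
def accuracy_score(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Illustrative values only: three of the four predictions match.
print(accuracy_score([0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0]))  # 0.75
```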
Submit the script to the cluster. The --class option applies only to Java and Scala applications, so it is omitted for a Python script:
spark-submit --master yarn /path/to/your-script.py
Cluster Validation
Confirm all services are operational by accessing:
- NameNode: http://localhost:9870
- ResourceManager: http://localhost:8088
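The same check can be scripted by testing whether the two web interface ports accept TCP connections. The following is a minimal sketch using the Python standard library; port_open is a hypothetical helper, and an open port only shows that something is listening, not that the service is healthy.

```python
import socket

# Ports taken from the URLs listed above.
WEB_UIS = {"NameNode": 9870, "ResourceManager": 8088}


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in WEB_UIS.items():
        status = "reachable" if port_open("localhost", port) else "not reachable"
        print(f"{name} UI (port {port}): {status}")
```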