Hadoop Framework Installation and Application Experiments
Experiment Requirements
Applicable Majors: Computer Science and Technology, Software Engineering, Internet of Things Engineering
Learning Objectives: Understand distributed architecture and Linux commands, achieve proficiency in Hadoop installation, HDFS programming, and MapReduce development.
Experiment Project List:
| No. | Project | Hours | Type | Description |
|---|---|---|---|---|
| 1 | Hadoop Installation and Use | 4 | Validation | Install Linux on a virtual machine, set up Hadoop, and learn basic Linux and Hadoop commands. |
| 2 | HDFS Application | 4 | Design | Master Hadoop file commands, compile and run Java programs on Linux, perform file read/write operations, and sort data. |
| 3 | MapReduce Application I—Top 10 Word Count | 4 | Design | Develop, compile, and execute MapReduce programs in Linux using Java. Create a word count program to identify the top 10 most frequent words. |
| 4 | MapReduce Application II—Movie Recommendation | 4 | Design | Implement a movie recommendation system using a three-phase MapReduce solution: count raters per movie, identify users who rated both movies A and B, and compute correlations between movie pairs. |
Experiment Report Guidelines: Reports must include experiment title, objectives, completion date, core design concepts and algorithms, results, and summary. Identical reports are prohibited.
Grading Criteria: Grades consist of attendance (10%), experiment completion (80%), and report quality (10%). Completion is assessed on the four experiment outcomes; reports are evaluated on content, formatting, accuracy, and originality.
Recommended Textbook: Principles and Applications of Big Data Technology (2nd Edition, Lin Ziyu, People's Posts and Telecommunications Press, 2017)
Reference Books:
- Data Algorithms: Hadoop/Spark Big Data Processing Techniques (Mahmoud Parsian, translated by Su Jinguo et al., China Electric Power Press, 2016)
Experiment 1: Hadoop Setup and Configuration
Content
Install Linux on a virtual machine, deploy Hadoop, and explore basic Linux and Hadoop commands.
Objectives
- Comprehend the three operational modes of Hadoop (standalone, pseudo-distributed, and fully distributed).
- Master the installation process for Hadoop pseudo-distributed mode.
- Develop independence in configuring Hadoop pseudo-distributed installations.
- Gain familiarity with the Eclipse development environment.
- Acquire proficiency in installing Hadoop development plugins.
Environment
- Linux Ubuntu 16.04
- JDK 8u162 Linux x64
- Hadoop 3.1.3
- Eclipse 4.7.0 Linux GTK x86_64
Procedure
Refer to detailed tutorials such as those by Professor Lin Ziyu from Xiamen University.
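As a minimal sketch, a pseudo-distributed setup edits two files under /usr/local/hadoop/etc/hadoop; the values below are the common single-node settings, and paths should be adapted to the local install.
core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
After editing, format the NameNode once before the first start:
./bin/hdfs namenode -format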
Results
Navigate to the Hadoop directory and start the distributed file system:
cd /usr/local/hadoop
./sbin/start-dfs.sh
Verify the running processes with jps and access the NameNode web interface at http://localhost:9870.
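If the daemons started correctly, the jps output should resemble the following (process IDs are illustrative):
3600 NameNode
3725 DataNode
3912 SecondaryNameNode
4021 Jps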
Experiment 2: HDFS File Operations and Sorting
Content
Use Hadoop file commands; write, compile, and run Java programs on Linux to perform HDFS file read/write operations and sort data.
Objectives
- Become adept with Hadoop file commands.
- Gain experience in writing, compiling, and running Java applications on Linux.
- Understand the usage of the Eclipse development environment.
- Skillfully install Hadoop development plugins.
Environment
Same as Experiment 1.
Procedure
- Start Hadoop services.
- Upload text files to HDFS:
./bin/hdfs dfs -put /home/user/data/file1.txt /user/hadoop
- Verify the upload by listing the directory contents.
- Launch Eclipse and configure the necessary JAR files from Hadoop's common, common/lib, hdfs, and hdfs/lib directories.
- Develop a Java application, package it into a JAR, and execute it (a command-line sketch follows this list).
- Retrieve and display the output from HDFS.
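As a sketch, the compile-package-run cycle might look like the following, assuming HDFSSort.java (the program listed under Source Code below) sits in /usr/local/hadoop; the paths and file names are illustrative:
cd /usr/local/hadoop
# compile against the Hadoop client libraries
javac -classpath "$(./bin/hadoop classpath)" HDFSSort.java
# package the compiled classes into a JAR
jar cf HDFSSort.jar HDFSSort*.class
# run the program; it reads input1.txt and input2.txt from the HDFS home directory
./bin/hadoop jar HDFSSort.jar HDFSSort
# display the sorted result written back to HDFS
./bin/hdfs dfs -cat sorted_output.txt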
Results
Confirm successful startup, file upload, and execution of the Java application with sorted output.
Source Code
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import java.io.*;
import java.util.*;

// Reads two HDFS text files of integers (one per line), merges and sorts
// them in memory, prints the sorted values, and writes them back to HDFS.
// Note: an in-memory sort is only appropriate for small inputs.
public class HDFSSort {
    public static void main(String[] args) {
        try {
            List<Integer> values = new ArrayList<>();
            // Point the FileSystem client at the local pseudo-distributed NameNode.
            Configuration config = new Configuration();
            config.set("fs.defaultFS", "hdfs://localhost:9000");
            FileSystem hdfs = FileSystem.get(config);
            // Relative paths resolve to the user's HDFS home directory.
            Path sourceFile1 = new Path("input1.txt");
            Path sourceFile2 = new Path("input2.txt");
            BufferedReader reader1 = new BufferedReader(new InputStreamReader(hdfs.open(sourceFile1)));
            BufferedReader reader2 = new BufferedReader(new InputStreamReader(hdfs.open(sourceFile2)));
            // Collect one integer per line from both input files.
            String line;
            while ((line = reader1.readLine()) != null) {
                values.add(Integer.parseInt(line.trim()));
            }
            while ((line = reader2.readLine()) != null) {
                values.add(Integer.parseInt(line.trim()));
            }
            reader1.close();
            reader2.close();
            Collections.sort(values);
            for (int num : values) {
                System.out.println(num);
            }
            // Write the sorted sequence back to HDFS, one value per line.
            Path resultFile = new Path("sorted_output.txt");
            FSDataOutputStream writer = hdfs.create(resultFile);
            for (int num : values) {
                writer.writeBytes(num + "\n");
            }
            writer.close();
            hdfs.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Experiment 3: MapReduce Word Frequency Analysis
Content
Write, compile, and run a MapReduce program in Linux using Java to perform word count and identify the top 10 most frequent words.
Objectives
- Gain proficiency in developing, compiling, and executing MapReduce applications on Linux.
- Familiarize with the Eclipse development environment.
- Successfully install Hadoop development plugins.
Environment
Same as previous experiments.
Procedure
- Start the Hadoop cluster.
- Inspect HDFS directories and remove any residual folders.
- Create input directories and upload a text file for processing.
- In Eclipse, add required JARs from Hadoop's common, common/lib, and mapreduce directories.
- Develop the MapReduce program, package it, and run it with the specified input and output paths (see the sketch after this list).
- Examine the results stored in the output directory.
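A sketch of the HDFS preparation and job submission, assuming the program below is packaged as TopWordCount.jar and a sample text file sits at /home/user/data/sample.txt (both names are illustrative):
cd /usr/local/hadoop
# create the input directory in the HDFS home directory and upload the text
./bin/hdfs dfs -mkdir -p input
./bin/hdfs dfs -put /home/user/data/sample.txt input
# remove any leftover output directory; the job fails if it already exists
./bin/hdfs dfs -rm -r -f output
# submit the job with input and output paths as arguments
./bin/hadoop jar TopWordCount.jar TopWordCount input output
# display the top 10 word-frequency pairs
./bin/hdfs dfs -cat output/part-r-00000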
Results
Verify the execution of the MapReduce job and the generation of the top 10 word frequency list.
Source Code
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.util.*;

public class TopWordCount {

    // Emits (word, 1) for every whitespace-delimited token in the input.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Accumulates per-word totals in memory, then emits only the 10 most
    // frequent words from cleanup() once every key has been reduced.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private Map<String, Integer> frequencyMap = new HashMap<>();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) {
            int total = 0;
            for (IntWritable val : values) {
                total += val.get();
            }
            frequencyMap.put(key.toString(), total);
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Sort words by descending frequency and write the top 10.
            List<Map.Entry<String, Integer>> entries = new ArrayList<>(frequencyMap.entrySet());
            entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
            int limit = Math.min(10, entries.size());
            for (int i = 0; i < limit; i++) {
                Map.Entry<String, Integer> entry = entries.get(i);
                context.write(new Text(entry.getKey()), new IntWritable(entry.getValue()));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "top word count");
        job.setJarByClass(TopWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        // A single reducer is required so that cleanup() sees the global
        // counts; with multiple reducers each would emit its own top 10.
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Experiment 4: Movie Recommendation with MapReduce
Content
Implement a three-stage MapReduce solution for movie recommendations: count raters per movie, identify common raters for movie pairs, and compute inter-movie correlations.
Objectives
Develop a multi-phase MapReduce pipeline to analyze user ratings and derive movie associations.
Environment
Same as previous experiments.
Procedure
Design and implement a MapReduce job for each stage, chaining them so that the output directory of one stage serves as the input of the next. Test with sample rating datasets. A sketch of the first stage appears after the Results section below.
Results
Produce a set of movie correlations based on shared user ratings.
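The handout provides no reference implementation for this experiment. As a minimal sketch, the first stage (counting raters per movie) might be structured as follows, assuming rating records of the form userID,movieID,rating; the class name MovieRaterCount, the record format, and the paths are illustrative assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

// Stage 1: count how many users rated each movie.
public class MovieRaterCount {

    public static class RatingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text movieId = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Assumed record format: userID,movieID,rating
            String[] fields = value.toString().split(",");
            if (fields.length >= 3) {
                movieId.set(fields[1]);
                context.write(movieId, one); // one rater for this movie
            }
        }
    }

    public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int raters = 0;
            for (IntWritable val : values) {
                raters += val.get();
            }
            context.write(key, new IntWritable(raters)); // (movieID, raterCount)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "movie rater count");
        job.setJarByClass(MovieRaterCount.class);
        job.setMapperClass(RatingMapper.class);
        // The reducer also works as a combiner, since counting is associative.
        job.setCombinerClass(CountReducer.class);
        job.setReducerClass(CountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // ratings input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // Stage 1 output, Stage 2 input
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Under the same assumptions, Stage 2 would group ratings by userID and emit every pair of movies a user rated, and Stage 3 would reduce over those movie pairs to compute a correlation measure from the shared ratings, with each stage reading the previous stage's HDFS output directory.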