Working with HDFS File System Commands and Performance Benchmarking
File System Operations
Searching Files
To locate files within HDFS, use the find command with the pattern specified after the -name flag:
hadoop fs -find / -name "application_*"
Modifying Permissions
Changing permissions requires owner or HDFS superuser privileges. Note that the OS root user is not automatically the HDFS superuser, so attempts as root may fail:
hadoop fs -chmod -R 777 /
# Result: Permission denied. Switch to hadoop user
After switching to the hadoop user (the HDFS superuser on this cluster), the command succeeds. Recursively granting 777 from the root path is appropriate only in test environments:
su - hadoop
hadoop fs -chmod -R 777 /
Directory and File Counting
The count command provides metrics about directories, files, and their sizes:
hdfs dfs -count [-q] [-h] <paths>
-q: Displays quota information
-h: Shows sizes in human-readable format
Example output without -q:
DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
With quota information:
QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
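Because the output is fixed-column, it pipes cleanly into standard text tools. As a sketch, summing CONTENT_SIZE (the third column of the non-quota format) across several paths with awk; the two sample lines below are hypothetical:

```shell
# Hypothetical `hdfs dfs -count` output lines: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
# In practice, pipe the real command output into awk instead of printf.
printf '%s\n' \
  "           3           12            1048576 /user/hadoop/logs" \
  "           1            4             524288 /user/hadoop/tmp" |
awk '{ total += $3 } END { printf "total bytes: %d\n", total }'
```

The same pattern works for the quota format by adjusting the column index (CONTENT_SIZE becomes the seventh field).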
Bypassing Trash During Deletion
Removal may fail with a symbolic link error:
hadoop fs -rm -r cosn://{bucket}/emr/hadoop-yarn/staging/hadoop/.staging -skipTrash
# Error: Symbolic link does not exist
Add the -skipTrash flag, placed before the path (FsShell options must precede path arguments), to delete directly without moving files to trash:
hadoop fs -rm -r -skipTrash cosn://{bucket}/emr/hadoop-yarn/staging/hadoop/.staging
Performance Evaluation
I/O Benchmark Testing
Execute write performance tests with TestDFSIO:
hadoop jar hadoop-mapreduce-client-jobclient-3.2.2-tests.jar TestDFSIO -write -nrFiles 10 -size 4GB -bufferSize 8388608 -resFile ./TestDFSIOwrite.log
Key parameters:
-nrFiles: Number of files to process
-size: Individual file size
-bufferSize: Buffer size for I/O operations (default 1MB)
-resFile: Output results file
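The parameters above can be wrapped in a small helper. A minimal dry-run sketch (the jar path and defaults are the values used in this article; adjust for your cluster), which prints the command it would execute rather than running it:

```shell
# Dry-run helper: print the TestDFSIO command line that would be executed.
# Jar filename and defaults mirror the examples in this article; adjust as needed.
dfsio_cmd() {
  mode=${1:--write}      # -write or -read
  nr=${2:-10}            # -nrFiles
  size=${3:-4GB}         # per-file -size
  buf=${4:-8388608}      # -bufferSize (8 MiB)
  echo "hadoop jar hadoop-mapreduce-client-jobclient-3.2.2-tests.jar TestDFSIO" \
       "$mode -nrFiles $nr -size $size -bufferSize $buf -resFile ./TestDFSIO${mode#-}.log"
}
dfsio_cmd -write 10 4GB
```

Piping the printed line into `sh` (or removing the echo) turns the dry run into a real invocation.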
Handling Dependencies
If the job fails with a ClassNotFoundException (typically a missing JUnit class), copy the JUnit jar into Hadoop's common lib directory:
scp junit-4.11.jar /usr/local/service/hadoop/share/hadoop/common/lib/
Analyzing Results
Sample output interpretation:
----- TestDFSIO ----- : write
Date & time: Thu Apr 18 20:50:27 CST 2024
Number of files: 10
Total MBytes processed: 40960
Throughput mb/sec: 122.97
Average IO rate mb/sec: 123.25
IO rate std deviation: 5.85
Test exec time sec: 53.6
Aggregate cluster throughput can be estimated as total data divided by wall-clock execution time: 40960 MB / 53.6 s ≈ 764 MB/s. This exceeds the reported Throughput mb/sec figure because all 10 files are written in parallel, while the reported figure reflects the per-task rate.
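The aggregate figure can be derived directly from the result file. A sketch, using the sample numbers shown above and assuming the standard TestDFSIO log layout (the here-doc stands in for a real -resFile):

```shell
# Reproduce the sample TestDFSIO output; point awk at your real -resFile instead.
cat <<'EOF' > /tmp/TestDFSIOwrite.log
----- TestDFSIO ----- : write
           Date & time: Thu Apr 18 20:50:27 CST 2024
       Number of files: 10
Total MBytes processed: 40960
     Throughput mb/sec: 122.97
Average IO rate mb/sec: 123.25
 IO rate std deviation: 5.85
    Test exec time sec: 53.6
EOF
# Aggregate throughput = total MB processed / wall-clock execution time.
awk -F': ' '
  /Total MBytes processed/ { mb = $2 }
  /Test exec time sec/     { t  = $2 }
  END { printf "aggregate MB/s: %.0f\n", mb / t }
' /tmp/TestDFSIOwrite.log
```

This makes it easy to compare aggregate numbers across the write and read logs produced by repeated runs.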
Benchmark Data Locations
Generated test data directories:
hadoop fs -ls /benchmarks/TestDFSIO/
Detailed results are stored in:
hadoop fs -cat /benchmarks/TestDFSIO/io_write/part-00000
Practical Test Scenarios
Bandwidth testing with different storage systems:
# COSN write test
hadoop jar hadoop-mapreduce-client-jobclient-3.2.2-tests.jar TestDFSIO -Dfs.defaultFS=cosn://{bucket} -Dfs.AbstractFileSystem.cosn.impl=org.apache.hadoop.fs.CosN -libjars /data/xxx.jar -write -nrFiles 40 -size 4GB -bufferSize 8388608 -resFile ./50gCosnTestDFSIOwrite.log
# COSN read test
hadoop jar hadoop-mapreduce-client-jobclient-3.2.2-tests.jar TestDFSIO -Dfs.defaultFS=cosn://{bucket} -Dfs.AbstractFileSystem.cosn.impl=org.apache.hadoop.fs.CosN -libjars /data/xxx.jar -read -nrFiles 40 -size 4GB -bufferSize 8388608 -resFile ./50gCosnTestDFSIOread.log
# S3A write test
hadoop jar hadoop-mapreduce-client-jobclient-3.2.2-tests.jar TestDFSIO -Dfs.defaultFS=s3a://{bucket} -Dfs.s3a.endpoint=https://cos.ap-shanghai.myqcloud.com -write -nrFiles 40 -size 4GB -bufferSize 8388608 -resFile ./50gS3ATestDFSIOwrite.log
# S3A read test
hadoop jar hadoop-mapreduce-client-jobclient-3.2.2-tests.jar TestDFSIO -Dfs.defaultFS=s3a://{bucket} -Dfs.s3a.endpoint=https://cos.ap-shanghai.myqcloud.com -read -nrFiles 40 -size 4GB -bufferSize 8388608 -resFile ./50gS3ATestDFSIOread.log
Optimized COSN random read testing:
hadoop jar hadoop-mapreduce-client-jobclient-3.2.2-tests.jar TestDFSIO -Dfs.defaultFS=cosn://{bucket} -Dfs.AbstractFileSystem.cosn.impl=org.apache.hadoop.fs.CosN -Dfs.cosn.impl=org.apache.hadoop.fs.CosFileSystem -Dfs.cosn.read.inputstream.optimized.enabled=true -libjars /home/hadoop/hadoop-cos-8.3.10.jar,/home/hadoop/cos_api-bundle-5.6.137.2.jar -read -random -nrFiles 40 -size 4GB -bufferSize 8388608 -resFile ./50gCosnOptTestDFSIOreadrandom.log