Interacting with HDFS via Command Line and Java API
Command Line Interface
Syntax Foundation
Both hadoop fs and hdfs dfs serve as entry points for filesystem operations. hadoop fs is the generic client and works with any filesystem Hadoop supports, while hdfs dfs targets HDFS specifically; when operating on HDFS, the two behave identically and can be used interchangeably.
Available Commands Overview
$ hdfs dfs
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] <localsrc> ... <dst>]
[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] <path> ...]
[-cp [-f] [-p] <src> ... <dst>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] <path> ...]
[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getmerge [-nl] <src> <localdst>]
[-help [cmd ...]]
[-ls [-d] [-h] [-R] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] <localsrc> ... <dst>]
[-rm [-f] [-r|-R] [-skipTrash] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
Practical Command Examples
Environment Setup
Start the HDFS and YARN daemons before running any operations:
$ sbin/start-dfs.sh
$ sbin/start-yarn.sh
Display help documentation for a specific command:
$ hdfs dfs -help rm
Create a new directory structure:
$ hdfs dfs -mkdir /data_repo
Data Upload Operations
-moveFromLocal: Transfers a local file to HDFS and removes the local copy.
$ echo "alpha_data" > alpha.txt
$ hdfs dfs -moveFromLocal ./alpha.txt /data_repo
-copyFromLocal: Duplicates a local file into HDFS while preserving the original.
$ echo "beta_data" > beta.txt
$ hdfs dfs -copyFromLocal beta.txt /data_repo
-put: Functions identically to copyFromLocal and is the preferred syntax in production environments.
$ echo "gamma_data" > gamma.txt
$ hdfs dfs -put ./gamma.txt /data_repo
-appendToFile: Appends local file content to the end of an existing HDFS file.
$ echo "supplement" > append_source.txt
$ hdfs dfs -appendToFile append_source.txt /data_repo/alpha.txt
Data Download Operations
-copyToLocal: Replicates an HDFS file to the local filesystem.
$ hdfs dfs -copyToLocal /data_repo/alpha.txt ./
-get: Equivalent to copyToLocal, widely adopted as the standard download command.
$ hdfs dfs -get /data_repo/alpha.txt ./alpha_backup.txt
Distributed Filesystem Management
-ls: Enumerate directory contents.
$ hdfs dfs -ls /data_repo
-cat: Output file contents to the console.
$ hdfs dfs -cat /data_repo/alpha.txt
-chgrp, -chmod, -chown: Modify file permissions and ownership, mirroring standard Linux filesystem commands.
$ hdfs dfs -chmod 666 /data_repo/alpha.txt
$ hdfs dfs -chown admin:devgroup /data_repo/alpha.txt
-mkdir: Generate new directory paths.
$ hdfs dfs -mkdir /archive_zone
-cp: Duplicate files between different HDFS locations.
$ hdfs dfs -cp /data_repo/alpha.txt /archive_zone
-mv: Relocate files within the HDFS namespace.
$ hdfs dfs -mv /data_repo/gamma.txt /archive_zone
$ hdfs dfs -mv /data_repo/beta.txt /archive_zone
-tail: Display the final kilobyte of a file.
$ hdfs dfs -tail /archive_zone/alpha.txt
-rm: Remove individual files.
$ hdfs dfs -rm /data_repo/alpha.txt
-rm -r: Recursively delete directories and their contents.
$ hdfs dfs -rm -r /data_repo
-du: Calculate storage consumption for directories.
$ hdfs dfs -du -s -h /archive_zone
27 81 /archive_zone
$ hdfs dfs -du -h /archive_zone
14 42 /archive_zone/alpha.txt
7 21 /archive_zone/beta.txt
6 18 /archive_zone/gamma.txt
The first numeric column indicates the raw file size, the second the total storage consumed across all replicas (raw size multiplied by the replication factor), and the third the queried path.
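The relationship between the two columns is simple arithmetic; a minimal sketch (assuming the replication factor of 3 implied by the sample output; the class and method names are illustrative, not part of any Hadoop API):

```java
public class DuColumns {
    // Second du column = raw size multiplied by the replication factor.
    public static long storedBytes(long rawBytes, int replicationFactor) {
        return rawBytes * replicationFactor;
    }

    public static void main(String[] args) {
        // Matches the sample output above: 14 raw bytes -> 42 stored bytes.
        System.out.println(storedBytes(14, 3)); // 42
        System.out.println(storedBytes(7, 3));  // 21
        System.out.println(storedBytes(6, 3));  // 18
    }
}
```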
-setrep: Adjust the replication factor for a specified file.
$ hdfs dfs -setrep 10 /archive_zone/alpha.txt
The updated replication count is recorded in NameNode metadata immediately. However, the actual number of physical replicas is constrained by the available DataNode count. A cluster with only 3 nodes cannot physically sustain 10 replicas; the target count will only be achieved once the cluster scales to 10 or more nodes.
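In other words, the number of physical replicas at any moment is capped by the live DataNode count; a one-method sketch of that constraint (a hypothetical helper for illustration, not part of any Hadoop API):

```java
public class EffectiveReplication {
    // The NameNode records the target factor immediately, but physical
    // copies are limited by the number of live DataNodes.
    public static int physicalReplicas(int targetReplication, int liveDataNodes) {
        return Math.min(targetReplication, liveDataNodes);
    }

    public static void main(String[] args) {
        // A 3-node cluster asked for 10 replicas only sustains 3.
        System.out.println(physicalReplicas(10, 3)); // 3
    }
}
```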
Java API Integration
Development Environment Configuration
Extract the Windows Hadoop binaries (e.g., hadoop-3.1.0) to a path free of non-ASCII characters (such as D:\).
Establish the HADOOP_HOME system environment variable pointing to this directory, and append %HADOOP_HOME%\bin to the Path variable. If variables fail to resolve, a system reboot may be necessary. Execute winutils.exe to verify the setup; missing Microsoft runtime libraries can trigger errors, requiring installation of the corresponding redistributable package.
Create a Maven project and declare the required dependencies:
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.1.3</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.30</version>
    </dependency>
</dependencies>
Define logging behavior by creating log4j.properties inside src/main/resources:
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
Instantiate a basic connection client:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

public class HdfsConnection {
    @Test
    public void createDirectory() throws IOException, URISyntaxException, InterruptedException {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode-host:8020"), conf, "admin");
        hdfs.mkdirs(new Path("/project_dir/sub_folder/"));
        hdfs.close();
    }
}
Specify the user identity in the FileSystem.get call. Without it, the API defaults to the current operating-system user, which frequently results in AccessControlException permission denials on HDFS.
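Each example below repeats the same get/close boilerplate. One way to centralize it is a small helper built on try-with-resources, which works for FileSystem because it implements java.io.Closeable. This is a pure-Java sketch; WithResource and IoAction are hypothetical names, not Hadoop API:

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical helper: runs an action against any Closeable resource and
// guarantees close() is called, even if the action throws.
public class WithResource {
    @FunctionalInterface
    public interface IoAction<T extends Closeable> {
        void apply(T resource) throws IOException;
    }

    public static <T extends Closeable> void use(T resource, IoAction<T> action) throws IOException {
        // try-with-resources closes the handle on both success and failure.
        try (T r = resource) {
            action.apply(r);
        }
    }
}
```

Against a real cluster this would read, for example, `WithResource.use(FileSystem.get(uri, conf, "admin"), fs -> fs.mkdirs(new Path("/project_dir/sub_folder/")));`, removing the explicit close() from every test method.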
API Implementation Scenarios
File Upload and Configuration Priority
@Test
public void transferLocalToHdfs() throws IOException, InterruptedException, URISyntaxException {
    Configuration conf = new Configuration();
    conf.set("dfs.replication", "2");
    FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode-host:8020"), conf, "admin");
    hdfs.copyFromLocalFile(new Path("d:/local_data.txt"), new Path("/project_dir/sub_folder"));
    hdfs.close();
}
Placing hdfs-site.xml in the project resource directory overrides default settings:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
Configuration values resolve in a strict priority hierarchy: values set programmatically in client code override all others, followed by user-defined configuration files on the classpath, then server-side custom configurations (xxx-site.xml), and finally server-side defaults (xxx-default.xml).
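The lookup order can be illustrated with a toy resolver: each source is consulted from highest to lowest priority, and the first value found wins. This is an illustration of the precedence rule only, not Hadoop's actual Configuration implementation; all names are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ConfigPriority {
    // Sources are iterated in insertion order (highest priority first);
    // the first source that defines the key wins.
    public static String resolve(String key, Map<String, Map<String, String>> sources) {
        for (Map.Entry<String, Map<String, String>> source : sources.entrySet()) {
            String value = source.getValue().get(key);
            if (value != null) {
                return value + " (from " + source.getKey() + ")";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> sources = new LinkedHashMap<>();
        sources.put("client code", Map.of("dfs.replication", "2"));
        sources.put("classpath hdfs-site.xml", Map.of("dfs.replication", "1"));
        sources.put("server hdfs-default.xml", Map.of("dfs.replication", "3"));
        System.out.println(resolve("dfs.replication", sources));
        // prints: 2 (from client code)
    }
}
```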
File Download
@Test
public void retrieveFromHdfs() throws IOException, InterruptedException, URISyntaxException {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode-host:8020"), conf, "admin");
    // delSrc: whether to remove the HDFS source post-transfer
    // src: HDFS file location
    // dst: local destination path
    // useRawLocalFileSystem: when true, skip checksum verification and do not
    // write a local .crc checksum file
    hdfs.copyToLocalFile(false, new Path("/project_dir/sub_folder/local_data.txt"), new Path("d:/retrieved_data.txt"), true);
    hdfs.close();
}
File Relocation and Renaming
@Test
public void modifyFileName() throws IOException, InterruptedException, URISyntaxException {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode-host:8020"), conf, "admin");
    hdfs.rename(new Path("/project_dir/sub_folder/local_data.txt"), new Path("/project_dir/sub_folder/renamed_data.txt"));
    hdfs.close();
}
Directory and File Deletion
@Test
public void removePath() throws IOException, InterruptedException, URISyntaxException {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode-host:8020"), conf, "admin");
    // true enables recursive deletion of the directory and its contents
    hdfs.delete(new Path("/project_dir"), true);
    hdfs.close();
}
File Metadata Inspection
Retrieve attributes including permissions, ownership, size, and block locations:
@Test
public void examineFileAttributes() throws IOException, InterruptedException, URISyntaxException {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode-host:8020"), conf, "admin");
    // true enables recursive listing of all files under the path
    RemoteIterator<LocatedFileStatus> iterator = hdfs.listFiles(new Path("/"), true);
    while (iterator.hasNext()) {
        LocatedFileStatus status = iterator.next();
        System.out.println("========" + status.getPath() + "=========");
        System.out.println(status.getPermission());
        System.out.println(status.getOwner());
        System.out.println(status.getGroup());
        System.out.println(status.getLen());
        System.out.println(status.getModificationTime());
        System.out.println(status.getReplication());
        System.out.println(status.getBlockSize());
        System.out.println(status.getPath().getName());
        BlockLocation[] blocks = status.getBlockLocations();
        System.out.println(Arrays.toString(blocks));
    }
    hdfs.close();
}
Distinguishing Files from Directories
@Test
public void classifyPathTypes() throws IOException, InterruptedException, URISyntaxException {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode-host:8020"), conf, "admin");
    FileStatus[] statuses = hdfs.listStatus(new Path("/"));
    for (FileStatus status : statuses) {
        if (status.isFile()) {
            System.out.println("f:" + status.getPath().getName());
        } else {
            System.out.println("d:" + status.getPath().getName());
        }
    }
    hdfs.close();
}