
Interacting with HDFS via Command Line and Java API


Command Line Interface

Syntax Foundation

Both the hadoop fs and hdfs dfs commands serve as entry points for HDFS operations. When operating on HDFS they behave identically and can be used interchangeably; for example, hadoop fs -ls / and hdfs dfs -ls / return the same listing. (hadoop fs can also address other Hadoop-supported filesystems, while hdfs dfs targets HDFS specifically.)

Available Commands Overview

$ hdfs dfs
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] <localsrc> ... <dst>]
[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] <path> ...]
[-cp [-f] [-p] <src> ... <dst>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] <path> ...]
[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getmerge [-nl] <src> <localdst>]
[-help [cmd ...]]
[-ls [-d] [-h] [-R] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] <localsrc> ... <dst>]
[-rm [-f] [-r|-R] [-skipTrash] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]

Practical Command Examples

Environment Setup

Initialize the Hadoop services before executing operations:

$ sbin/start-dfs.sh
$ sbin/start-yarn.sh

Display help documentation for a specific command:

$ hdfs dfs -help rm

Create a new directory structure:

$ hdfs dfs -mkdir /data_repo

Data Upload Operations

-moveFromLocal: Transfers a local file to HDFS and removes the local copy.

$ echo "alpha_data" > alpha.txt
$ hdfs dfs -moveFromLocal ./alpha.txt /data_repo

-copyFromLocal: Duplicates a local file into HDFS while preserving the original.

$ echo "beta_data" > beta.txt
$ hdfs dfs -copyFromLocal beta.txt /data_repo

-put: Functions identically to copyFromLocal and is the preferred syntax in production environments.

$ echo "gamma_data" > gamma.txt
$ hdfs dfs -put ./gamma.txt /data_repo

-appendToFile: Appends local file content to the end of an existing HDFS file.

$ echo "supplement" > append_source.txt
$ hdfs dfs -appendToFile append_source.txt /data_repo/alpha.txt

Data Download Operations

-copyToLocal: Replicates an HDFS file to the local filesystem.

$ hdfs dfs -copyToLocal /data_repo/alpha.txt ./

-get: Equivalent to copyToLocal, widely adopted as the standard download command.

$ hdfs dfs -get /data_repo/alpha.txt ./alpha_backup.txt

Distributed Filesystem Management

-ls: Enumerate directory contents.

$ hdfs dfs -ls /data_repo

-cat: Output file contents to the console.

$ hdfs dfs -cat /data_repo/alpha.txt

-chgrp, -chmod, -chown: Modify file permissions and ownership, mirroring standard Linux filesystem commands.

$ hdfs dfs -chmod 666 /data_repo/alpha.txt
$ hdfs dfs -chown admin:devgroup /data_repo/alpha.txt

-mkdir: Generate new directory paths.

$ hdfs dfs -mkdir /archive_zone

-cp: Duplicate files between different HDFS locations.

$ hdfs dfs -cp /data_repo/alpha.txt /archive_zone

-mv: Relocate files within the HDFS namespace.

$ hdfs dfs -mv /data_repo/gamma.txt /archive_zone
$ hdfs dfs -mv /data_repo/beta.txt /archive_zone

-tail: Display the final kilobyte of a file.

$ hdfs dfs -tail /archive_zone/alpha.txt

-rm: Remove individual files.

$ hdfs dfs -rm /data_repo/alpha.txt

-rm -r: Recursively delete directories and their contents.

$ hdfs dfs -rm -r /data_repo

-du: Calculate storage consumption for directories.

$ hdfs dfs -du -s -h /archive_zone
27  81  /archive_zone

$ hdfs dfs -du -h /archive_zone
14  42  /archive_zone/alpha.txt
7   21  /archive_zone/beta.txt
6   18  /archive_zone/gamma.txt

The first numeric column is the raw file size, the second is the total storage consumed across all replicas (raw size multiplied by the replication factor; here 14 × 3 = 42 for alpha.txt), and the third is the queried path.

-setrep: Adjust the replication factor for a specified file.

$ hdfs dfs -setrep 10 /archive_zone/alpha.txt

The updated replication count is recorded in NameNode metadata immediately. However, the actual number of physical replicas is constrained by the available DataNode count. A cluster with only 3 nodes cannot physically sustain 10 replicas; the target count will only be achieved once the cluster scales to 10 or more nodes.

Java API Integration

Development Environment Configuration

Extract the Windows Hadoop binaries (e.g., hadoop-3.1.0) to a path free of non-ASCII characters (such as D:\).

Set the HADOOP_HOME system environment variable to this directory and append %HADOOP_HOME%\bin to the Path variable; if the variables do not take effect, a system reboot may be necessary. Run winutils.exe to verify the setup; an error about missing Microsoft runtime libraries means the corresponding redistributable package must be installed.

Create a Maven project and declare the required dependencies:

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.1.3</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.30</version>
    </dependency>
</dependencies>

Define logging behavior by creating log4j.properties inside src/main/resources:

log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n

Instantiate a basic connection client:

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

public class HdfsConnection {

    @Test
    public void createDirectory() throws IOException, URISyntaxException, InterruptedException {
        // Connect to the NameNode as the HDFS user "admin"
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode-host:8020"), conf, "admin");

        hdfs.mkdirs(new Path("/project_dir/sub_folder/"));

        hdfs.close();
    }
}

Specifying the user identity during FileSystem.get is mandatory. Without it, the API defaults to the current operating system user, which frequently results in AccessControlException permission denials on HDFS.
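
As a minimal sketch, the two overloads can be contrasted directly. It reuses the placeholder NameNode address and "admin" user from the example above; the method and variable names are illustrative only.

@Test
public void compareUserHandling() throws IOException, URISyntaxException, InterruptedException {
    // Illustrative sketch: contrasts the two FileSystem.get overloads
    Configuration conf = new Configuration();

    // Two-argument form: the client acts as the local OS user, which the
    // NameNode may reject with AccessControlException on protected paths
    FileSystem asOsUser = FileSystem.get(new URI("hdfs://namenode-host:8020"), conf);
    asOsUser.close();

    // Three-argument form: the client explicitly acts as the HDFS user "admin"
    FileSystem asAdmin = FileSystem.get(new URI("hdfs://namenode-host:8020"), conf, "admin");
    asAdmin.close();
}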

API Implementation Scenarios

File Upload and Configuration Priority

@Test
public void transferLocalToHdfs() throws IOException, InterruptedException, URISyntaxException {
    Configuration conf = new Configuration();
    conf.set("dfs.replication", "2");
    FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode-host:8020"), conf, "admin");

    hdfs.copyFromLocalFile(new Path("d:/local_data.txt"), new Path("/project_dir/sub_folder"));

    hdfs.close();
}

Placing hdfs-site.xml in the project resource directory overrides default settings:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Configuration values resolve in a strict priority hierarchy: values set programmatically in client code override all others, followed by user-defined configuration files on the classpath, then server-side custom configurations (xxx-site.xml), and finally server-side defaults (xxx-default.xml).
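
One way to observe which value takes effect is to read back the replication of a freshly uploaded file. The sketch below reuses the connection details and paths from the upload example; the method name is illustrative only.

@Test
public void inspectEffectiveReplication() throws IOException, InterruptedException, URISyntaxException {
    // Illustrative sketch: checks which dfs.replication value wins
    Configuration conf = new Configuration();
    conf.set("dfs.replication", "2");   // highest priority: value set in client code
    FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode-host:8020"), conf, "admin");

    hdfs.copyFromLocalFile(new Path("d:/local_data.txt"), new Path("/project_dir/sub_folder"));

    // Prints 2 here; without the conf.set call, the classpath hdfs-site.xml
    // value (1) applies, and with that file removed as well, the server-side
    // configuration or default takes over
    FileStatus uploaded = hdfs.getFileStatus(new Path("/project_dir/sub_folder/local_data.txt"));
    System.out.println(uploaded.getReplication());

    hdfs.close();
}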

File Download

@Test
public void retrieveFromHdfs() throws IOException, InterruptedException, URISyntaxException {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode-host:8020"), conf, "admin");

    // delSrc: whether to remove the HDFS source post-transfer
    // src: HDFS file location
    // dst: local destination path
    // useRawLocalFileSystem: when true, write via RawLocalFileSystem and skip creating a local .crc checksum file
    hdfs.copyToLocalFile(false, new Path("/project_dir/sub_folder/local_data.txt"), new Path("d:/retrieved_data.txt"), true);

    hdfs.close();
}

File Relocation and Renaming

@Test
public void modifyFileName() throws IOException, InterruptedException, URISyntaxException {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode-host:8020"), conf, "admin");

    hdfs.rename(new Path("/project_dir/sub_folder/local_data.txt"), new Path("/project_dir/sub_folder/renamed_data.txt"));

    hdfs.close();
}

Directory and File Deletion

@Test
public void removePath() throws IOException, InterruptedException, URISyntaxException {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode-host:8020"), conf, "admin");

    hdfs.delete(new Path("/project_dir"), true);

    hdfs.close();
}

File Metadata Inspection

Retrieve attributes including permissions, ownership, size, and block locations:

@Test
public void examineFileAttributes() throws IOException, InterruptedException, URISyntaxException {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode-host:8020"), conf, "admin");

    RemoteIterator<LocatedFileStatus> iterator = hdfs.listFiles(new Path("/"), true);

    while (iterator.hasNext()) {
        LocatedFileStatus status = iterator.next();

        System.out.println("========" + status.getPath() + "=========");
        System.out.println(status.getPermission());
        System.out.println(status.getOwner());
        System.out.println(status.getGroup());
        System.out.println(status.getLen());
        System.out.println(status.getModificationTime());
        System.out.println(status.getReplication());
        System.out.println(status.getBlockSize());
        System.out.println(status.getPath().getName());

        BlockLocation[] blocks = status.getBlockLocations();
        System.out.println(Arrays.toString(blocks));
    }
    hdfs.close();}

Distinguishing Files from Directories

@Test
public void classifyPathTypes() throws IOException, InterruptedException, URISyntaxException {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode-host:8020"), conf, "admin");

    FileStatus[] statuses = hdfs.listStatus(new Path("/"));

    for (FileStatus status : statuses) {
        if (status.isFile()) {
            System.out.println("f:" + status.getPath().getName());
        } else {
            System.out.println("d:" + status.getPath().getName());
        }
    }
    hdfs.close();
}
Tags: hdfs, Hadoop
