Essential HDFS Commands for Data Engineers: A Complete Guide
Introduction
Understanding HDFS (Hadoop Distributed File System) commands is crucial for any Data Engineer working with Big Data. This guide will walk you through essential HDFS commands, their usage patterns, and real-world scenarios.
Basic HDFS Command Structure
There are two command-line entry points for HDFS:
hadoop fs [commands]
hdfs dfs [commands]
Both accept the same sub-commands. The practical difference is that hadoop fs is a generic client that works with any Hadoop-supported file system (HDFS, the local file system, S3, and so on), while hdfs dfs targets HDFS specifically. For everyday HDFS work they are interchangeable; choose whichever you find more comfortable.
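For example, both of the following list the same HDFS directory, while the hadoop fs form can also address other file systems through a URI scheme:
# Equivalent listings of an HDFS directory
hadoop fs -ls /user/username
hdfs dfs -ls /user/username
# hadoop fs can also reach other supported file systems via a scheme
hadoop fs -ls file:///tmp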
Navigation and File Listing
Basic Listing Commands
# List files in your HDFS home directory
hadoop fs -ls
# List files in specific HDFS directory
hadoop fs -ls /user/username
# List files recursively
hadoop fs -ls -R /user/username
# List files with human-readable sizes
hadoop fs -ls -h /user/username
# Sort by timestamp (newest first)
hadoop fs -ls -t /user/username
# Sort by size
hadoop fs -ls -S -h /data_warehouse
Pro Tip: The -h flag makes file sizes human-readable (KB, MB, GB) instead of raw bytes.
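For illustration, here is what a -h listing might look like (the file names and sizes below are made up):
$ hadoop fs -ls -h /data_warehouse
Found 2 items
-rw-r--r--   3 username supergroup      1.2 G 2024-01-15 10:30 /data_warehouse/events.parquet
-rw-r--r--   3 username supergroup    256.4 M 2024-01-15 09:15 /data_warehouse/users.csv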
Directory Management
# Create directory
hadoop fs -mkdir /user/username/new_dir
# Create nested directories
hadoop fs -mkdir -p /user/username/dir1/dir2/dir3
# Remove empty directory
hadoop fs -rmdir /user/username/empty_dir
# Remove directory and its contents
hadoop fs -rm -R /user/username/dir_with_contents
File Operations
Copying Files Between Local and HDFS
From Local to HDFS:
# Using put
hadoop fs -put localfile.txt /user/username/
# Using copyFromLocal (like put, but the source must be on the local file system)
hadoop fs -copyFromLocal bigdata.csv /user/username/data/
# put with -f overwrites the destination file if it already exists
hadoop fs -put -f localfile.txt /user/username/
From HDFS to Local:
# Using get
hadoop fs -get /user/username/hdfs_file.txt local_directory/
# Using copyToLocal (like get, but the destination must be on the local file system)
hadoop fs -copyToLocal /user/username/hdfs_file.txt .
Operations Within HDFS
# Copy files within HDFS
hadoop fs -cp /source/path/file.txt /destination/path/
# Move files within HDFS
hadoop fs -mv /source/path/file.txt /destination/path/
# Remove file
hadoop fs -rm /user/username/unwanted_file.txt
# Remove directory and contents
hadoop fs -rm -R /user/username/unwanted_directory
File Content Operations
Viewing File Contents
# View an entire file (streams the whole file to your terminal, so avoid on large files)
hadoop fs -cat /user/username/file.txt
# View the first kilobyte of a file
hadoop fs -head /user/username/file.txt
# View the last kilobyte of a file
hadoop fs -tail /user/username/file.txt
# Merge HDFS files into a single file on the local file system
hadoop fs -getmerge /user/username/small_files/* merged_file.txt
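getmerge also accepts a directory as the source, which is handy for collapsing a job's part files into one local file; the -nl flag inserts a newline between concatenated files. A sketch (the output path is a placeholder):
# Collapse a job's output directory into one local file, newline-separated
hadoop fs -getmerge -nl /user/username/job_output results.txt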
Storage and System Information
Storage Usage
# Check overall HDFS capacity, used, and free space
hadoop fs -df -h
# Get space usage of each item under a directory
hadoop fs -du -h /user/username
# Get a one-line summary per matched path
hadoop fs -du -s -h /user/username/*
File System Check
# Check file status and block locations
hdfs fsck /user/username/important_file.txt -files -blocks -locations
# Get filesystem health report
hdfs dfsadmin -report
Advanced Operations
Permission Management
# Change file permissions
hadoop fs -chmod 644 /user/username/file.txt
# Change file ownership
hadoop fs -chown username:group /user/username/file.txt
# Change group ownership
hadoop fs -chgrp newgroup /user/username/file.txt
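All three commands accept -R to apply changes recursively to a directory tree, for example:
# Recursively restrict a directory tree to its owner and group
hadoop fs -chmod -R 750 /user/username/shared_dir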
File Checksums and Verification
# Get file checksum (useful for verifying copies between clusters)
hadoop fs -checksum /user/username/file.txt
# Test file existence (no output; the shell exit status is 0 if the path exists)
hadoop fs -test -e /user/username/file.txt
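Because -test produces no output and communicates only through its exit status, it is most useful inside scripts. A minimal sketch (the path is a placeholder); related flags include -d (path is a directory) and -z (file has zero length):
# Exit status 0 means the path exists
if hadoop fs -test -e /user/username/file.txt; then
  echo "file exists"
else
  echo "file missing"
fi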
Common Data Engineering Scenarios
Data Pipeline Operations
# Quick check whether a directory is empty: 0 lines means empty (non-empty listings
# also include a 'Found N items' header line; see the more robust sketch after this block)
hadoop fs -ls /user/username/input_dir | wc -l
# Move processed files to archive
hadoop fs -mv /user/username/processed/* /user/username/archive/
# Clean up temporary files (bypasses the trash, so deletion is immediate and permanent)
hadoop fs -rm -skipTrash /user/username/temp/*
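The ls | wc -l check above is quick but crude because of the header line. A more robust sketch (paths are placeholders) reads the file count from hadoop fs -count, whose output columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE, and PATHNAME:
# awk picks the FILE_COUNT column out of: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
files=$(hadoop fs -count /user/username/input_dir | awk '{print $2}')
if [ "$files" -eq 0 ]; then
  echo "input directory is empty"
fi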
Data Validation
# Count the number of lines in a file (streams the whole file, so use sparingly on large data)
hadoop fs -cat /user/username/data.txt | wc -l
# Quick data sampling: peek at the first kilobyte
hadoop fs -head /user/username/large_file.csv
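A sketch that ties these together for a simple pre/post-load validation (both paths are placeholders):
# Compare line counts between a staging file and its loaded copy
src=$(hadoop fs -cat /user/username/staging/data.txt | wc -l)
dst=$(hadoop fs -cat /user/username/warehouse/data.txt | wc -l)
if [ "$src" -eq "$dst" ]; then echo "counts match ($src)"; else echo "mismatch: $src vs $dst"; fi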
Best Practices
Always Use Absolute Paths:
Relative paths resolve against your HDFS home directory (/user/<username>), which can behave unexpectedly in scheduled jobs.
hadoop fs -ls /user/username/data # Explicit and unambiguous
Check Commands Before Execution:
# Use count before remove
hadoop fs -count /user/username/to_delete
hadoop fs -rm -R /user/username/to_delete
Use Appropriate Flags:
# Human-readable sizes
hadoop fs -du -h /user/username
# Skip trash for large deletions (permanent: files bypass .Trash and cannot be restored)
hadoop fs -rm -skipTrash /user/username/large_files
Troubleshooting Tips
Permission Issues:
# Check permissions on the entry itself (-d keeps -ls from listing a directory's contents)
hadoop fs -ls -d /user/username/file.txt
# Fix permissions
hadoop fs -chmod 644 /user/username/file.txt
Space Issues:
# Check HDFS space
hadoop fs -df -h
# Find large files (sorted largest first)
hadoop fs -ls -S -h /user/username
Stay Connected!
For more Data Engineering content and updates:
Follow on LinkedIn: Mayank Aggarwal
Subscribe to YouTube: @tech.mayankagg
Read on Medium: @tech.mayankagg
Your support helps create more free educational content for the data community!