Essential Linux Commands for Data Engineers: A Comprehensive Guide
Introduction
For a Data Engineer, proficiency in Linux commands is crucial for managing data pipelines, handling files, and working with distributed systems. This guide covers essential Linux commands with practical examples relevant to data engineering tasks.
Table of Contents
Navigation Commands
File Operations
File Viewing and Editing
File Permissions
Search and Pattern Matching
Data Processing Commands
Common Data Engineering Scenarios
Best Practices for Data Engineers
Tips for Remote Operations
Navigation Commands
Working Directory Management
# Show current directory
pwd
# Output: /home/dataeng
# Go to home directory
cd ~
# Go up one level
cd ..
# Go to specific directory (absolute path)
cd /data/warehouse
# Go to relative path
cd ./raw_data
Pro Tip: Use pwd frequently when running data pipelines to ensure you're in the correct directory for file operations.
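For example, a pipeline script can change into its expected working directory and fail fast if it is missing (the /data/warehouse path below is just an illustration):
# Fail fast if the expected directory does not exist (illustrative path)
cd /data/warehouse || { echo "ERROR: /data/warehouse not found"; exit 1; }
# Confirm the working directory before any file operations
pwd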
Directory Listing
# List files
ls
# Detailed listing with permissions and timestamps
ls -l
# Sort by time (newest first)
ls -lt
# Sort by time (oldest first)
ls -ltr
# Show hidden files
ls -a
# Recursive listing (useful for data directories)
ls -R
File Operations
Creating and Copying Files
# Create empty file
touch data.csv
# Copy file
cp source.csv destination.csv
# Copy, preserving metadata such as timestamps and ownership (important for data lineage)
cp -p source.csv destination.csv
# Copy directory recursively
cp -R /source_dir /target_dir
Moving and Renaming Files
# Move file
mv source.csv /new/location/
# Rename file
mv oldname.csv newname.csv
# Move multiple files to directory
mv file1.csv file2.csv target_dir/
File Removal
# Remove file
rm filename.csv
# Remove directory and contents
rm -r directory_name
# Force remove without prompting (use with caution!)
rm -f locked_file.csv
File Viewing and Editing
Viewing File Contents
# View entire file
cat data.csv
# View first 10 lines
head data.csv
# View first n lines
head -n 20 data.csv
# View last 10 lines
tail data.csv
# View last n lines
tail -n 20 data.csv
# Follow log file in real-time (crucial for monitoring data pipelines)
tail -f pipeline.log
File Editing
# Open file in vi editor
vi data.csv
# Basic vi commands:
# i - enter insert mode
# esc - exit insert mode
# :w - save
# :q - quit
# :wq - save and quit
# :q! - quit without saving
File Permissions
Understanding Permissions
# View file permissions
ls -l
# Output: -rw-r--r-- 1 dataeng datagrp 1024 Jan 4 10:00 data.csv
# Permission structure:
# r (read) = 4
# w (write) = 2
# x (execute) = 1
# Change permissions
chmod 644 data.csv # Owner: rw-, Group: r--, Others: r--
chmod 755 script.sh # Owner: rwx, Group: r-x, Others: r-x
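Each digit is the sum of these values for owner, group, and others, and chmod also accepts a symbolic form:
# 644: owner 6 = 4+2 (rw-), group 4 (r--), others 4 (r--)
# 755: owner 7 = 4+2+1 (rwx), group 5 = 4+1 (r-x), others 5 (r-x)
# Symbolic form: add execute permission for the owner only
chmod u+x script.sh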
Search and Pattern Matching
Using grep
# Search for pattern in file
grep "ERROR" pipeline.log
# Case insensitive search
grep -i "error" pipeline.log
# Recursive search in directory
grep -r "FAILED" /logs/
# Count matching lines (grep -c counts lines, not total occurrences)
grep -c "SUCCESS" pipeline.log
Finding Files
# Find files by name
find /data -name "*.csv"
# Find files modified in last 24 hours
find /data -mtime -1
# Find and execute command
find /data -name "*.tmp" -exec rm {} \;
Data Processing Commands
Basic Data Processing
# Count lines in file
wc -l data.csv
# Sort data
sort data.csv > sorted_data.csv
# Remove duplicates
sort data.csv | uniq > unique_data.csv
# Split large files
split -l 1000000 large_file.csv chunk_
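By default, split names the pieces chunk_aa, chunk_ab, and so on; a quick way to sanity-check the result (file names follow the example above):
# List the generated chunks
ls chunk_*
# Recombine the chunks into a single file
cat chunk_* > recombined.csv
# Line counts of the original and recombined files should match
wc -l large_file.csv recombined.csv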
Data Transformation
# Extract specific columns (using cut)
cut -d',' -f1,2 data.csv > subset.csv
# Replace text
sed 's/old/new/g' data.csv > modified.csv
# Filter rows
awk -F',' '$3 > 1000' data.csv > filtered.csv
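These tools compose well in a single pipeline. A minimal sketch (column numbers, delimiter, and threshold are illustrative): keep rows whose third column exceeds 1000, project the first two columns, and sort the result:
# Filter, project columns, and sort in one pass (illustrative columns/threshold)
awk -F',' '$3 > 1000' data.csv | cut -d',' -f1,2 | sort > report.csv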
Common Data Engineering Scenarios
Log Analysis
# Find error patterns in logs
grep "ERROR" app.log | cut -d' ' -f1,2 | sort | uniq -c
# Monitor failed jobs
tail -f pipeline.log | grep --line-buffered "FAILED"
# Calculate success rate
echo "Success rate: $(grep -c "SUCCESS" job.log)/$(wc -l < job.log)"
Data Pipeline Operations
# Monitor disk usage
du -h /data/warehouse
# Check file counts
ls -1 /data/input | wc -l
# Verify file integrity
md5sum data.csv > checksum.txt
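The recorded checksums can later be verified with md5sum -c, for example after copying files between systems:
# Verify files against the recorded checksums
md5sum -c checksum.txt
# Output: data.csv: OK (or FAILED if the file has changed)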
Compression and Archiving
# Compress files
gzip large_file.csv
# Create tar archive
tar -czf archive.tar.gz /data/files/
# Extract tar archive
tar -xzf archive.tar.gz
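It is often worth listing an archive's contents before extracting, and gzip'ed files are restored with gunzip:
# List archive contents without extracting
tar -tzf archive.tar.gz
# Decompress a gzip'ed file (restores large_file.csv)
gunzip large_file.csv.gz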
Best Practices for Data Engineers
Always Use Absolute Paths in Scripts:
/data/warehouse/raw/input.csv # Better than relative paths
Check Command Success:
command && echo "Success" || echo "Failed"
Use Variables for Repeated Values:
DATA_DIR="/data/warehouse"
cd "$DATA_DIR"
Create Backup Before Operations:
cp data.csv data.csv.bak
Monitor Resource Usage:
df -h # Check disk space
top # Monitor processes
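A minimal sketch of how these practices might fit together in a pipeline script (paths and file names are illustrative; set -euo pipefail is a common extra safeguard not covered above):
#!/bin/bash
set -euo pipefail                                  # stop on errors and unset variables
DATA_DIR="/data/warehouse"                         # variable instead of a repeated literal path
cp "$DATA_DIR/data.csv" "$DATA_DIR/data.csv.bak"   # backup before modifying
sort "$DATA_DIR/data.csv" | uniq > "$DATA_DIR/unique_data.csv" \
  && echo "Success" || echo "Failed"
df -h "$DATA_DIR"                                  # check remaining disk space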
Tips for Remote Operations
# Secure copy files between servers
scp data.csv user@remote:/data/
# Execute remote commands
ssh user@remote "ls -l /data/"
# Tunnel for database connections
ssh -L 3306:localhost:3306 user@remote
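With the tunnel open, a local database client can connect through it; assuming a MySQL server on the remote host (the user name here is illustrative):
# Connect to the remote MySQL server via the local end of the tunnel
mysql -h 127.0.0.1 -P 3306 -u dataeng -p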
Remember: Always test commands on a small dataset first, especially when using destructive operations like rm or mv. Consider using the -i (interactive) flag for additional safety.
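For example:
# Prompt before every removal or overwrite
rm -i old_data.csv
mv -i source.csv /data/archive/
cp -i source.csv destination.csv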