User-Space Pseudo-Distributed Hadoop Setup
This tutorial assumes:
- ❌ No sudo
- ❌ Cannot create users/groups
- ❌ Cannot write outside $HOME
- ✅ Can run user processes
- ✅ Can write inside $HOME
- ✅ SSH client works (server must already exist on machine)
End Goal
Single-node Hadoop cluster in your home directory:
- HDFS
  - NameNode (metadata)
  - DataNode (block storage)
- YARN
  - ResourceManager
  - NodeManager
- MapReduce
Everything will live under:
$HOME/hadoop # Hadoop executables/config files
$HOME/hadoop-data # HDFS and YARN data
Prerequisites (NO sudo version)
Check Java (existing installation expected)
You cannot install Java. So verify it exists:
java -version
If Java is available, find its path:
readlink -f $(which java)
Example output:
/usr/lib/jvm/java-11-openjdk-amd64/bin/java
Then set:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Add this to your ~/.bashrc.
If Java is NOT available → ask your sysadmin to install it.
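If you prefer not to hard-code the path, you can derive JAVA_HOME from the java binary itself (a small convenience sketch, assuming readlink -f is available, as it is on most Linux systems):
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))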
SSH (No installation)
You cannot install openssh-server.
Check if SSH works:
ssh localhost
If it connects → good.
If it asks for a password → set up passwordless login (next step).
Setup Passwordless SSH (User-level only)
mkdir -p ~/.ssh
chmod 700 ~/.ssh
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# Test
ssh localhost
If login succeeds without password → ready.
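Hadoop's start scripts ssh to localhost non-interactively, so a first-connection host-key prompt can block them. A user-level ~/.ssh/config entry avoids that (a sketch; the Port line is only needed if your sshd listens on a non-standard port, and accept-new requires OpenSSH 7.6 or newer):
Host localhost
    StrictHostKeyChecking accept-new
    Port 22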
Download Hadoop (Inside HOME)
cd $HOME
wget https://dlcdn.apache.org/hadoop/common/stable/hadoop-3.4.2.tar.gz
tar -xvzf hadoop-3.4.2.tar.gz
rm hadoop-3.4.2.tar.gz
mv hadoop-3.4.2 hadoop
Now Hadoop is here:
$HOME/hadoop
Set environment variables in ~/.bashrc:
export HADOOP_HOME=$HOME/hadoop
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 # adjust to your path
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Reload:
source ~/.bashrc
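To confirm the binaries are now on your PATH and that JAVA_HOME resolves, run:
hadoop version
It should print the Hadoop version and build information.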
Hadoop Configuration (Home-directory only)
We will store ALL Hadoop data inside:
$HOME/hadoop-data
Create directories:
mkdir -p $HOME/hadoop-data/tmp
mkdir -p $HOME/hadoop-data/dfs/name
mkdir -p $HOME/hadoop-data/dfs/data
Edit hadoop-env.sh
File:
$HADOOP_HOME/etc/hadoop/hadoop-env.sh
Set:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 # adjust to your path
core-site.xml
Replace contents with:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/YOUR_USERNAME/hadoop-data/tmp</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Replace YOUR_USERNAME with the output of:
whoami
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/YOUR_USERNAME/hadoop-data/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/YOUR_USERNAME/hadoop-data/dfs/data</value>
</property>
</configuration>
Replace YOUR_USERNAME with the output of:
whoami
mapred-site.xml
Edit:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/home/YOUR_USERNAME/hadoop</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/home/YOUR_USERNAME/hadoop</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/home/YOUR_USERNAME/hadoop</value>
</property>
</configuration>
Replace YOUR_USERNAME with the output of:
whoami
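If you prefer not to edit the username by hand in each file, you can substitute it in one pass after pasting the templates above (a convenience sketch, assuming GNU sed and that the placeholder YOUR_USERNAME appears verbatim in the files):
cd $HADOOP_HOME/etc/hadoop
sed -i "s|YOUR_USERNAME|$(whoami)|g" core-site.xml hdfs-site.xml mapred-site.xml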
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Format and Start Hadoop
⚠ Do this only once:
hdfs namenode -format
Start services:
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
Check processes:
jps
You should see:
NameNode
DataNode
ResourceManager
NodeManager
(start-dfs.sh typically also starts a SecondaryNameNode, and jps lists itself as Jps.)
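If one of the daemons is missing, its log under $HADOOP_HOME/logs usually says why (file names follow the pattern hadoop-<user>-<daemon>-<hostname>.log; adjust the glob below to the daemon that failed):
ls -t $HADOOP_HOME/logs/*.log | head
tail -n 50 $HADOOP_HOME/logs/*-namenode-*.log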
Web UI
| Service | URL |
|---|---|
| NameNode | http://localhost:9870 |
| YARN | http://localhost:8088 |
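If you only have shell access to the machine, you can still confirm the UIs respond (a quick check, assuming curl is installed; 200 means the page is being served):
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9870
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088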
Prepare Test Data (Home-only)
Create local test directory:
mkdir -p $HOME/gutenberg
cd $HOME/gutenberg
wget https://www.gutenberg.org/cache/epub/5000/pg5000.txt
Create a larger file (100 concatenated copies) so it is likely to span more than one HDFS block:
for i in {1..100}; do cat pg5000.txt; done > pg5000_100.txt
Load Data to HDFS
hdfs dfs -mkdir -p /user/$USER/gutenberg
hdfs dfs -put $HOME/gutenberg/* /user/$USER/gutenberg
hdfs dfs -ls /user/$USER/gutenberg
Investigate Blocks
Check block size:
hdfs getconf -confKey dfs.blocksize
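The default is 134217728 bytes (128 MB). To see how large the uploaded files are relative to that, check their sizes:
hdfs dfs -du -h /user/$USER/gutenberg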
Check blocks:
hdfs fsck /user/$USER/gutenberg/pg5000_100.txt -files -blocks -locations
Run WordCount
hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar \
wordcount \
/user/$USER/gutenberg/pg5000.txt \
/user/$USER/output
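MapReduce refuses to write into an existing output directory. If the job fails because /user/$USER/output already exists (e.g. from an earlier run), remove it and rerun:
hdfs dfs -rm -r /user/$USER/output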
View result:
hdfs dfs -cat /user/$USER/output/part-r-00000 | head
Everything is now user-space only.
Common Issues (Non-sudo environment)
- SSH to localhost not working → ask your admin to enable sshd
- Port already in use → change the port numbers in the config files
- JAVA_HOME wrong → NameNode fails to start (recheck hadoop-env.sh and ~/.bashrc)
- Firewall blocking the web UIs → tunnel the ports over SSH (see below)
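A user-level SSH tunnel from your own machine is usually enough to reach the UIs. A sketch, assuming you normally log in as YOUR_USERNAME@cluster-host (substitute your real login); run it on your laptop, then browse to http://localhost:9870 and http://localhost:8088 there:
ssh -N -L 9870:localhost:9870 -L 8088:localhost:8088 YOUR_USERNAME@cluster-host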
Sort based on count
Now we'll extend the home-directory Hadoop setup with:
✅ A second MapReduce job
✅ Written in Python (streaming mode)
✅ That sorts the output by count (descending)
✅ And saves the final sorted results in HDFS
We’ll use Hadoop Streaming, which lets you plug in any executable (Python, Bash, etc.) as mapper/reducer.
Goal
Input:
word \t count
Output:
count \t word
Sorted by count (descending).
We’ll chain:
- WordCount (already done)
- Python Streaming Job → sort by count
1. Verify WordCount Output
Your previous job created:
/user/$USER/output
Check it:
hdfs dfs -cat /user/$USER/output/part-r-00000 | head
It looks like:
A 32
AND 15
THE 1002
2. Create Python Mapper & Reducer
Create a directory:
mkdir -p $HOME/streaming-sort
cd $HOME/streaming-sort
Mapper: swap.py
This mapper swaps (word, count) → (count, word)
#!/usr/bin/env python3
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, count = line.split()
    # Emit: count as key, word as value
    print(f"{int(count):010d}\t{word}")
Why :010d?
Zero padding ensures proper lexical sorting:
0000000010
0000000100
Hadoop sorts keys as strings either way; without the padding, "9" would sort after "10". The padding makes string order coincide with numeric order.
Make executable:
chmod +x swap.py
Reducer: identity_reduce.py
Reducer just prints sorted input.
#!/usr/bin/env python3
import sys

for line in sys.stdin:
    print(line.strip())
Make executable:
chmod +x identity_reduce.py
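Before submitting the job, you can simulate the streaming pipeline locally with a plain shell pipe (run from $HOME/streaming-sort; this mimics map → shuffle/sort → reduce and is a quick way to catch Python errors):
hdfs dfs -cat /user/$USER/output/part-r-00000 | ./swap.py | sort | ./identity_reduce.py | head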
Run Hadoop Streaming Job
Hadoop provides streaming jar:
$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.4.2.jar
Run:
hadoop jar \
$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.4.2.jar \
-input /user/$USER/output \
-output /user/$USER/sorted-output \
-mapper $HOME/streaming-sort/swap.py \
-reducer $HOME/streaming-sort/identity_reduce.py \
-file $HOME/streaming-sort/swap.py \
-file $HOME/streaming-sort/identity_reduce.py
If output directory exists:
hdfs dfs -rm -r /user/$USER/sorted-output
Then rerun.
View Sorted Output
hdfs dfs -cat /user/$USER/sorted-output/part-00000 | head
You’ll see:
0000000001 a
0000000001 ability
0000000002 above
...
⚠ This is ascending order (default Hadoop sort).
Sort in Descending Order
We need a custom comparator.
Add this option:
-D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
-D mapreduce.partition.keycomparator.options=-r
Full command:
hadoop jar \
$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.4.2.jar \
-D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
-D mapreduce.partition.keycomparator.options=-r \
-input /user/$USER/output \
-output /user/$USER/sorted-output-desc \
-mapper $HOME/streaming-sort/swap.py \
-reducer $HOME/streaming-sort/identity_reduce.py \
-file $HOME/streaming-sort/swap.py \
-file $HOME/streaming-sort/identity_reduce.py
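As an aside, the comparator options mimic Unix sort flags, so a numeric reverse sort should also work and would make the zero padding in swap.py unnecessary; treat this variant as an untested sketch:
-D mapreduce.partition.keycomparator.options=-nr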
Final Output
hdfs dfs -cat /user/$USER/sorted-output-desc/part-00000 | head
Now you’ll see:
0000001023 the
0000000897 and
0000000784 of
...
Conceptual Pipeline
Raw Text
↓
WordCount (Java)
↓
(word, count)
↓
Streaming Mapper (swap)
↓
(count, word)
↓
Shuffle & Sort (Hadoop)
↓
Sorted Output