Core concepts
1. Distributed Filesystems (DFS)
To process massive datasets with MapReduce, you first need a way to store them across multiple machines.
- Blocks: Files are split into large fixed-size chunks (e.g., 128MB).
- Replication: Each block is replicated (usually 3x) across different nodes for fault tolerance.
- NameNode/Master: Stores metadata (file directory tree and the mapping of file blocks to DataNodes).
- DataNode/Worker: Stores the actual data blocks.
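The block/replication bookkeeping above can be sketched in miniature. This is an illustrative model, not HDFS's actual placement algorithm (which is rack-aware); the 128MB block size matches the example above, and round-robin placement is an assumption for simplicity.

```python
# Sketch: splitting a file into fixed-size blocks and placing replicas.
# Round-robin placement is an illustrative simplification of HDFS's
# rack-aware policy.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, as in the example above
REPLICATION = 3                 # typical replication factor

def split_into_blocks(file_size: int) -> int:
    """Number of blocks needed for a file (the last block may be partial)."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE

def place_replicas(num_blocks: int, nodes: list) -> dict:
    """Assign each block REPLICATION distinct nodes, round-robin."""
    return {
        b: [nodes[(b + i) % len(nodes)] for i in range(REPLICATION)]
        for b in range(num_blocks)
    }

blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file -> 3 blocks
placement = place_replicas(blocks, ["n1", "n2", "n3", "n4"])
```

Losing one node still leaves two copies of every block, which is what makes the fault-tolerance story below work.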
2. Data Locality
Moving computation is cheaper than moving data. Instead of pulling data to a central processing server, the MapReduce framework schedules the Map task on the node where the data already resides (or as close as possible).
3. Shared-Nothing Architecture
- Nodes do not share memory or disk storage.
- They communicate only via the network.
- This makes the system linearly scalable; adding more nodes adds more processing power and storage without complex locking mechanisms.
4. Fault Tolerance
In a cluster of thousands of machines, failure is the norm, not the exception.
- Worker Failure: The Master detects that a worker node is unresponsive (heartbeats stop) and re-schedules that worker's tasks on other nodes holding a replica of the data.
- Stragglers: If a machine is slow (but not dead), the Master can launch a "Speculative Execution" (backup copy) of the task on another machine. The result from whoever finishes first is taken.
Complete MapReduce Execution Pipeline
Job Submission & Planning (Before Everything)
- Client submits job (jar + config).
- Framework:
- Computes number of input splits
- Determines number of map tasks
- Determines number of reduce tasks
- Initializes metadata structures
- Schedules tasks close to data (data locality).
Input Splits
Logical View
The input dataset is logically divided into splits. Each split generates one Map task.
Formal Input to Mapper
( k_1, v_1 ), where:
- ( k_1 ) = input key (e.g., byte offset)
- ( v_1 ) = input value (e.g., line of text)
Systems Perspective
- Splits usually match HDFS block size (e.g., 128MB).
- Map tasks are scheduled on nodes that physically contain the data.
- Reading is sequential and streaming.
Map Phase
Logical View
User defines:
map(k_1, v_1) => list( (k_2, v_2) )
Example (WordCount)
map(offset, line) => (word, 1)
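The WordCount map function above can be sketched in Python (the function name is illustrative):

```python
# Sketch of the WordCount mapper: (offset, line) -> one (word, 1) pair per word.
def map_wordcount(offset, line):
    for word in line.split():
        yield (word, 1)

pairs = list(map_wordcount(0, "to be or not to be"))
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
```

Note that the input key (the byte offset) is simply ignored here, which is common for text input.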
2.1 Map-Side Buffering (Hidden but Critical)
The mapper does NOT immediately write output to disk. Intermediate pairs are stored in an in-memory circular buffer.
- Size controlled by: mapreduce.task.io.sort.mb
The buffer stores:
- Serialized key-value data
- Metadata (partition, offsets)
2.2 Spill Phase (When Buffer Fills)
When the buffer reaches a threshold (~80% full):
1. Records are sorted by key
2. Partitioned by reducer
3. Optionally passed through the Combiner
4. Written to local disk as a spill file
This produces:
spill_1
spill_2
spill_3
...
Each spill file is:
- Sorted
- Partitioned
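A minimal in-memory sketch of a spill, assuming two reducers; crc32 stands in for the real hash so the example is deterministic (Hadoop's HashPartitioner actually uses the key's hashCode):

```python
import zlib

R = 2  # illustrative: two reducers

def partition(key):
    # Stable crc32 hash for reproducibility (Hadoop uses key.hashCode() % R).
    return zlib.crc32(key.encode()) % R

def spill(buffer):
    # A spill file holds records sorted by (partition, key): all records for
    # reducer 0 first, in key order, then all records for reducer 1, etc.
    return sorted((partition(k), k, v) for k, v in buffer)

buf = [("be", 1), ("to", 1), ("be", 1), ("or", 1)]
spill_file = spill(buf)
```

Sorting on the (partition, key) pair is what makes each spill file both partitioned and sorted at once.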
2.3 Combiner (Optional Optimization)
If defined:
combine(k_2, list(v_2)) => list( (k_2, v_2) )
Executed:
- During spill
- Or during merge
Constraints:
- Must be associative
- Must be commutative
Purpose:
- Reduce intermediate data size
- Reduce network traffic
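For WordCount, summing counts is both associative and commutative, so a combiner is safe. A sketch:

```python
from collections import defaultdict

def combine(pairs):
    # Pre-sum counts locally before anything crosses the network.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())

combined = combine([("be", 1), ("to", 1), ("be", 1), ("to", 1), ("to", 1)])
# [('be', 2), ('to', 3)] -- 5 pairs shrink to 2
```

Five intermediate pairs become two, which is exactly the network-traffic reduction the combiner exists for.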
2.4 Map-Side Merge Phase
If multiple spill files exist:
- The framework performs a multi-way external merge sort.
- Produces ONE final map output file.
This file is:
- Sorted by key
- Partitioned by reducer
- Indexed (offsets for each partition)
Now mapper is complete.
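The multi-way merge can be sketched with heapq.merge standing in for the framework's external merge sort (the spill contents are made-up, and real spills live on disk, not in lists):

```python
import heapq

# Three already-sorted spill files (in-memory stand-ins for on-disk files).
spill_1 = [("be", 1), ("not", 1)]
spill_2 = [("be", 1), ("to", 1)]
spill_3 = [("or", 1), ("to", 1)]

# heapq.merge consumes the sorted inputs lazily and yields one sorted stream,
# which is the same streaming behavior an external merge sort relies on.
merged = list(heapq.merge(spill_1, spill_2, spill_3))
# [('be', 1), ('be', 1), ('not', 1), ('or', 1), ('to', 1), ('to', 1)]
```

Because every input is already sorted, the merge never needs all the data in memory at once.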
3. Shuffle and Sort Phase
This is fully handled by the framework.
But it is NOT “magic”. It has multiple sub-phases:
3.1 Partitioning
Each intermediate key ( k_2 ) is assigned:
partition = hash(k_2) mod R
Where:
- ( R ) = number of reducers
Each reducer is responsible for one partition.
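A sketch of the partition formula, with crc32 standing in for the real hash so the result is deterministic across runs (Python's built-in hash for strings is salted per process):

```python
import zlib

def partitioner(key, R):
    # partition = hash(k_2) mod R, using a stable hash for reproducibility.
    return zlib.crc32(key.encode()) % R

# The crucial property: every occurrence of a key maps to the same partition,
# no matter which mapper emitted it -- so one reducer sees all of that key.
p1 = partitioner("apple", 4)
p2 = partitioner("apple", 4)
```

This determinism is what lets independent mappers agree, without coordination, on which reducer owns each key.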
3.2 Copy Phase (Network Transfer)
Each reducer:
- Contacts all mappers
- Fetches its partition
This is:
- Parallel
- Over network
- Often compressed
This stage is usually the biggest bottleneck.
3.3 Reduce-Side Buffering
Reducer receives segments:
- Small segments → memory
- Large segments → disk
If memory fills:
- Segments spill to disk
3.4 Reduce-Side Merge Phase
Reducer performs multi-way external merge:
segment1
segment2
segment3
...
Merged into one sorted stream.
Now framework guarantees:
Grouped and sorted by key.
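That "grouped and sorted" guarantee is exactly what itertools.groupby gives over a sorted stream, so the framework's view can be sketched as:

```python
from itertools import groupby

# One sorted stream, as produced by the reduce-side merge above.
merged = [("be", 1), ("be", 1), ("not", 1), ("to", 1), ("to", 1)]

# Group consecutive records with the same key: (k_2, list(v_2)).
grouped = [(k, [v for _, v in g])
           for k, g in groupby(merged, key=lambda kv: kv[0])]
# [('be', [1, 1]), ('not', [1]), ('to', [1, 1])]
```

Grouping only works because the stream is sorted: groupby, like the framework, merges only consecutive equal keys.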
4. Reduce Phase
Logical View
User defines:
reduce(k_2, list(v_2)) => list( (k_3, v_3) )
Reducer processes one key at a time.
Important Property
The reducer sees values as an iterator, not a full list in memory. Streaming execution:
for key in sorted_keys:
    process values
Memory-efficient.
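The streaming loop above, sketched for WordCount: each key's values are consumed as an iterator, so memory stays constant per key.

```python
from itertools import groupby

def reduce_wordcount(sorted_pairs):
    # One key at a time; `group` is an iterator, never a materialized list.
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield (word, sum(v for _, v in group))

result = list(reduce_wordcount([("be", 1), ("be", 1), ("to", 1)]))
# [('be', 2), ('to', 1)]
```

Even a key with millions of values is summed in constant memory, which is why the iterator contract matters.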
5. Output Phase
The reducer writes its ( k_3, v_3 ) output pairs to HDFS.
Properties:
- One output file per reducer
- Written sequentially
- Replicated (e.g., 3 copies)
- Durable
Unlike map output, reduce output survives failures.
Failure Handling (Cross-Cutting Concern)
If Map Fails:
- Task is re-executed.
- Intermediate output recomputed.
If Reducer Fails:
- Restarted.
- Refetches partitions.
Why It Works:
- Deterministic tasks
- Side-effect free user code
- Intermediate output recomputable
Complete Pipeline (Compact Structured Form)
The Phases
0. Job Setup:
- Job submitted
- Input splits computed
- Tasks scheduled (data locality)
1. Input Splits:
- Logical division of input into splits
- Each split → one Map task
- Mapper Input: <k1, v1>
2. Map Phase:
- User function:
map(k1, v1) → list(<k2, v2>)
- Output written to in-memory buffer
2.1 Buffering & Spill:
- Buffer fills
- Records sorted by key
- Partitioned by reducer
- Optional Combiner applied
- Written to local disk as spill files
2.2 Map-Side Merge:
- Multiple spills merged
- Final sorted, partitioned map output produced
3. Shuffle and Sort:
- Partition: hash(k2) % R
- Reducers fetch partitions over network
- Reduce-side buffering
- External merge sort
- Framework groups values:
<k2, list(v2)>
4. Reduce Phase:
- User function:
reduce(k2, list(v2)) → list(<k3, v3>)
- Streaming execution per key
5. Output:
- Results written to HDFS
- One file per reducer
- Durable storage
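The phases above can be simulated end to end for WordCount in a few lines. This is a single-process sketch: real buffering, spills, network transfer, and HDFS writes are elided, and crc32 stands in for the real hash.

```python
import zlib
from itertools import groupby

R = 2  # illustrative: two reducers

def run_job(lines):
    # 2. Map phase: each line -> (word, 1) pairs.
    intermediate = [(w, 1) for line in lines for w in line.split()]
    # 3. Shuffle: partition by hash(k) % R, one bucket per reducer, then sort.
    buckets = [[] for _ in range(R)]
    for k, v in intermediate:
        buckets[zlib.crc32(k.encode()) % R].append((k, v))
    # 4. Reduce phase: each reducer groups its sorted bucket and sums.
    output = []
    for bucket in buckets:
        bucket.sort()
        for k, g in groupby(bucket, key=lambda kv: kv[0]):
            output.append((k, sum(v for _, v in g)))
    # 5. Output: one result set per reducer (sorted here for readability).
    return sorted(output)

counts = run_job(["to be or not to be", "to do"])
# [('be', 2), ('do', 1), ('not', 1), ('or', 1), ('to', 3)]
```

Every phase of the compact form above appears once, in order; only the distribution across machines is missing.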
Hadoop Terminology
Daemons (Services)
- NameNode: Master of the filesystem (HDFS). Keeps the directory tree and knows where all blocks are.
- DataNode: Worker of the filesystem. Stores the actual blocks on its HDD.
- ResourceManager (YARN): The master arbitrator of all cluster resources.
- NodeManager (YARN): The per-machine agent dealing with containers on a single node.
Job Components
- Job: A full execution unit (Mapper + Reducer + Config).
- Task: An individual slice of work (Map Task or Reduce Task) running on one machine.
- Driver: The code that submits the job and configures it.
- Partitioner: Decides which Reducer gets which key (usually hash(key) % num_reducers).
- Combiner: A "Mini-Reducer" running locally after the Map phase to reduce network traffic (e.g., pre-summing counts).