Core concepts
1. Distributed Filesystems (DFS)
To process massive datasets with MapReduce, you first need a way to store them across multiple machines.
- Blocks: Files are split into large fixed-size chunks (e.g., 128MB).
- Replication: Each block is replicated (usually 3x) across different nodes for fault tolerance.
- NameNode/Master: Stores metadata (file directory tree and the mapping of file blocks to DataNodes).
- DataNode/Worker: Stores the actual data blocks.
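The block/replication bookkeeping above can be sketched in miniature. This is an illustrative model, not HDFS's actual placement algorithm (which is rack-aware); the 128MB block size matches the example above, and round-robin placement is an assumption for simplicity.

```python
# Sketch: splitting a file into fixed-size blocks and placing replicas.
# Round-robin placement is an illustrative simplification of HDFS's
# rack-aware policy.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, as in the example above
REPLICATION = 3                 # typical replication factor

def split_into_blocks(file_size: int) -> int:
    """Number of blocks needed for a file (the last block may be partial)."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE

def place_replicas(num_blocks: int, nodes: list) -> dict:
    """Assign each block REPLICATION distinct nodes, round-robin."""
    return {
        b: [nodes[(b + i) % len(nodes)] for i in range(REPLICATION)]
        for b in range(num_blocks)
    }

blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file -> 3 blocks
placement = place_replicas(blocks, ["n1", "n2", "n3", "n4"])
```

Losing one node still leaves two copies of every block, which is what makes the fault-tolerance story below work.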
2. Data Locality
Moving computation is cheaper than moving data. Instead of pulling data to a central processing server, the MapReduce framework schedules the Map task on the node where the data already resides (or as close as possible).
3. Shared-Nothing Architecture
- Nodes do not share memory or disk storage.
- They communicate only via the network.
- This makes the system linearly scalable; adding more nodes adds more processing power and storage without complex locking mechanisms.
4. Fault Tolerance
In a cluster of thousands of machines, failure is the norm, not the exception.
- Worker Failure: The Master detects that a worker node is unresponsive (heartbeats stop) and re-schedules that worker's tasks on other nodes holding a replica of the data.
- Stragglers: If a machine is slow (but not dead), the Master can launch a "Speculative Execution" (backup copy) of the task on another machine. The result from whoever finishes first is taken.
Complete MapReduce Execution Pipeline
Job Submission & Planning (Before Everything)
- Client submits job (jar + config).
- Framework:
- Computes number of input splits
- Determines number of map tasks
- Determines number of reduce tasks
- Initializes metadata structures
- Schedules tasks close to data (data locality).
Input Splits
Logical View
The input dataset is logically divided into splits. Each split generates one Map task.
Formal Input to Mapper
( k_1, v_1 ), where:
- ( k_1 ) = input key (e.g., byte offset)
- ( v_1 ) = input value (e.g., line of text)
Systems Perspective
- Splits usually match HDFS block size (e.g., 128MB).
- Map tasks are scheduled on nodes that physically contain the data.
- Reading is sequential and streaming.
Map Phase
Logical View
User defines:
map(k_1, v_1) => list( (k_2, v_2) )
Example (WordCount)
map(offset, line) => (word, 1)
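The WordCount map function above can be sketched in Python (the function name is illustrative):

```python
# Sketch of the WordCount mapper: (offset, line) -> one (word, 1) pair per word.
def map_wordcount(offset, line):
    for word in line.split():
        yield (word, 1)

pairs = list(map_wordcount(0, "to be or not to be"))
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
```

Note that the input key (the byte offset) is simply ignored here, which is common for text input.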
2.1 Map-Side Buffering (Hidden but Critical)
The mapper does NOT immediately write output to disk. Intermediate pairs are stored in an in-memory circular buffer.
- Size controlled by: mapreduce.task.io.sort.mb
The buffer stores:
- Serialized key-value data
- Metadata (partition, offsets)
2.2 Spill Phase (When Buffer Fills)
When the buffer reaches a threshold (~80% full):
1. Records are sorted by key
2. Partitioned by reducer
3. Optionally passed through the Combiner
4. Written to local disk as a spill file
This produces:
spill_1
spill_2
spill_3
...
Each spill file is:
- Sorted
- Partitioned
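A minimal in-memory sketch of a spill, assuming two reducers; crc32 stands in for the real hash so the example is deterministic (Hadoop's HashPartitioner actually uses the key's hashCode):

```python
import zlib

R = 2  # illustrative: two reducers

def partition(key):
    # Stable crc32 hash for reproducibility (Hadoop uses key.hashCode() % R).
    return zlib.crc32(key.encode()) % R

def spill(buffer):
    # A spill file holds records sorted by (partition, key): all records for
    # reducer 0 first, in key order, then all records for reducer 1, etc.
    return sorted((partition(k), k, v) for k, v in buffer)

buf = [("be", 1), ("to", 1), ("be", 1), ("or", 1)]
spill_file = spill(buf)
```

Sorting on the (partition, key) pair is what makes each spill file both partitioned and sorted at once.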
2.3 Combiner (Optional Optimization)
If defined:
combine(k_2, list(v_2)) => list( (k_2, v_2) )
Executed:
- During spill
- Or during merge
Constraints:
- Must be associative
- Must be commutative
Purpose:
- Reduce intermediate data size
- Reduce network traffic
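For WordCount, summing counts is both associative and commutative, so a combiner is safe. A sketch:

```python
from collections import defaultdict

def combine(pairs):
    # Pre-sum counts locally before anything crosses the network.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())

combined = combine([("be", 1), ("to", 1), ("be", 1), ("to", 1), ("to", 1)])
# [('be', 2), ('to', 3)] -- 5 pairs shrink to 2
```

Five intermediate pairs become two, which is exactly the network-traffic reduction the combiner exists for.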
2.4 Map-Side Merge Phase
If multiple spill files exist:
- The framework performs a multi-way external merge sort.
- Produces ONE final map output file.
This file is:
- Sorted by key
- Partitioned by reducer
- Indexed (offsets for each partition)
Now mapper is complete.
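The multi-way merge can be sketched with heapq.merge standing in for the framework's external merge sort (the spill contents are made-up, and real spills live on disk, not in lists):

```python
import heapq

# Three already-sorted spill files (in-memory stand-ins for on-disk files).
spill_1 = [("be", 1), ("not", 1)]
spill_2 = [("be", 1), ("to", 1)]
spill_3 = [("or", 1), ("to", 1)]

# heapq.merge consumes the sorted inputs lazily and yields one sorted stream,
# which is the same streaming behavior an external merge sort relies on.
merged = list(heapq.merge(spill_1, spill_2, spill_3))
# [('be', 1), ('be', 1), ('not', 1), ('or', 1), ('to', 1), ('to', 1)]
```

Because every input is already sorted, the merge never needs all the data in memory at once.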
3. Shuffle and Sort Phase
This is fully handled by the framework.
But it is NOT “magic”. It has multiple sub-phases:
3.1 Partitioning
Each intermediate key ( k_2 ) is assigned:
partition = hash(k_2) mod R
Where:
- ( R ) = number of reducers
Each reducer is responsible for one partition.
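A sketch of the partition formula, with crc32 standing in for the real hash so the result is deterministic across runs (Python's built-in hash for strings is salted per process):

```python
import zlib

def partitioner(key, R):
    # partition = hash(k_2) mod R, using a stable hash for reproducibility.
    return zlib.crc32(key.encode()) % R

# The crucial property: every occurrence of a key maps to the same partition,
# no matter which mapper emitted it -- so one reducer sees all of that key.
p1 = partitioner("apple", 4)
p2 = partitioner("apple", 4)
```

This determinism is what lets independent mappers agree, without coordination, on which reducer owns each key.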
3.2 Copy Phase (Network Transfer)
Each reducer:
- Contacts all mappers
- Fetches its partition
This is:
- Parallel
- Over network
- Often compressed
This stage is usually the biggest bottleneck.
3.3 Reduce-Side Buffering
Reducer receives segments:
- Small segments → memory
- Large segments → disk
If memory fills:
- Segments spill to disk
3.4 Reduce-Side Merge Phase
Reducer performs multi-way external merge:
segment1
segment2
segment3
...
Merged into one sorted stream.
Now framework guarantees:
Grouped and sorted by key.
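That "grouped and sorted" guarantee is exactly what itertools.groupby gives over a sorted stream, so the framework's view can be sketched as:

```python
from itertools import groupby

# One sorted stream, as produced by the reduce-side merge above.
merged = [("be", 1), ("be", 1), ("not", 1), ("to", 1), ("to", 1)]

# Group consecutive records with the same key: (k_2, list(v_2)).
grouped = [(k, [v for _, v in g])
           for k, g in groupby(merged, key=lambda kv: kv[0])]
# [('be', [1, 1]), ('not', [1]), ('to', [1, 1])]
```

Grouping only works because the stream is sorted: groupby, like the framework, merges only consecutive equal keys.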
4. Reduce Phase
Logical View
User defines:
reduce(k_2, list(v_2)) => list( (k_3, v_3) )
Reducer processes one key at a time.
Important Property
The reducer sees values as an iterator, not a full list in memory. Streaming execution:
for key in sorted_keys:
    process values
Memory-efficient.
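The streaming loop above, sketched for WordCount: each key's values are consumed as an iterator, so memory stays constant per key.

```python
from itertools import groupby

def reduce_wordcount(sorted_pairs):
    # One key at a time; `group` is an iterator, never a materialized list.
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield (word, sum(v for _, v in group))

result = list(reduce_wordcount([("be", 1), ("be", 1), ("to", 1)]))
# [('be', 2), ('to', 1)]
```

Even a key with millions of values is summed in constant memory, which is why the iterator contract matters.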
5. Output Phase
The reducer writes its ( k_3, v_3 ) output pairs to HDFS.
Properties:
- One output file per reducer
- Written sequentially
- Replicated (e.g., 3 copies)
- Durable
Unlike map output, reduce output survives failures.
Failure Handling (Cross-Cutting Concern)
If Map Fails:
- Task is re-executed.
- Intermediate output recomputed.
If Reducer Fails:
- Restarted.
- Refetches partitions.
Why It Works:
- Deterministic tasks
- Side-effect free user code
- Intermediate output recomputable
Complete Pipeline (Compact Structured Form)
The Phases
0. Job Setup:
- Job submitted
- Input splits computed
- Tasks scheduled (data locality)
1. Input Splits:
- Logical division of input into splits
- Each split → one Map task
- Mapper Input: <k1, v1>
2. Map Phase:
- User function:
map(k1, v1) → list(<k2, v2>)
- Output written to in-memory buffer
2.1 Buffering & Spill:
- Buffer fills
- Records sorted by key
- Partitioned by reducer
- Optional Combiner applied
- Written to local disk as spill files
2.2 Map-Side Merge:
- Multiple spills merged
- Final sorted, partitioned map output produced
3. Shuffle and Sort:
- Partition: hash(k2) % R
- Reducers fetch partitions over network
- Reduce-side buffering
- External merge sort
- Framework groups values:
<k2, list(v2)>
4. Reduce Phase:
- User function:
reduce(k2, list(v2)) → list(<k3, v3>)
- Streaming execution per key
5. Output:
- Results written to HDFS
- One file per reducer
- Durable storage
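The phases above can be simulated end to end for WordCount in a few lines. This is a single-process sketch: real buffering, spills, network transfer, and HDFS writes are elided, and crc32 stands in for the real hash.

```python
import zlib
from itertools import groupby

R = 2  # illustrative: two reducers

def run_job(lines):
    # 2. Map phase: each line -> (word, 1) pairs.
    intermediate = [(w, 1) for line in lines for w in line.split()]
    # 3. Shuffle: partition by hash(k) % R, one bucket per reducer, then sort.
    buckets = [[] for _ in range(R)]
    for k, v in intermediate:
        buckets[zlib.crc32(k.encode()) % R].append((k, v))
    # 4. Reduce phase: each reducer groups its sorted bucket and sums.
    output = []
    for bucket in buckets:
        bucket.sort()
        for k, g in groupby(bucket, key=lambda kv: kv[0]):
            output.append((k, sum(v for _, v in g)))
    # 5. Output: one result set per reducer (sorted here for readability).
    return sorted(output)

counts = run_job(["to be or not to be", "to do"])
# [('be', 2), ('do', 1), ('not', 1), ('or', 1), ('to', 3)]
```

Every phase of the compact form above appears once, in order; only the distribution across machines is missing.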
Hadoop Terminology
Daemons (Services)
- NameNode: Master of the filesystem (HDFS). Keeps the directory tree and knows where all blocks are.
- DataNode: Worker of the filesystem. Stores the actual blocks on its HDD.
- ResourceManager (YARN): The master arbitrator of all cluster resources.
- NodeManager (YARN): The per-machine agent dealing with containers on a single node.
Job Components
- Job: A full execution unit (Mapper + Reducer + Config).
- Task: An individual slice of work (Map Task or Reduce Task) running on one machine.
- Driver: The code that submits the job and configures it.
- Partitioner: Decides which Reducer gets which key (usually hash(key) % num_reducers).
- Combiner: A "Mini-Reducer" running locally after the Map phase to reduce network traffic (e.g., pre-summing counts).