Hadoop: MapReduce

Introduction

MapReduce is a core component of the Hadoop platform. Its main function is to split large amounts of data into smaller chunks and distribute those chunks across multiple servers for massively parallel processing. This architecture lets a workload be broken into a massive number of smaller tasks whose results are later recombined into a single data set.

In concrete terms, MapReduce can split a petabyte (i.e. 10^15 bytes, or 1,000 terabytes) of data into chunks as small as 64 megabytes and work on each chunk separately, potentially on entirely separate machines. For example, suppose you have 10,000 servers and each one can process 256 MB of data at a time. If you are working with 5 TB of data, you could use MapReduce to distribute the data among the servers and process all of it simultaneously, in a fraction of the time.
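The arithmetic behind that example can be checked with a quick back-of-the-envelope calculation (the figures below come straight from the example above; decimal TB and MB are assumed):

```python
# Chunking arithmetic for the example: 5 TB of data, 10,000 servers,
# each processing 256 MB at a time (decimal units assumed).
data_bytes = 5 * 10**12          # 5 TB
chunk_bytes = 256 * 10**6        # 256 MB per server per round
servers = 10_000

chunks = -(-data_bytes // chunk_bytes)   # ceiling division: total chunks
rounds = -(-chunks // servers)           # processing rounds with 10,000 servers

print(chunks, rounds)  # 19532 2
```

With roughly 19,532 chunks spread over 10,000 servers, the whole data set is processed in just two parallel rounds.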

How Does MapReduce Work?

The MapReduce algorithm is built around two main functions: the aforementioned Map and Reduce functions.

Between the Map and Reduce functions sit two intermediate ones: the Combine and Partition functions.

The diagram below gives you an idea of how the MapReduce pipeline fits together, architecturally speaking:

MapReduce Pipeline Architecture Diagram

MapReduce Example

As a simple example of how MapReduce works, suppose your Hadoop framework is using three mappers: Mapper1, Mapper2, and Mapper3. You will also use the data in the previous diagram for this example.

To begin, the input to each mapper is a record. The mapper processes this record to produce key-value pairs, so the output of each mapper will be:

Mapper1 => <Student A, 1>, <Student B, 1>, <Student C, 1>
Mapper2 => <Student B, 1>, <Student B, 1>, <Student C, 1>
Mapper3 => <Student A, 1>, <Student C, 1>, <Student B, 1>
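The map step above can be sketched as a standalone function. Hadoop's native API is Java (a `Mapper` class with a `map` method); the Python below is only an illustration of the key-value emission, not Hadoop code:

```python
def mapper(record):
    """Emit a <student, 1> pair for every student name in the record."""
    return [(student, 1) for student in record]

# Mapper1's record from the example above.
print(mapper(["Student A", "Student B", "Student C"]))
# [('Student A', 1), ('Student B', 1), ('Student C', 1)]
```

Note that the mapper does no counting at all: it emits a 1 for every occurrence and leaves aggregation to the later phases.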

You will further assume that one combiner is assigned to each mapper: Combiner1, Combiner2, and Combiner3. Each combiner counts the occurrences of each student in its mapper's output, so the output of the combiners will be:

Combiner1 => <Student A, 1>, <Student B, 1>, <Student C, 1>
Combiner2 => <Student B, 2>, <Student C, 1>
Combiner3 => <Student A, 1>, <Student B, 1>, <Student C, 1>
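The combine step is a local pre-aggregation that runs on each mapper's output before anything is sent over the network. A minimal sketch (again in Python rather than Hadoop's Java API):

```python
from collections import Counter

def combiner(pairs):
    """Pre-aggregate one mapper's output locally, before the shuffle."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return sorted(counts.items())

# Mapper2's output from the example: Student B appears twice.
mapper2_out = [("Student B", 1), ("Student B", 1), ("Student C", 1)]
print(combiner(mapper2_out))
# [('Student B', 2), ('Student C', 1)]
```

This is why Combiner2 emits <Student B, 2>: the two <Student B, 1> pairs are merged locally, shrinking the data shuffled to the reducers.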

After this phase, the partitioner allocates and sorts the data for the reducers. The output of the partitioner will be:

Partitioner1 => <Student A, {1,1}>
Partitioner2 => <Student B, {1,2,1}>
Partitioner3 => <Student C, {1,1,1}>

If you had chosen not to use combiners, the output of the partitioner would instead be:

Partitioner1 => <Student A, {1,1}>
Partitioner2 => <Student B, {1,1,1,1}>
Partitioner3 => <Student C, {1,1,1}>
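The partition step is essentially a group-by on the key: every value emitted under the same key, from every mapper, ends up in one list destined for one reducer. A sketch of that grouping (illustrative Python, not Hadoop's Java `Partitioner` class):

```python
from collections import defaultdict

def partition(pairs):
    """Group every emitted value under its key for the reducers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

# The merged output of the three combiners in the example above.
combined = [("Student A", 1), ("Student B", 1), ("Student C", 1),
            ("Student B", 2), ("Student C", 1),
            ("Student A", 1), ("Student B", 1), ("Student C", 1)]
print(partition(combined))
# {'Student A': [1, 1], 'Student B': [1, 2, 1], 'Student C': [1, 1, 1]}
```

Feeding in the raw mapper output instead of the combined output reproduces the no-combiner case, where Student B's list is {1,1,1,1}.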

Finally, the reducer can calculate the count for each student, giving the following output:

Reducer1 => <Student A, 2>
Reducer2 => <Student B, 4>
Reducer3 => <Student C, 3>
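The reduce step for this word-count-style job is just a sum over each key's grouped values; note that the totals come out the same with or without combiners, since the combiner only pre-sums. A sketch (illustrative Python, not Hadoop's Java `Reducer` class):

```python
def reducer(key, values):
    """Sum the grouped values to get the final count per key."""
    return key, sum(values)

# The partitioner's output from the example above.
groups = {"Student A": [1, 1], "Student B": [1, 2, 1], "Student C": [1, 1, 1]}
print([reducer(k, v) for k, v in groups.items()])
# [('Student A', 2), ('Student B', 4), ('Student C', 3)]
```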
