Monday, 17 August 2015

MapReduce

The MapReduce engine consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware filesystem, the JobTracker knows which node contains the data and which other machines are nearby. If the work cannot be hosted on the node where the data resides, priority is given to nodes in the same rack, which reduces traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns a separate Java Virtual Machine process so that the TaskTracker itself does not fail if the running job crashes its JVM. The TaskTracker sends periodic heartbeats to the JobTracker to report its status. The JobTracker and TaskTracker status and information are exposed by Jetty and can be viewed from a web browser.
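
To make the client-to-JobTracker handoff concrete, here is a minimal driver sketch using the classic org.apache.hadoop.mapred API from Hadoop 1.x (the JobTracker era). The class name, the input/output paths, and the use of the bundled IdentityMapper/IdentityReducer are illustrative choices, not taken from this article.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SubmitJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitJob.class);
        conf.setJobName("pass-through");

        // The identity classes shipped with Hadoop simply copy input to output;
        // a real job would plug in its own Mapper and Reducer here.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);

        // TextInputFormat produces (LongWritable offset, Text line) pairs.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Paths are placeholders; runJob() submits the job to the JobTracker
        // and blocks until it completes.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}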
The Hadoop Map/Reduce framework harnesses a cluster of machines and executes user-defined Map/Reduce jobs across the nodes in the cluster. A Map/Reduce computation has two phases, a map phase and a reduce phase. The input to the computation is a data set of key/value pairs. Tasks in each phase are executed in a fault-tolerant manner: if nodes fail in the middle of a computation, the tasks assigned to them are redistributed among the remaining nodes. Having many map and reduce tasks enables good load balancing and allows failed tasks to be re-run with small runtime overhead.
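
To make the two phases concrete, here is a sketch of the standard word-count example written against the same classic org.apache.hadoop.mapred API; the class and variable names are illustrative, not from this article. The map phase turns lines into (word, 1) pairs, and the reduce phase sums the values grouped under each unique word.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map phase: emit a (word, 1) key/value pair for every token in the line.
class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            output.collect(word, ONE);
        }
    }
}

// Reduce phase: sum the counts grouped under each unique word.
class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterator<IntWritable> counts,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (counts.hasNext()) {
            sum += counts.next().get();
        }
        output.collect(word, new IntWritable(sum));
    }
}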

Hadoop Map/Reduce – Goals:
• Process large data sets
• Cope with hardware failure
• High throughput

Execution overview:

1. The MapReduce library in the user program first splits the input files into M pieces, typically 16 MB to 64 MB per piece. It then starts up many copies of the program on a cluster of machines.
2. One of the copies of the program is the master; the rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
3. A worker assigned a map task reads the contents of its input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions. The locations of these pairs on the local disk are passed back to the master, who forwards them to the reduce workers.
5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls (RPC) to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.
6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
7. When all map tasks and reduce tasks have completed, the master wakes up the user program: the MapReduce call in the user program returns to the user code. The output of the MapReduce execution is available in the R output files (one per reduce task).
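
The following toy, single-process Java sketch mirrors these seven steps in memory: three hard-coded "splits", R = 2 partitions, hash partitioning of intermediate (word, 1) pairs, sorted grouping per partition, and one output listing per reduce partition. It is only an illustration of the flow, not Hadoop code; all names and values in it are made up for the example.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceWalkthrough {
    public static void main(String[] args) {
        // Step 1: the "input files" are pre-split into M = 3 pieces.
        List<String> splits = Arrays.asList("the cat", "the dog", "cat and dog");
        int R = 2; // number of reduce partitions / output files

        // Steps 2-4: each map task emits (word, 1) pairs and partitions them
        // by hash(key) mod R; a TreeMap per partition stands in for the
        // sort-by-key step (step 5).
        List<Map<String, List<Integer>>> partitions = new ArrayList<>();
        for (int r = 0; r < R; r++) {
            partitions.add(new TreeMap<>());
        }
        for (String split : splits) {
            for (String word : split.split("\\s+")) {
                int r = (word.hashCode() & Integer.MAX_VALUE) % R;
                partitions.get(r)
                          .computeIfAbsent(word, k -> new ArrayList<>())
                          .add(1);
            }
        }

        // Step 6: each reduce task folds the grouped values for every unique key.
        // Step 7: one output "file" per reduce partition.
        for (int r = 0; r < R; r++) {
            System.out.println("part-" + r);
            partitions.get(r).forEach((word, ones) ->
                    System.out.println("  " + word + "\t" + ones.size()));
        }
    }
}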
