The reduce function in a MapReduce job does not run until all the map tasks have completed, because a reducer needs every value for a key before it can process that key. Reducers may, however, start copying the intermediate key-value pairs from the mappers as soon as individual map tasks finish. In short, the copying can overlap with the map phase, but the actual reduce computation starts only once every mapper is done.
MapReduce runs on a large number of computers, known as nodes, connected via a network. In a large cluster there is always a possibility that one node will not perform as quickly as the others, which delays the task assigned to it. Speculative execution avoids this by running multiple instances of the same map task on different nodes and using the result of whichever instance finishes first.
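The idea can be sketched in Python with threads standing in for nodes. This is only an illustration under stated assumptions: map_task, run_speculatively and the delay values are hypothetical names and numbers, not part of any MapReduce framework, and a real framework would normally launch a backup copy only for a task that is already lagging and would cancel the slower copy.

```python
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def map_task(chunk, delay):
    """Simulated map task; delay stands in for how slow the node is."""
    time.sleep(delay)
    return sum(chunk)

def run_speculatively(chunk, delays):
    """Run one copy of the same task per simulated node, keep the first result."""
    pool = ThreadPoolExecutor(max_workers=len(delays))
    futures = [pool.submit(map_task, chunk, d) for d in delays]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    # Do not block on the slower copy; its result is simply ignored.
    pool.shutdown(wait=False)
    return next(iter(done)).result()

# One slow node (0.5 s) and one fast node (0.01 s) run the same task.
result = run_speculatively([1, 2, 3], delays=[0.5, 0.01])
print(result)
```

Because every copy computes the same value, it does not matter which one wins; only the completion time changes.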
Combiners are used to increase the efficiency of a MapReduce job. They help by aggregating map output locally, which reduces the amount of data that has to be transferred to the reducers. As a safe practice, a MapReduce job should never depend on the combiner being executed, since the framework does not guarantee that it will run.
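A minimal sketch of the effect, assuming a word-count job; the function names map_words and combine are illustrative only. The combiner sums counts on the mapper side, so fewer pairs have to cross the network.

```python
from collections import Counter

def map_words(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: pre-sum the counts locally, before anything is sent to reducers."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

raw_pairs = map_words("to be or not to be")
combined_pairs = combine(raw_pairs)
print(len(raw_pairs), "pairs before combining,", len(combined_pairs), "after")
```

Summation is associative and commutative, so the final counts are the same whether or not the combiner runs. That property is what makes it safe for the job not to depend on it.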
In the MapReduce framework each map function generates key-value pairs. The partition function accepts the key of each pair and returns the index of the reducer that should receive it. Typically the key is hashed and the hash is taken modulo the number of reducers.
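The hash-and-modulo scheme might look like the following sketch. The function name partition is hypothetical, and CRC32 is used here only as a convenient stable hash (Python's built-in hash() for strings varies between runs); a real framework may use a different hash function.

```python
import zlib

def partition(key, num_reducers):
    """Return the reducer index for a key: hash(key) mod num_reducers."""
    # crc32 is an assumption made for reproducibility, not a framework requirement.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

keys = ["apple", "banana", "apple", "cherry"]
assignments = [partition(k, 4) for k in keys]
print(list(zip(keys, assignments)))
```

The important property is that identical keys always map to the same reducer index, so all values for a key end up in one place.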
The input reader, as the name suggests, primarily has two functions:
1. Reading the input
2. Splitting it into sub-parts
The input reader accepts the user-supplied input and splits it into parts, each of which is then assigned to a map function. The input reader always reads data from a stable storage source only, so that the input remains available if a task has to be run again.
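As a rough illustration of the splitting step, the sketch below cuts a list of records into fixed-size chunks, one per map task. The name read_splits and the choice of splitting by record count are assumptions made for the example; real input readers usually split by byte ranges or storage blocks.

```python
def read_splits(records, split_size):
    """Yield consecutive chunks of at most split_size records, one per map task."""
    for start in range(0, len(records), split_size):
        yield records[start:start + split_size]

records = ["r1", "r2", "r3", "r4", "r5"]
splits = list(read_splits(records, 2))
print(splits)
```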
The processing can occur on data stored in a file system (unstructured) or in a database (structured). The MapReduce framework works in two main steps:
1. Map step
2. Reduce step
Map step: The master node accepts an input (the problem) and splits it into smaller sub-problems. It then distributes the sub-problems to the worker nodes, which solve them.
Reduce step: Once a worker node has solved its sub-problem, it returns the solution to the master node. The master node collects the solutions from all the worker nodes and combines them into the solution for the input it was originally given.
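The two steps can be sketched on a single machine, with a thread pool playing the role of the worker nodes. All names here (split_problem, solve_sub_problem, run_job) and the sum-of-squares problem are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def split_problem(numbers, parts):
    """Master: cut the input into roughly equal sub-problems."""
    size = max(1, -(-len(numbers) // parts))  # ceiling division
    return [numbers[i:i + size] for i in range(0, len(numbers), size)]

def solve_sub_problem(chunk):
    """Worker: solve one sub-problem (here, a partial sum of squares)."""
    return sum(n * n for n in chunk)

def run_job(numbers, workers=3):
    # Map step: split the input and hand the pieces to the workers.
    sub_problems = split_problem(numbers, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(solve_sub_problem, sub_problems))
    # Reduce step: the master recombines the workers' answers.
    return sum(partial_results)

total = run_job(list(range(1, 11)))
print(total)
```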
MapReduce is a software framework created by Google. Its prime focus is to support distributed computing on large data sets across a cluster of many computers. The framework takes its inspiration from the map and reduce functions of functional programming.
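For readers unfamiliar with the functional-programming origins, here is a small Python illustration of the two primitives; the word list is arbitrary example data.

```python
from functools import reduce

words = ["map", "reduce", "shuffle"]

# map applies a function to every element independently.
lengths = list(map(len, words))

# reduce folds the mapped values into a single result.
total_length = reduce(lambda a, b: a + b, lengths, 0)
print(lengths, total_length)
```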
Some of the shortcomings of MapReduce are:
The one-input, two-phase data flow is rigid, i.e. it does not allow for multi-step processing of records.
Being based on a procedural programming model, the framework requires custom code even for simple operations.
The map and reduce functions are opaque, which makes optimization difficult.
The MapReduce algorithm has 4 main phases:
1. Mapper
2. Combiner
3. Shuffle and sort
4. Phase output
Mappers simply execute on unsorted key/value pairs and create the intermediate keys. Once these keys are ready, the combiners group the key/value pairs that share a key. The shuffle and sort is done by the framework; its role is to group the data and transfer it. Once that is complete, the job proceeds to the phase output step, where the results are written out.
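The flow through the phases can be sketched as an in-memory word count. This is a simplified single-process illustration; the function names mapper, combiner, shuffle_and_sort and output are assumptions, and in a real framework the shuffle and sort is performed by the framework itself across machines.

```python
from collections import defaultdict

def mapper(line):
    """Mapper: emit an intermediate (word, 1) pair for every word."""
    return [(word.lower(), 1) for word in line.split()]

def combiner(pairs):
    """Combiner: merge pairs that share a key within one mapper's output."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def shuffle_and_sort(all_pairs):
    """Shuffle and sort: group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in all_pairs:
        groups[key].append(value)
    return sorted(groups.items())

def output(grouped):
    """Output: reduce each group to its final value."""
    return {key: sum(values) for key, values in grouped}

splits = ["the cat sat", "the cat ran", "a dog sat"]
combined = [pair for line in splits for pair in combiner(mapper(line))]
result = output(shuffle_and_sort(combined))
print(result)
```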
Distributed grep: The map function emits a line if it matches a pattern. The reduce function is an identity function that copies the supplied intermediate data to the output.
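A minimal single-process sketch of distributed grep follows. The names grep_map and identity_reduce and the sample log lines are illustrative assumptions.

```python
import re

def grep_map(line, pattern):
    """Map: emit the line only if it matches the pattern."""
    return [line] if re.search(pattern, line) else []

def identity_reduce(lines):
    """Reduce: copy the intermediate data to the output unchanged."""
    return list(lines)

log_lines = ["INFO start", "ERROR disk full", "INFO done", "ERROR timeout"]
emitted = [out for line in log_lines for out in grep_map(line, r"ERROR")]
matches = identity_reduce(emitted)
print(matches)
```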
Term-vector per host: The map function emits a (hostname, term vector) pair for every input document. The reduce function adds together all the term vectors for a given host, discards any infrequent terms, and emits the final (hostname, term vector) pair.
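The term-vector example might be sketched as below. Several details are assumptions made for illustration: the hostname is taken from each document's URL, a term vector is represented as a word-frequency Counter, and "infrequent" is defined by a hypothetical min_count threshold.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

def term_vector_map(url, text):
    """Map: emit (hostname, term vector) for one document."""
    return urlparse(url).hostname, Counter(text.lower().split())

def term_vector_reduce(host, vectors, min_count=2):
    """Reduce: add the term vectors for a host and drop infrequent terms."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return host, {term: n for term, n in total.items() if n >= min_count}

docs = [
    ("http://example.com/a", "map reduce map"),
    ("http://example.com/b", "reduce shuffle"),
]

# Group the mapped vectors by hostname, as the shuffle would.
by_host = defaultdict(list)
for url, text in docs:
    host, vector = term_vector_map(url, text)
    by_host[host].append(vector)

results = dict(term_vector_reduce(h, vs) for h, vs in by_host.items())
print(results)
```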