The reduce function in a MapReduce job does not begin until all the map tasks have completed. Reducers may start copying the intermediate key-value pairs from the mappers as soon as individual map tasks finish, but the user-defined reduce function is invoked only once every mapper is done.
MapReduce runs on a large number of computers, known as nodes, connected via a network. In a large cluster there is always a possibility that one machine will not perform as quickly as the others, which delays the task assigned to it and therefore the whole job. Speculative execution mitigates this by running duplicate instances of the same task (for example, a map task) on different machines; the result of whichever instance finishes first is used.
Combiners are used to increase the efficiency of a MapReduce job. They reduce the amount of data that must be transferred across the network to the reducers by aggregating map output locally. As a safe practice, a MapReduce job should never depend on the combiner being executed, since the framework does not guarantee that it will run.
In the MapReduce framework, each map function emits key-value pairs. The partition function takes each key, along with the number of reducers, and returns the index of the reducer that should receive it. Typically the key is hashed and the hash is taken modulo the number of reducers.
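As a minimal sketch of the hash-and-modulo scheme described above, the following Python function maps a key to a reducer index. The function name `partition` and the choice of CRC32 as the hash are assumptions made for illustration; real frameworks use their own hash functions.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Return the index of the reducer responsible for `key`.

    CRC32 is used here (an assumption) because it is deterministic across
    processes, unlike Python's built-in hash() for strings.
    """
    if num_reducers <= 0:
        raise ValueError("num_reducers must be positive")
    return zlib.crc32(key.encode("utf-8")) % num_reducers


if __name__ == "__main__":
    # The same key always lands on the same reducer.
    for word in ["apple", "banana", "apple"]:
        print(word, "->", partition(word, 4))
```

Because the index depends only on the key, every pair with the same key reaches the same reducer, which is what allows the reducer to see all values for that key.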
The input reader, as the name suggests, has two primary functions:
1. Reading the input
2. Splitting it into sub-parts
The input reader accepts the user-supplied input and divides it into splits, each of which is assigned to a map function. To avoid problems, the input reader reads data only from a stable storage source, typically a distributed file system.
Processing can occur on data stored in a file system (unstructured) or in a database (structured). The MapReduce framework works in two main steps:
1. Map step
2. Reduce step
Map step: The master node accepts an input (the problem) and splits it into smaller sub-problems. It then distributes these sub-problems to the worker nodes, which solve them.
Reduce step: Once a worker node has solved its sub-problem, it returns the result to the master node. The master collects the results from all the worker nodes and combines them into the solution for the input that was originally provided to it.
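The split, distribute, and combine flow described in the two steps above can be sketched on a single machine. This is only an illustration under stated assumptions: the names `split`, `solve_subproblem`, and `master` are hypothetical, threads stand in for worker nodes, and the "problem" is simply a sum of squares.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split(data: List[int], parts: int) -> List[List[int]]:
    """Divide the input into at most `parts` contiguous chunks."""
    size = max(1, -(-len(data) // parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def solve_subproblem(chunk: List[int]) -> int:
    """Work done by one worker: sum of squares of its chunk."""
    return sum(x * x for x in chunk)


def master(data: List[int], workers: int = 3) -> int:
    """Split the input, hand chunks to workers, and combine their answers."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(solve_subproblem, chunks))
    # Combining step: merge the partial solutions into one result.
    return sum(partial_results)


if __name__ == "__main__":
    print(master([1, 2, 3, 4, 5], workers=2))
```

The combining step works here because addition is associative, so the partial sums can be merged in any grouping.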
MapReduce is a software framework created by Google. Its primary purpose is to support distributed computing on large data sets across clusters of many computers. The framework took its inspiration from the map and reduce functions of functional programming.
Some of the shortcomings of MapReduce are:
The one-input, two-phase data flow is rigid, i.e. it does not readily allow multi-step processing of records.
Because it is based on a procedural programming model, the framework requires custom code even for simple operations.
Because the map and reduce functions are opaque to the framework, they are difficult to optimize automatically.
The MapReduce algorithm has 4 main phases:
1. Mapper
2. Combiner
3. Shuffle and sort
4. Output phase
Mappers execute on unsorted key/value pairs and emit intermediate key/value pairs. Once these are ready, the combiners merge the intermediate values that share a key on the mapper's side. The shuffle and sort is performed by the framework itself; its role is to group the data by key and transfer it to the reducers. Once this is complete, processing proceeds to the output phase.
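The phases described above can be sketched in memory with a word count. This is a simplified illustration, not how any particular framework implements them: the function names `mapper`, `combiner`, `shuffle_and_sort`, `reducer`, and `run` are hypothetical, and each input split is just a list of lines.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.split():
        yield word, 1


def combiner(pairs):
    """Pre-aggregate one mapper's output locally, per key."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def shuffle_and_sort(per_mapper_output):
    """Merge all mapper outputs, sort by key, and group values per key."""
    all_pairs = sorted(
        (pair for output in per_mapper_output for pair in output),
        key=itemgetter(0),
    )
    return [
        (key, [value for _, value in group])
        for key, group in groupby(all_pairs, key=itemgetter(0))
    ]


def reducer(key, values):
    """Produce the final count for one key."""
    return key, sum(values)


def run(splits):
    """Run all four phases over a list of splits (each a list of lines)."""
    combined = [
        combiner(pair for line in split for pair in mapper(line))
        for split in splits
    ]
    return [reducer(key, values) for key, values in shuffle_and_sort(combined)]


if __name__ == "__main__":
    print(run([["a b a"], ["b a"]]))
```

In this sketch the combiner shrinks the first split's three pairs to two before the shuffle; the final counts would be the same without it, which is why a job should not depend on it running.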
Distributed grep: The map function emits a line if it matches a given pattern. The reduce function is an identity function that copies the supplied intermediate data to the output.
Term-vector per host: The map function emits a (hostname, term vector) pair for each input document. The reduce function adds together all the term vectors for a given host, discards any infrequent terms, and emits a final (hostname, term vector) pair.
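The distributed grep example above can be sketched as follows. The names `grep_map`, `identity_reduce`, and `distributed_grep` are hypothetical, and the splits are processed sequentially here rather than on separate machines.

```python
import re
from typing import Iterable, Iterator, List


def grep_map(lines: Iterable[str], pattern: str) -> Iterator[str]:
    """Map function: emit each line that matches the pattern."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line


def identity_reduce(intermediate: Iterable[str]) -> List[str]:
    """Reduce function: copy the intermediate data to the output unchanged."""
    return list(intermediate)


def distributed_grep(splits: List[List[str]], pattern: str) -> List[str]:
    """Apply the map function to every split, then the identity reduce."""
    matches: List[str] = []
    for split in splits:
        matches.extend(grep_map(split, pattern))
    return identity_reduce(matches)


if __name__ == "__main__":
    logs = [["error: disk", "ok"], ["warn", "error: net"]]
    print(distributed_grep(logs, r"^error"))
```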
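In MapReduce the map phase is subdivided into M pieces and the reduce phase into R pieces. Ideally M and R are much larger than the number of worker machines, so that each worker is assigned many different tasks. This improves dynamic load balancing and also speeds up recovery when a worker fails, because the tasks it had completed can be spread across the other workers.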
The MapReduce framework supports several different input and output types.
For example, in text mode each line is treated as a key/value pair: the key is the offset of the line from the beginning of the file and the value is the contents of the line. The choice of type is up to the user, who can also add support for new input and output types.
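The text-mode input described above can be sketched as a small reader that turns raw bytes into (offset, line) pairs. The name `read_text_records` is hypothetical, and the sketch assumes UTF-8 input held entirely in memory.

```python
from typing import Iterator, Tuple


def read_text_records(data: bytes) -> Iterator[Tuple[int, str]]:
    """Yield (byte offset of line start, line contents) for each line."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The value excludes the line terminator; the offset still counts it.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


if __name__ == "__main__":
    for key, value in read_text_records(b"first\nsecond\nthird"):
        print(key, value)
```

A reader for a new input type would follow the same shape: consume the raw input and yield key/value pairs for the map function.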
A scarce resource is one that is available to the system only in limited quantities. In MapReduce, network bandwidth is a scarce resource. It is conserved by storing data on the local disks and in the memory of machines in the cluster during tasks. The scheduler takes the location of the input files into account and aims to schedule each task on a machine that already holds its input data.
In a MapReduce job the master pings each worker periodically. If a worker does not respond within a certain amount of time, the master marks it as failed. Any tasks in progress on that worker are rescheduled, and even completed map tasks are re-executed, because their output was stored on the local disk of the failed worker and is no longer accessible. MapReduce can therefore handle large-scale failures simply by re-executing tasks. The master node periodically writes checkpoints of its own state, and in case of failure a new master can be restarted from the last checkpoint.
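The failure handling described above can be sketched with a small bookkeeping model. Everything here is an assumption for illustration: the `Task` and `Master` classes, the state names, and the explicit `now` timestamps (passed in rather than read from a clock so the logic is easy to follow).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    task_id: str
    kind: str             # "map" or "reduce"
    state: str = "idle"   # "idle", "in_progress", or "completed"
    worker: str = ""      # worker the task is (or was) assigned to


@dataclass
class Master:
    timeout: float
    last_seen: Dict[str, float] = field(default_factory=dict)
    tasks: List[Task] = field(default_factory=list)

    def record_ping_reply(self, worker: str, now: float) -> None:
        """Remember when a worker last answered a ping."""
        self.last_seen[worker] = now

    def check_workers(self, now: float) -> List[str]:
        """Mark silent workers as failed and reset their tasks to idle."""
        failed = [w for w, seen in self.last_seen.items()
                  if now - seen > self.timeout]
        for worker in failed:
            del self.last_seen[worker]
            for task in self.tasks:
                if task.worker != worker:
                    continue
                # In-progress tasks are lost. Completed map tasks are also
                # redone, since their output sat on the failed local disk.
                if task.state == "in_progress" or (
                    task.state == "completed" and task.kind == "map"
                ):
                    task.state = "idle"
                    task.worker = ""
        return failed
```

Tasks reset to idle become eligible for scheduling on another worker; the scheduling itself is left out of this sketch.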
The MapReduce framework embodies most of the key architectural principles of cloud computing, such as:
Scale: The framework can scale out roughly in proportion to the number of machines available.
Reliable: The framework can compensate for a lost node by restarting its tasks on a different node.
Affordable: A user can start small and over time can add more hardware.
Due to the above features, the MapReduce framework has become a popular platform for the development of cloud applications.