MapReduce was introduced to process large volumes of data quickly. Managing many data processors is cumbersome: it involves parallelization and distribution, input/output scheduling, status monitoring, and tolerance of faults and crashes. MapReduce eases all of these tasks.
MapReduce is the key programming model used to distribute work around a cluster. This paradigm allows for massive scalability across hundreds or thousands of servers. Programs are parallelized automatically and executed on a large cluster of commodity machines. The run-time system takes care of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required communication between machines. This allows programmers without any experience in parallel and distributed systems to use the resources of a large distributed system easily. The Map and Reduce model, together with an associated implementation, is used for processing and generating large data sets.
The term MapReduce actually refers to two separate and distinct tasks that the programs perform.
The first task, the map job, takes a set of data and converts it into another set of data in which individual elements are broken down into key/value pairs. Because each element can be processed independently, the work scales across many computers: many small machines can process jobs that a single large machine normally could not.
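The second task, the reduce job, takes the output of the map as its input and combines it into a smaller set of pairs. The map abstraction itself takes a key/value pair as input: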
1. The key is a reference to the input value.
2. The value is the data set on which to operate.
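As a rough illustration of the map abstraction, here is a minimal word-count sketch. The function name map_word_count and the sample document are hypothetical, chosen only for this example; the key is taken to be a document name and the value its text.

```python
from typing import Iterator, Tuple


def map_word_count(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """Hypothetical map function.

    key   -- a reference to the input (here, an assumed document name)
    value -- the data to operate on (here, the document text)
    """
    for word in value.lower().split():
        # Emit one intermediate key/value pair per word occurrence.
        yield (word, 1)


pairs = list(map_word_count("doc1.txt", "the quick brown the"))
print(pairs)
```

Note that the input key is not used in the output here; in word counting only the value matters, and the emitted keys are the words themselves.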
The first step is to map: each worker node applies the map() function to its local data and writes the output to temporary storage.
The second step is to shuffle: worker nodes redistribute data based on the output keys (produced by the map() function), so that all data belonging to one key is located on the same worker node.
The third and final step is to reduce: the worker nodes now process each group of output data, per key, in parallel.
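The three steps can be sketched in a single process as follows. This is a toy model under stated assumptions, not a distributed implementation: each inner list stands in for one worker's local data, and the names map_fn, shuffle, reduce_fn and run_job are made up for the illustration.

```python
from collections import defaultdict


def map_fn(line):
    """Map: turn one line of text into (word, 1) pairs."""
    return [(word, 1) for word in line.split()]


def shuffle(worker_outputs):
    """Shuffle: bring all values that share a key together."""
    groups = defaultdict(list)
    for output in worker_outputs:
        for key, value in output:
            groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce: collapse all values for one key into a single pair."""
    return key, sum(values)


def run_job(partitions):
    # Step 1: each simulated worker maps over its own local partition
    # and keeps the result in a list standing in for temporary storage.
    temp_storage = []
    for partition in partitions:
        worker_output = []
        for line in partition:
            worker_output.extend(map_fn(line))
        temp_storage.append(worker_output)
    # Step 2: redistribute by key.
    groups = shuffle(temp_storage)
    # Step 3: reduce each group independently.
    return dict(reduce_fn(key, values) for key, values in groups.items())


result = run_job([["a b a"], ["b c"]])
print(result)
```

In a real cluster each group produced by the shuffle could be reduced on a different machine, since no group depends on another.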
The reduce abstraction starts with intermediate key/value pairs, sorted by key, and ends with finalized key/value pairs. An iterator supplies the values for a given key to the reduce function. A typical reduce function starts with a large number of key/value pairs and ends with very few.
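The reduce side can be sketched with Python's itertools.groupby, which needs its input sorted by key, much like the sorted intermediate pairs described above. The city names, temperatures and the reduce_max function are invented for the example.

```python
from itertools import groupby
from operator import itemgetter


def reduce_max(key, values):
    """Hypothetical reduce function: keep the largest value per key."""
    return key, max(values)


# Intermediate key/value pairs as they might arrive from the map phase.
intermediate = [
    ("paris", 21), ("oslo", 9), ("paris", 25), ("oslo", 12), ("cairo", 35),
]

# Sort by key so that all pairs for one key are adjacent.
intermediate.sort(key=itemgetter(0))

final = []
for key, group in groupby(intermediate, key=itemgetter(0)):
    # A generator plays the role of the iterator that feeds values to reduce.
    values = (value for _, value in group)
    final.append(reduce_max(key, values))

print(final)
```

Five intermediate pairs go in and three finalized pairs come out, one per distinct key.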
Here is how map and reduce work together: map emits information and reduce accepts it. Reduce then applies a user-defined function to reduce the amount of data.
Why is this approach better?
This approach creates an abstraction for dealing with complex overhead: the computations are simple, but the overhead is messy. Removing that overhead from user code makes programs much smaller and easier to use, and less testing is required. The MapReduce libraries can be assumed to work properly, so only the user code needs to be tested. Division of labor is also handled by the MapReduce libraries, which lets programmers focus on the actual computation.
Simplicity: Developers can write jobs in the language of their choice, such as Java, C++, or Python.
Scalability: MapReduce can process petabytes of data.
Speed: With MapReduce, parallel processing solves in hours or minutes problems that would otherwise take days.
Recovery: Data is easy to recover because copies of the same data are kept on more than one machine.
For maximum parallelism, the maps and reduces need to be stateless and must not depend on any data generated in the same MapReduce job. You cannot control the order in which the maps or the reductions run.
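A small sketch of why statelessness matters when the execution order is out of your hands. Both map functions below are hypothetical: stateless_map depends only on its argument, while StatefulMap keeps a counter between calls, so its output changes with the order in which records happen to be processed.

```python
def stateless_map(record):
    """Output depends only on the record passed in."""
    return [(word, 1) for word in record.split()]


class StatefulMap:
    """Anti-pattern: output depends on how many records came before."""

    def __init__(self):
        self.seen = 0

    def __call__(self, record):
        self.seen += 1
        return [(self.seen, record)]


def run(map_fn, records):
    """Apply map_fn to every record and return the emitted pairs, sorted."""
    return sorted(pair for record in records for pair in map_fn(record))


records = ["a b", "b c", "c a a"]
reordered = records[::-1]  # simulate a different scheduling order

stateless_same = run(stateless_map, records) == run(stateless_map, reordered)
stateful_same = run(StatefulMap(), records) == run(StatefulMap(), reordered)
print(stateless_same, stateful_same)
```

The stateless version gives the same set of pairs under either order; the stateful one does not, which is exactly the kind of dependency a MapReduce job cannot rely on.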
Repeating similar searches again and again is inefficient. A database with an index will generally answer a query faster than an MR job running over non-indexed data. The MR job has the edge only if the index would need to be regenerated whenever data is added and data is being added continually. The inefficiency can be measured in terms of CPU time and power consumed.
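The trade-off can be illustrated with a toy comparison, assuming a small in-memory list of records. The function names scan_search and build_index are made up for this sketch; the scan stands in for a job that reads every record on each query, and the dictionary stands in for a database index.

```python
from collections import defaultdict

records = ["map reduce job", "database index", "map side join"]


def scan_search(records, term):
    """Full scan: every query reads every record."""
    return [i for i, text in enumerate(records) if term in text.split()]


def build_index(records):
    """Build a word -> record-number index once, up front."""
    index = defaultdict(list)
    for i, text in enumerate(records):
        for word in set(text.split()):
            index[word].append(i)
    return index


index = build_index(records)

# Both approaches find the same records, but the indexed lookup does not
# touch the records again after the index has been built.
print(scan_search(records, "map"), index["map"])
```

If records were appended continually, build_index would have to be rerun or the index updated each time, which is the situation in which the repeated scan becomes competitive.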
In the Hadoop implementation, Reduce operations do not take place until all the Maps are completed (or have failed and been skipped). As a result, you do not get any data back until the entire mapping is finished.
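The barrier between the two phases can be modeled with Python's standard concurrent.futures module. This is only a sketch of the behavior described above, not Hadoop's API: map_task and run_with_barrier are hypothetical names, and a None chunk is used to simulate a map that fails and is skipped.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, wait


def map_task(chunk):
    """Hypothetical map task; a None chunk simulates a failing map."""
    if chunk is None:
        raise ValueError("unreadable chunk")
    return [(word, 1) for word in chunk.split()]


def run_with_barrier(chunks):
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(map_task, chunk) for chunk in chunks]
        # Barrier: block until every map has either finished or failed.
        wait(futures)

    groups = defaultdict(list)
    for future in futures:
        if future.exception() is not None:
            continue  # skip the failed map
        for key, value in future.result():
            groups[key].append(value)

    # Only now does the reduce phase run, so no output is available earlier.
    return {key: sum(values) for key, values in groups.items()}


totals = run_with_barrier(["a b", None, "b c"])
print(totals)
```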
There is a general assumption that the output of the reduce is smaller than the input to the map: you take a large data source and generate smaller final values.
MapReduce is an ideal solution for large-scale data processing. Its popularity lies in its ability to scale across thousands of servers, its easy-to-understand semantics, and its high fault tolerance.