
By default, a MapReduce job runs only one reduce task, so it writes a single output file, part-00000, containing all keys.

When the data is huge, the reduce phase of a MapReduce job can become extremely slow; we can speed it up by increasing the number of reduce tasks. By setting mapreduce.job.reduces = num (or calling job.setNumReduceTasks(num) in the driver, as in the sketch below), the output becomes part-00000, part-00001, part-00002, ..., one file per reduce task.
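A minimal driver sketch, assuming a word-count-style job; the class names MultiReducerDriver, MyMapper, and MyReducer, the reducer count of 8, and the paths taken from args are all placeholders, not part of the original text:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiReducerDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // same effect as passing -D mapreduce.job.reduces=8 on the command line
        conf.setInt("mapreduce.job.reduces", 8);

        Job job = Job.getInstance(conf, "word count with 8 reducers");
        job.setJarByClass(MultiReducerDriver.class);
        job.setMapperClass(MyMapper.class);    // hypothetical mapper class
        job.setReducerClass(MyReducer.class);  // hypothetical reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // the programmatic way to set the reducer count; with 8 reducers the job
        // writes 8 output files (part-00000... with the old mapred API,
        // part-r-00000... with the newer mapreduce API)
        job.setNumReduceTasks(8);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```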

Since the shuffle partitions records by key, the same key always goes to the same partition, so it is guaranteed that no key ever appears in more than one part-* file.
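To illustrate why, here is a partitioner equivalent to Hadoop's default HashPartitioner (the Text/Text types are just an example): the partition index depends only on the key's hash, so every record with a given key is routed to the same reduce task and therefore the same part file.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Equivalent to the default HashPartitioner: partition is a pure function
// of the key, so identical keys always land in the same reducer.
public class KeyHashPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // mask out the sign bit so the result is always non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```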

If we need to merge all the outputs into one final file, we can run one more MapReduce job over these outputs, this time with a single reduce task so that everything ends up in one file.
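A sketch of such a merge pass, assuming the previous job wrote tab-separated text key/value pairs; the class name MergeOutputsDriver and the args-based paths are placeholders. The base Mapper and Reducer classes act as identity functions, so the job only reshuffles the existing records, and the single reducer forces one output file.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeOutputsDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "merge part files");
        job.setJarByClass(MergeOutputsDriver.class);

        // identity mapper and reducer: records pass through unchanged
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        // a single reducer means a single merged output file
        job.setNumReduceTasks(1);

        // assumes the previous job's output is tab-separated text
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // dir containing part-* files
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // merged output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that the single reducer makes this extra pass a serial bottleneck, so it is only worth running when one physical file is actually required.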