Tuning the Hive Engine for Big Data Management®

Case Study: Increasing the Number of Reduce Tasks

This example demonstrates the performance benefit of increasing the value of the 'mapred.reduce.tasks' property.

JSON R2H Mapping

The following image shows the mapping:
The mapping reads two files, Name and Address, that contain normalized data. It uses the Data Processor transformation to convert the relational input from the two source files to a denormalized JSON format, and then uses a complex file writer to write the result to an HDFS flat file target.
The Name file is 23 GB and the Address file is 40 GB.
The Data Processor conversion of relational data to a hierarchical JSON object happens during the reduce phase. With the default value of '-1' for 'mapred.reduce.tasks', Hive estimates the reducer count from the input size, allocating roughly one reducer per GB of input. Because the combined input size is about 63 GB, approximately 63 reduce tasks are created.
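The estimate above can be sketched in a few lines. This is an approximation of Hive's sizing logic, assuming the default of about 1 GB per reducer (governed by 'hive.exec.reducers.bytes.per.reducer'; the exact default and the reducer cap vary by Hive version):

```python
GB = 1024 ** 3

def estimated_reducers(input_bytes, bytes_per_reducer=1 * GB, max_reducers=1009):
    """Approximate Hive's reducer estimate when mapred.reduce.tasks is -1:
    ceil(input size / bytes per reducer), clamped to a configured maximum."""
    estimate = -(-input_bytes // bytes_per_reducer)  # ceiling division
    return min(max(estimate, 1), max_reducers)

# The two source files in this case study: Name (23 GB) and Address (40 GB).
combined = (23 + 40) * GB
print(estimated_reducers(combined))  # 63
```

This is why the mapping ran with about 63 reducers by default: the estimate scales only with input size, not with how CPU-intensive each reducer's work is.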
The Data Processor transformation is CPU intensive. Therefore, increasing the number of reduce tasks can improve performance.
To improve performance, the 'mapred.reduce.tasks' property was changed from '-1' to '210'.
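As a sketch, the override can be applied at the Hive session level (in Big Data Management it would typically be set through the Hadoop advanced properties for the mapping or connection; the exact mechanism depends on the product version):

```sql
-- Override the default of -1 so Hive no longer estimates the count from input size:
SET mapred.reduce.tasks=210;
```

Note that 'mapred.reduce.tasks' is the older property name; newer Hadoop releases use 'mapreduce.job.reduces' for the same setting.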

Result

Increasing the number of reduce tasks resulted in a 56% improvement in performance.
