Tuning the Hive Engine for Big Data Management®

hive-site.xml

In the Administrator tool, you can tune the hive-site.xml file properties.
The following figure shows the properties of a cluster configuration on the Connections tab:
Configure the following properties in hive-site.xml:
mapred.compress.map.output
Determines whether the output of the map phase is compressed. Default is false. Informatica recommends setting this property to true for better performance.
mapred.map.output.compression.codec
Specifies the compression codec used for map output compression. Default is org.apache.hadoop.io.compress.DefaultCodec. Informatica recommends the Snappy codec, org.apache.hadoop.io.compress.SnappyCodec, for better performance.
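The two map-output compression settings above can be expressed as standard Hadoop property entries in hive-site.xml. The following is a sketch with the recommended values:

```xml
<!-- Compress map output and use the Snappy codec (recommended) -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```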
mapred.map.tasks.speculative.execution
Specifies whether map tasks can be speculatively executed. Default is true. With speculative execution, a duplicate task is spawned for any task that is not making sufficient progress. The original and speculative tasks are treated alike: whichever completes first is used, and the other is killed.
Informatica recommends keeping the default value set to true for better performance.
mapred.reduce.tasks.speculative.execution
Specifies whether reduce tasks can be speculatively executed. Default is true. This property works like speculative execution of map tasks.
Informatica recommends setting mapred.reduce.tasks.speculative.execution to false to disable the property.
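Together, the recommended speculative execution settings correspond to hive-site.xml entries such as the following sketch:

```xml
<!-- Keep speculative execution enabled for map tasks (default) -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>true</value>
</property>
<!-- Disable speculative execution for reduce tasks (recommended) -->
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```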
mapred.min.split.size and mapred.max.split.size
Use these two properties in conjunction with the dfs.block.size property. These parameters impact the number of input splits and hence the parallelism.
Informatica recommends using the following formula for each map task on a data block:
mapred.min.split.size < dfs.block.size < mapred.max.split.size
The input split size is calculated by the following formula:
max(minimumSize, min(maximumSize, blockSize))
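For example, assuming a 128 MB dfs.block.size and illustrative split bounds of 64 MB and 256 MB, the formula yields max(64 MB, min(256 MB, 128 MB)) = 128 MB, so each map task processes one full block. The corresponding hive-site.xml entries might look like this (the byte values are illustrative, not defaults):

```xml
<!-- Illustrative values: 64 MB minimum and 256 MB maximum split size,
     bracketing a 128 MB dfs.block.size -->
<property>
  <name>mapred.min.split.size</name>
  <value>67108864</value>
</property>
<property>
  <name>mapred.max.split.size</name>
  <value>268435456</value>
</property>
```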
hive.exec.compress.intermediate
Determines whether the results of intermediate map and reduce jobs in a Hive query are compressed. Default is false. Do not confuse this property with mapred.compress.map.output, which controls compression of map task output.
Informatica recommends setting hive.exec.compress.intermediate to true. The property uses the codec specified by mapred.output.compression.codec; SnappyCodec is recommended for better performance.
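These intermediate-compression recommendations can be sketched as the following hive-site.xml entries:

```xml
<!-- Compress intermediate results of map and reduce jobs in a Hive query -->
<property>
  <name>hive.exec.compress.intermediate</name>
  <value>true</value>
</property>
<!-- Codec used for the intermediate compression above (Snappy recommended) -->
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```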

Number of Reduce Tasks

You can use the following properties to determine the number of reduce tasks for a mapping:
mapred.reduce.tasks
Specifies the number of reduce tasks per job. Default is -1, which enables Hive to automatically determine the number of reducers for a job.
hive.exec.reducers.max
Determines the maximum number of reduce tasks that can be created for a job when the mapred.reduce.tasks property is set to -1.
hive.exec.reducers.bytes.per.reducer
Determines the number of reducers by specifying the size of data per reducer. Default is 1 GB. For example, with an input size of 10 GB, 10 reducers are used.
For properties related to reduce tasks, the default values work well for most mappings. However, mappings that use a Data Processor transformation together with a complex file writer might require an increased number of reduce tasks.
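As a sketch, the reducer-related settings described above correspond to hive-site.xml entries such as the following (the hive.exec.reducers.max value is illustrative, not stated in the text):

```xml
<!-- Let Hive determine the reducer count automatically -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>-1</value>
</property>
<!-- Illustrative cap on the number of reducers per job -->
<property>
  <name>hive.exec.reducers.max</name>
  <value>999</value>
</property>
<!-- 1 GB of input data per reducer (default per the text) -->
<property>
  <name>hive.exec.reducers.bytes.per.reducer</name>
  <value>1000000000</value>
</property>
```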
