Performance Tuning Guidelines to Read Data from or Write Data to Amazon S3

hive - hive-site.xml

Configure the following properties in the hive-site.xml file located at:
$INFA_HOME/services/shared/Hadoop/<distr>/conf/hive-site.xml
mapred.compress.map.output
Determines whether the output of the map phase is compressed. Default is false. Informatica recommends setting this parameter to true for better performance.
mapred.map.output.compression.codec
Specifies the compression codec used for map output compression. Default is org.apache.hadoop.io.compress.DefaultCodec. Informatica recommends using the Snappy codec, org.apache.hadoop.io.compress.SnappyCodec, for better performance.
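For reference, a minimal hive-site.xml sketch that applies both compression recommendations (the property names and values are those described above):

<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
  <description>Compress the output of the map phase.</description>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  <description>Use the Snappy codec for map output compression.</description>
</property>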
mapred.map.tasks.speculative.execution
Specifies whether map tasks can run speculatively. Default is true. With speculative execution, duplicate copies of map tasks that do not make progress are spawned. The original and speculative tasks are treated alike: whichever task completes first is used, and the other is killed.
Informatica recommends keeping the default value of true for better performance.
mapred.reduce.tasks.speculative.execution
Specifies whether reduce tasks can run speculatively. Default is true. The mapred.reduce.tasks.speculative.execution property works in the same way as map task speculative execution. Informatica recommends setting mapred.reduce.tasks.speculative.execution to false to disable speculative reduce tasks.
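A sketch of both speculative execution settings with the recommended values (map tasks left at the default of true, reduce tasks disabled):

<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>true</value>
  <description>Keep the default: spawn speculative copies of slow map tasks.</description>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
  <description>Disable speculative execution for reduce tasks.</description>
</property>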
mapred.min.split.size and mapred.max.split.size
Use the mapred.min.split.size and mapred.max.split.size properties in conjunction with the dfs.block.size property. These properties affect the number of input splits, and therefore the degree of parallelism.
Informatica recommends that the following relationship holds so that each map task processes one data block:
mapred.min.split.size < dfs.block.size < mapred.max.split.size
Calculate the input split size with the following formula:
max(minimumSize, min(maximumSize, blockSize))
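For example, assuming a hypothetical block size of 128 MB (134217728 bytes), a minimum split size of 64 MB and a maximum split size of 256 MB satisfy the relationship above; the formula then yields max(67108864, min(268435456, 134217728)) = 134217728 bytes, so each map task processes exactly one block. A sketch of the corresponding hive-site.xml entries (the byte values are illustrative, not recommendations):

<!-- Assumes dfs.block.size is 134217728 (128 MB) in the cluster configuration. -->
<property>
  <name>mapred.min.split.size</name>
  <value>67108864</value> <!-- 64 MB, below the block size -->
</property>
<property>
  <name>mapred.max.split.size</name>
  <value>268435456</value> <!-- 256 MB, above the block size -->
</property>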
hive.exec.compress.intermediate
Determines whether the results of intermediate map and reduce jobs in a Hive query are compressed. Default is false. The hive.exec.compress.intermediate property is not the same as mapred.compress.map.output, which controls compression of the map task output.
Informatica recommends setting the hive.exec.compress.intermediate property to true. hive.exec.compress.intermediate uses the same codec specified by mapred.output.compression.codec. Informatica recommends SnappyCodec for better performance.
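A sketch of the intermediate compression settings with the recommended values:

<property>
  <name>hive.exec.compress.intermediate</name>
  <value>true</value>
  <description>Compress intermediate map and reduce results of Hive queries.</description>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  <description>Codec applied to the compressed intermediate results.</description>
</property>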
To determine the number of reduce tasks for a mapping, you can set the following properties in the hive-site.xml file:
mapred.reduce.tasks
Specifies the number of reduce tasks for a job. Default is -1, which enables Hive to automatically determine the number of reducers for a job.
hive.exec.reducers.max
Determines the maximum number of reduce tasks that can be created for a job when you set the mapred.reduce.tasks property to -1.
hive.exec.reducers.bytes.per.reducer
Specifies the amount of data that each reducer processes, which Hive uses to calculate the number of reducers. Default is 1 GB. For example, with the default value and an input size of 10 GB, Hive uses 10 reducers.
For properties related to reduce tasks, the default values work well for most mappings. However, for mappings that use a Data Processor transformation and a complex file writer, you might need to adjust these properties to increase the number of reduce tasks.
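For illustration, a sketch of the three reduce task properties; the numeric values for the maximum reducers and bytes per reducer are hypothetical examples, not recommendations:

<property>
  <name>mapred.reduce.tasks</name>
  <value>-1</value>
  <description>-1 lets Hive determine the number of reducers automatically.</description>
</property>
<property>
  <name>hive.exec.reducers.max</name>
  <value>999</value> <!-- hypothetical upper bound on reducers -->
</property>
<property>
  <name>hive.exec.reducers.bytes.per.reducer</name>
  <value>1073741824</value> <!-- 1 GB of input data per reducer -->
</property>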
