Performance Tuning Guidelines to Read Data from or Write Data to Amazon S3

hive - hive-site.xml

Configure the following properties in the hive-site.xml file located at:
$INFA_HOME/services/shared/Hadoop/<distr>/conf/hive-site.xml
mapred.compress.map.output
Determines whether the output of the map phase is compressed. Default is false. Informatica recommends setting this parameter to true for better performance.
mapred.map.output.compression.codec
Specifies the compression codec used for map output compression. Default is org.apache.hadoop.io.compress.DefaultCodec. Informatica recommends using the Snappy codec, org.apache.hadoop.io.compress.SnappyCodec, for better performance.
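For reference, a minimal hive-site.xml sketch that applies both compression recommendations (the property names and values are those described above):

<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
  <description>Compress the output of the map phase.</description>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  <description>Use the Snappy codec for map output compression.</description>
</property>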
mapred.map.tasks.speculative.execution
Specifies whether map tasks can run speculatively. Default is true. With speculative execution, duplicate copies of map tasks that do not make progress are spawned. The original and speculative tasks are treated alike: whichever task completes first is used, and the other is killed.
Informatica recommends keeping the default value of true for better performance.
mapred.reduce.tasks.speculative.execution
Specifies whether reduce tasks can run speculatively. Default is true. The mapred.reduce.tasks.speculative.execution property works in the same way as map task speculative execution. Informatica recommends setting mapred.reduce.tasks.speculative.execution to false to disable speculative reduce tasks.
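A sketch of both speculative execution settings with the recommended values (map tasks left at the default of true, reduce tasks disabled):

<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>true</value>
  <description>Keep the default: spawn speculative copies of slow map tasks.</description>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
  <description>Disable speculative execution for reduce tasks.</description>
</property>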
mapred.min.split.size and mapred.max.split.size
Use the mapred.min.split.size and mapred.max.split.size properties in conjunction with the dfs.block.size property. These properties affect the number of input splits, and therefore the degree of parallelism.
Informatica recommends that the following relationship holds so that each map task processes one data block:
mapred.min.split.size < dfs.block.size < mapred.max.split.size
Calculate the input split size with the following formula:
max(minimumSize, min(maximumSize, blockSize))
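For example, assuming a hypothetical block size of 128 MB (134217728 bytes), a minimum split size of 64 MB and a maximum split size of 256 MB satisfy the relationship above; the formula then yields max(67108864, min(268435456, 134217728)) = 134217728 bytes, so each map task processes exactly one block. A sketch of the corresponding hive-site.xml entries (the byte values are illustrative, not recommendations):

<!-- Assumes dfs.block.size is 134217728 (128 MB) in the cluster configuration. -->
<property>
  <name>mapred.min.split.size</name>
  <value>67108864</value> <!-- 64 MB, below the block size -->
</property>
<property>
  <name>mapred.max.split.size</name>
  <value>268435456</value> <!-- 256 MB, above the block size -->
</property>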
hive.exec.compress.intermediate
Determines whether the results of intermediate map and reduce jobs in a Hive query are compressed. Default is false. The hive.exec.compress.intermediate property is not the same as mapred.compress.map.output, which controls compression of the map task output.
Informatica recommends setting the hive.exec.compress.intermediate property to true. hive.exec.compress.intermediate uses the same codec specified by mapred.output.compression.codec. Informatica recommends SnappyCodec for better performance.
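A sketch of the intermediate compression settings with the recommended values:

<property>
  <name>hive.exec.compress.intermediate</name>
  <value>true</value>
  <description>Compress intermediate map and reduce results of Hive queries.</description>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  <description>Codec applied to the compressed intermediate results.</description>
</property>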
To determine the number of reduce tasks for a mapping, you can set the following properties in the hive-site.xml file:
mapred.reduce.tasks
Specifies the number of reduce tasks for a job. Default is -1, which enables Hive to automatically determine the number of reducers for a job.
hive.exec.reducers.max
Determines the maximum number of reduce tasks that can be created for a job when you set the mapred.reduce.tasks property to -1.
hive.exec.reducers.bytes.per.reducer
Specifies the amount of data that each reducer processes, which Hive uses to calculate the number of reducers. Default is 1 GB. For example, with the default value and an input size of 10 GB, Hive uses 10 reducers.
For properties related to reduce tasks, the default values work well for most mappings. However, for mappings that use a Data Processor transformation and a complex file writer, you might need to adjust these properties to increase the number of reduce tasks.
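For illustration, a sketch of the three reduce task properties; the numeric values for the maximum reducers and bytes per reducer are hypothetical examples, not recommendations:

<property>
  <name>mapred.reduce.tasks</name>
  <value>-1</value>
  <description>-1 lets Hive determine the number of reducers automatically.</description>
</property>
<property>
  <name>hive.exec.reducers.max</name>
  <value>999</value> <!-- hypothetical upper bound on reducers -->
</property>
<property>
  <name>hive.exec.reducers.bytes.per.reducer</name>
  <value>1073741824</value> <!-- 1 GB of input data per reducer -->
</property>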
