User Guide

Optimization for the Hadoop Environment

Optimize the Hadoop environment to increase performance.

You can optimize the Hadoop environment in the following ways:

Configure a highly available Hadoop cluster: You can configure the Data Integration Service and the Developer tool to read from and write to a highly available Hadoop cluster. The steps to configure a highly available Hadoop cluster depend on the type of Hadoop distribution. For more information about configuration steps for a Hadoop distribution, see the Informatica Big Data Management Integration Guide.

Compress data on temporary staging tables: You can enable data compression on temporary staging tables to increase mapping performance.

Run mappings on the Blaze engine: Run mappings on the highly available Blaze engine. The Blaze engine enables restart and recovery of grid tasks and tasklets by default.

Perform parallel sorts: When you use a Sorter transformation in a mapping, the Data Integration Service enables parallel sorting by default when it pushes the mapping logic to the Hadoop cluster. Parallel sorting improves mapping performance.

Partition Joiner transformations: When you use a Joiner transformation in a Blaze engine mapping, the Data Integration Service can apply map-side join optimization to improve mapping performance. The Data Integration Service applies map-side join optimization if the master table is smaller than the detail table. When the Data Integration Service applies map-side join optimization, it moves the data to the Joiner transformation without the cost of shuffling the data.

Truncate partitions in a Hive target: You can truncate partitions in a Hive target to increase performance. To truncate partitions in a Hive target, you must choose both to truncate the partition in the Hive target and to truncate the target table.

Assign resources on Hadoop clusters: You can use schedulers to assign resources on a Hadoop cluster. You can use a capacity scheduler or a fair scheduler, depending on the needs of your organization.

Configure YARN queues to share resources on Hadoop clusters: You can configure YARN queues to redirect jobs on the Hadoop cluster to specific queues. The queue to which a job is assigned defines the resources that are allocated to perform the job.

Label nodes in a Hadoop cluster: You can label nodes in a Hadoop cluster to divide the cluster into partitions that have specific characteristics.

Optimize Sqoop mappings on the Spark engine: The Data Integration Service can optimize the performance of Sqoop pass-through mappings that run on the Spark engine.

Enable big data job recovery: You can enable big data job recovery to recover mapping jobs that the Data Integration Service pushes to the Spark engine for processing.
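
To illustrate the idea behind map-side join optimization for Joiner transformations, the following Python sketch shows a generic hash join in which the smaller master table is held in memory and the larger detail table is streamed past it. It is a conceptual illustration only and not the implementation that the Data Integration Service or the Blaze engine uses. The function name map_side_join, the sample tables, and the dept_id key are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator

Row = Dict[str, object]


def map_side_join(master: Iterable[Row], detail: Iterable[Row], key: str) -> Iterator[Row]:
    """Inner-join detail rows against an in-memory lookup built from the master rows.

    Hypothetical helper for illustration; not an Informatica API.
    """
    # Build the lookup once from the smaller (master) table.
    lookup = defaultdict(list)
    for m in master:
        lookup[m[key]].append(m)

    # Stream the larger (detail) table. Each row is matched where it is read,
    # so the detail rows never need to be redistributed (shuffled) by key.
    for d in detail:
        for m in lookup.get(d[key], []):
            yield {**m, **d}


# Sample data (assumed for this sketch): a small master table and a larger detail table.
master_rows = [
    {"dept_id": 1, "dept": "Sales"},
    {"dept_id": 2, "dept": "Ops"},
]
detail_rows = [
    {"dept_id": 1, "emp": "Ana"},
    {"dept_id": 2, "emp": "Bo"},
    {"dept_id": 1, "emp": "Cy"},
    {"dept_id": 3, "emp": "Di"},  # no matching master row, so it is dropped
]

joined = list(map_side_join(master_rows, detail_rows, "dept_id"))
for row in joined:
    print(row)
```

The sketch works only when the master table fits in memory, which is why the optimization depends on the master table being the smaller of the two inputs.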