You can optimize the Hadoop environment and the Hadoop cluster to increase performance.
You can optimize the Hadoop environment and the Hadoop cluster in the following ways:
Configure a highly available Hadoop cluster
You can configure the Data Integration Service and the Developer tool to read from and write to a highly available Hadoop cluster. The steps to configure a highly available Hadoop cluster depend on the type of Hadoop distribution. For more information about configuration steps for a Hadoop distribution, see the
Informatica Big Data Management Hadoop Integration Guide
Compress data on temporary staging tables
You can enable data compression on temporary staging tables to increase mapping performance.
Run mappings on the Blaze engine
Run mappings on the highly available Blaze engine. The Blaze engine enables restart and recovery of grid tasks and tasklets by default.
Perform parallel sorts
When you use a Sorter transformation in a mapping, the Data Integration Service enables parallel sorting by default when it pushes the mapping logic to the Hadoop cluster. Parallel sorting improves mapping performance with some restrictions.
Partition Joiner transformations
When you use a Joiner transformation in a Blaze engine mapping, the Data Integration Service can apply map-side join optimization to improve mapping performance. The Data Integration Service applies map-side join optimization if the master table is smaller than the detail table. When the Data Integration Service applies map-side join optimization, it moves the data to the Joiner transformation without the cost of shuffling the data.
Truncate partitions in a Hive target
You can truncate partitions in a Hive target to increase performance. To truncate partitions in a Hive target, you must choose to both truncate the partition in the Hive target and truncate the target table. You can enable data compression on temporary staging tables to optimize performance.