Optimize the Hadoop environment to increase performance.
You can optimize the Hadoop environment in the following ways:
Configure a highly available Hadoop cluster.
You can configure the Data Integration Service and the Developer tool to read from and write to a highly available Hadoop cluster. The steps to configure a highly available Hadoop cluster depend on the type of Hadoop distribution. For more information about configuration steps for a Hadoop distribution, see the Informatica Big Data Management Hadoop Integration Guide.
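For example, a highly available cluster typically exposes a logical HDFS nameservice instead of a single NameNode host. The following PySpark sketch reads through such a nameservice URI; the nameservice name (mycluster) and the file path are placeholder assumptions, and the automatic failover comes from the client-side hdfs-site.xml configuration, not from the code:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ha_read_example").getOrCreate()

    # The logical nameservice ID (mycluster) resolves to the active NameNode.
    # The HDFS client fails over automatically if the active NameNode changes.
    df = spark.read.text("hdfs://mycluster/data/input/orders.txt")
    df.show(5)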
Compress data on temporary staging tables.
You can enable data compression on temporary staging tables to increase mapping performance.
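In Big Data Management, you enable staging compression on the Hadoop connection. As a rough sketch of the underlying cluster properties, the following PySpark session sets the standard Hive and Hadoop intermediate-compression settings; the Snappy codec is an example choice, not a requirement:

    from pyspark.sql import SparkSession

    # Enable compression of intermediate (staging) data. The codec is an
    # example; any codec available on the cluster works.
    spark = (SparkSession.builder
        .appName("staging_compression_example")
        .config("hive.exec.compress.intermediate", "true")
        .config("mapreduce.map.output.compress", "true")
        .config("mapreduce.map.output.compress.codec",
                "org.apache.hadoop.io.compress.SnappyCodec")
        .enableHiveSupport()
        .getOrCreate())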
Run mappings on the Blaze engine.
The Blaze engine is highly available and enables restart and recovery of grid tasks and tasklets by default.
Perform parallel sorts.
When you use a Sorter transformation in a mapping, the Data Integration Service enables parallel sorting by default when it pushes the mapping logic to the Hadoop cluster. Parallel sorting improves mapping performance.
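Conceptually, a parallel sort range-partitions the rows by the sort key and then sorts each partition locally, so the partitions together form one globally ordered data set. The following PySpark sketch shows the same pattern outside of Big Data Management; the table name, sort key, and partition count are placeholder assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parallel_sort_example").getOrCreate()

    df = spark.table("orders")  # placeholder table

    # Range-partition by the sort key, then sort each partition in parallel.
    sorted_df = (df.repartitionByRange(8, "order_date")
                   .sortWithinPartitions("order_date"))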
Partition Joiner transformations.
When you use a Joiner transformation in a mapping that runs on the Blaze engine, the Data Integration Service can apply map-side join optimization to improve mapping performance. The Data Integration Service applies map-side join optimization if the master table is smaller than the detail table. When it applies map-side join optimization, the Data Integration Service moves the data to the Joiner transformation without the cost of a shuffle.
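The same technique appears in open-source engines as a broadcast join: the smaller table is copied to every node so that the larger table joins in place, with no shuffle. A minimal PySpark sketch, assuming placeholder table names and join key:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("map_side_join_example").getOrCreate()

    master = spark.table("dim_products")  # smaller table (placeholder)
    detail = spark.table("fact_sales")    # larger table (placeholder)

    # Broadcasting the smaller master table ships it to every executor,
    # so the large detail table is joined without shuffling its rows.
    joined = detail.join(broadcast(master), "product_id")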
Truncate partitions in a Hive target.
You can truncate partitions in a Hive target to increase performance. To truncate partitions in a Hive target, enable both the option to truncate the partition in the Hive target and the option to truncate the target table.
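At the Hive level, truncating a single partition looks like the following sketch; the table name and partition key are placeholder assumptions, and in Big Data Management you set the equivalent options on the target rather than issuing the statement yourself:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("truncate_partition_example")
        .enableHiveSupport()
        .getOrCreate())

    # Remove only the rows in one partition of a partitioned Hive table.
    spark.sql("TRUNCATE TABLE sales PARTITION (country = 'US')")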
Assign resources on Hadoop clusters.
You can use schedulers to assign resources on a Hadoop cluster. You can use a capacity scheduler or a fair scheduler depending on the needs of your organization.
Configure YARN queues to share resources on Hadoop clusters.
You can configure YARN queues to redirect jobs on the Hadoop cluster to specific queues. The queue where a job is assigned defines the resources that are allocated to perform the job.
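For example, a Spark application can be directed to a specific queue when it is submitted. In the following sketch, the queue name is a placeholder for a queue that your scheduler defines:

    from pyspark.sql import SparkSession

    # Run this application in a specific YARN queue. The queue name
    # (etl_queue) is a placeholder defined by the cluster scheduler.
    spark = (SparkSession.builder
        .appName("queue_example")
        .config("spark.yarn.queue", "etl_queue")
        .getOrCreate())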
Label nodes in a Hadoop cluster.
You can label nodes in a Hadoop cluster to divide the cluster into partitions that have specific characteristics.
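After nodes carry labels, an application can request to run only on matching nodes through a node label expression. The following Spark-on-YARN sketch uses the label highmem as a placeholder for a label that an administrator has already created with the yarn rmadmin commands:

    from pyspark.sql import SparkSession

    # Pin the application master and executors to nodes with a given label.
    # The label (highmem) is a placeholder.
    spark = (SparkSession.builder
        .appName("node_label_example")
        .config("spark.yarn.am.nodeLabelExpression", "highmem")
        .config("spark.yarn.executor.nodeLabelExpression", "highmem")
        .getOrCreate())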
Optimize Sqoop mappings on the Spark engine.
The Data Integration Service can optimize the performance of Sqoop pass-through mappings that run on the Spark engine.
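Under the hood, a Sqoop-style pass-through read is a parallel JDBC extraction: the source table is split on a numeric column and each split is read concurrently. A minimal PySpark analogue, assuming placeholder connection details, table, split column, and bounds:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parallel_jdbc_example").getOrCreate()

    # Split the read on a numeric column so eight partitions load in parallel.
    # The URL, table, column, bounds, and credentials are placeholders.
    df = spark.read.jdbc(
        url="jdbc:postgresql://dbhost:5432/sales",
        table="orders",
        column="order_id",
        lowerBound=1,
        upperBound=1000000,
        numPartitions=8,
        properties={"user": "etl_user", "password": "secret"},
    )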