Optimize mappings and the Hadoop environment to increase performance.
You can optimize mappings and the Hadoop environment in the following ways:
Mapping recommendations and analysis
You can use the recommendations and analysis that the Informatica CLAIRE engine generates to optimize mappings and projects, which reduces development costs and improves application performance. CLAIRE uses machine learning and internal algorithms to analyze mappings individually or in groups. The Developer tool displays the results of this analysis as recommendations and insights.
Configure a highly available Hadoop cluster
You can configure the Data Integration Service and the Developer tool to read from and write to a highly available Hadoop cluster. The steps to configure a highly available Hadoop cluster depend on the type of Hadoop distribution. For more information about configuration steps for a Hadoop distribution, see the Data Engineering Integration Guide.
Compress data on temporary staging tables
You can enable data compression on temporary staging tables to increase mapping performance.
Run mappings on the Blaze engine
Run mappings on the highly available Blaze engine. The Blaze engine enables restart and recovery of grid tasks and tasklets by default.
Perform parallel sorts
When you use a Sorter transformation in a mapping, the Data Integration Service enables parallel sorting by default when it pushes the mapping logic to the Hadoop cluster. Parallel sorting improves mapping performance.
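The following minimal sketch illustrates the general idea behind a parallel sort: split the rows into non-overlapping key ranges, sort each range independently, and concatenate the results. It is not the Data Integration Service implementation. The function name parallel_sort and the parameter num_partitions are hypothetical, and Python threads are used only to show the structure of the technique.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def parallel_sort(rows, num_partitions=4):
    """Range-partition rows, sort each partition concurrently, then concatenate."""
    if num_partitions < 2 or len(rows) < num_partitions:
        return sorted(rows)

    # Pick split points from a sorted sample so partitions are roughly balanced.
    sample = sorted(rows[:: max(1, len(rows) // 100)])
    step = len(sample) / num_partitions
    splits = [sample[int(i * step)] for i in range(1, num_partitions)]

    # Route every row to the partition whose key range contains it.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[bisect_right(splits, row)].append(row)

    # Sort partitions independently. Their key ranges do not overlap,
    # so concatenating them in order yields a fully sorted result.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        sorted_parts = list(pool.map(sorted, partitions))
    return [row for part in sorted_parts for row in part]
```

On a cluster, each partition would be sorted by a separate task on a separate node, which is where the performance gain comes from.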
Partition Joiner transformations
When you use a Joiner transformation in a Blaze engine mapping, the Data Integration Service can apply map-side join optimization to improve mapping performance. The Data Integration Service applies map-side join optimization if the master table is smaller than the detail table. When the Data Integration Service applies map-side join optimization, it moves the data to the Joiner transformation without the cost of shuffling the data.
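The following minimal sketch shows how a map-side join works in general, assuming that the master input is small enough to hold in memory. It is an illustration of the technique, not the Blaze engine implementation. The function name map_side_join and the sample column names are hypothetical.

```python
from collections import defaultdict


def map_side_join(master_rows, detail_rows, key):
    """Inner-join two lists of dicts using an in-memory hash table on the master side."""
    # Build phase: index the smaller (master) input by join key.
    lookup = defaultdict(list)
    for m in master_rows:
        lookup[m[key]].append(m)

    # Probe phase: stream detail rows past the lookup table. No row is
    # repartitioned or sorted by key, which is what avoids the shuffle.
    joined = []
    for d in detail_rows:
        for m in lookup.get(d[key], []):
            joined.append({**m, **d})
    return joined


# Hypothetical sample data: a small master table and a larger detail table.
customers = [{"cust_id": 1, "name": "Ada"}, {"cust_id": 2, "name": "Lin"}]
orders = [
    {"cust_id": 2, "amount": 40},
    {"cust_id": 1, "amount": 15},
    {"cust_id": 3, "amount": 9},
]
result = map_side_join(customers, orders, "cust_id")
```

Because only the master input is held in memory, the technique pays off when the master table is the smaller of the two inputs.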
Truncate partitions in a Hive target
You can truncate partitions in a Hive target to increase performance. To truncate partitions in a Hive target, you must select both the option to truncate the partition in the Hive target and the option to truncate the target table.
Assign resources on Hadoop clusters
You can use schedulers to assign resources on a Hadoop cluster. You can use a capacity scheduler or a fair scheduler depending on the needs of your organization.
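To make the idea of fair sharing concrete, the following minimal sketch computes a max-min fair allocation, which is one common way to formalize "an equal share for every job that can use it." It is a simplified model under stated assumptions, not the YARN fair scheduler code. The function name max_min_fair_share and the job names are hypothetical, and a single resource type is assumed.

```python
def max_min_fair_share(capacity, demands):
    """Split capacity so that no job gets more than it asks for and
    leftover capacity is shared equally among jobs that still want more."""
    allocation = {name: 0.0 for name in demands}
    remaining = float(capacity)
    unsatisfied = {n: float(d) for n, d in demands.items() if d > 0}

    while unsatisfied and remaining > 1e-9:
        # Offer every unsatisfied job an equal slice of what is left.
        share = remaining / len(unsatisfied)
        for name in list(unsatisfied):
            grant = min(share, unsatisfied[name])
            allocation[name] += grant
            remaining -= grant
            unsatisfied[name] -= grant
            if unsatisfied[name] <= 1e-9:
                del unsatisfied[name]  # fully satisfied jobs drop out
    return allocation


# Hypothetical demands, for example in GB of memory on a 100 GB cluster.
shares = max_min_fair_share(100, {"job_a": 20, "job_b": 50, "job_c": 70})
```

In this example the small job receives everything it asks for, and the two larger jobs split the remainder equally. A capacity scheduler instead reserves a configured portion of the cluster for each queue.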
Configure YARN queues to share resources on Hadoop clusters
You can configure YARN queues to redirect jobs on the Hadoop cluster to specific queues. The queue to which a job is assigned defines the resources that are allocated to perform the job.
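As an illustration of how queues map to resources, the following minimal sketch renders Capacity Scheduler properties for child queues of the root queue. The property names yarn.scheduler.capacity.root.queues and yarn.scheduler.capacity.root.<queue>.capacity come from the Apache Hadoop Capacity Scheduler. The queue names etl and adhoc, the percentages, and the helper function capacity_scheduler_xml are hypothetical. Verify the properties that apply to your Hadoop distribution before you change a cluster configuration.

```python
import xml.etree.ElementTree as ET


def capacity_scheduler_xml(queue_capacities):
    """Render Capacity Scheduler properties for child queues under root."""
    # Capacities of sibling queues are percentages of the parent queue.
    if abs(sum(queue_capacities.values()) - 100) > 1e-9:
        raise ValueError("child queue capacities under root must total 100")

    props = {"yarn.scheduler.capacity.root.queues": ",".join(queue_capacities)}
    for name, percent in queue_capacities.items():
        props[f"yarn.scheduler.capacity.root.{name}.capacity"] = str(percent)

    root = ET.Element("configuration")
    for prop_name, prop_value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = prop_name
        ET.SubElement(prop, "value").text = prop_value
    return ET.tostring(root, encoding="unicode")


# Hypothetical queues: 70% of cluster capacity for ETL jobs, 30% for ad hoc work.
xml_text = capacity_scheduler_xml({"etl": 70, "adhoc": 30})
```

The sketch covers only the cluster side. How a mapping job is directed to a particular queue is configured separately and is not shown here.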
Label nodes in a Hadoop cluster
You can label nodes in a Hadoop cluster to divide the cluster into partitions that have specific characteristics.
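The following minimal sketch models how labels divide a cluster: each node carries at most one label, and nodes without a label fall into a default partition. It is a conceptual model only and does not call any YARN API. The host names, the label names, and the function partition_by_label are hypothetical.

```python
from collections import defaultdict

DEFAULT_PARTITION = ""  # unlabeled nodes belong to the default partition


def partition_by_label(node_labels):
    """Group nodes into partitions keyed by label."""
    partitions = defaultdict(list)
    for node, label in node_labels.items():
        partitions[label or DEFAULT_PARTITION].append(node)
    return dict(partitions)


# Hypothetical cluster: two high-memory nodes, one GPU node, one unlabeled node.
cluster = {
    "node1.example.com": "highmem",
    "node2.example.com": "highmem",
    "node3.example.com": "gpu",
    "node4.example.com": None,
}
partitions = partition_by_label(cluster)
```

Jobs that need a specific characteristic, such as large memory, can then be restricted to the partition that provides it.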
Optimize Sqoop mappings on the Spark engine
The Data Integration Service can optimize the performance of Sqoop pass-through mappings that run on the Spark engine.
Enable data engineering recovery
You can enable data engineering recovery to recover mapping jobs that the Data Integration Service pushes to the Spark engine for processing.