Optimization for the Hadoop Environment

Optimize the Hadoop environment to increase performance.
You can optimize the Hadoop environment in the following ways:
Configure a highly available Hadoop cluster.
You can configure the Data Integration Service and the Developer tool to read from and write to a highly available Hadoop cluster. The steps to configure a highly available Hadoop cluster depend on the type of Hadoop distribution. For more information about configuration steps for a Hadoop distribution, see the Informatica Big Data Management Integration Guide.
Compress data on temporary staging tables.
You can enable data compression on temporary staging tables to increase mapping performance.
Run mappings on the Blaze engine.
Run mappings on the highly available Blaze engine. The Blaze engine enables restart and recovery of grid tasks and tasklets by default.
Perform parallel sorts.
When you use a Sorter transformation in a mapping, the Data Integration Service enables parallel sorting by default when it pushes the mapping logic to the Hadoop cluster. Parallel sorting improves mapping performance.
Partition Joiner transformations.
When you use a Joiner transformation in a Blaze engine mapping, the Data Integration Service can apply map-side join optimization to improve mapping performance. The Data Integration Service applies map-side join optimization if the master table is smaller than the detail table. When the Data Integration Service applies map-side join optimization, it moves the data to the Joiner transformation without the cost of shuffling the data.
Truncate partitions in a Hive target.
You can truncate partitions in a Hive target to increase performance. To truncate partitions in a Hive target, you must select both options: truncate the partition in the Hive target, and truncate the target table.
Assign resources on Hadoop clusters.
You can use schedulers to assign resources on a Hadoop cluster. You can use a capacity scheduler or a fair scheduler depending on the needs of your organization.
Configure YARN queues to share resources on Hadoop clusters.
You can configure YARN queues to redirect jobs on the Hadoop cluster to specific queues. The queue to which a job is assigned defines the resources that are allocated to perform the job.
Label nodes in a Hadoop cluster.
You can label nodes in a Hadoop cluster to divide the cluster into partitions that have specific characteristics.
Optimize Sqoop mappings on the Spark engine.
The Data Integration Service can optimize the performance of Sqoop pass-through mappings that run on the Spark engine.
Enable big data job recovery.
You can enable big data job recovery to recover mapping jobs that the Data Integration Service pushes to the Spark engine for processing.
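
The map-side join optimization described under "Partition Joiner transformations" can be illustrated with a small conceptual sketch. This is not the Blaze engine's implementation. It only shows the general idea, assuming the smaller master input fits in memory: the master rows are loaded into a lookup table once, and the larger detail input is streamed past it, so the detail rows never need to be redistributed (shuffled) by join key. The function name map_side_join, the row layout, and the sample data are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def map_side_join(master: Iterable[Row], detail: Iterable[Row], key: str) -> Iterator[Row]:
    """Inner-join detail rows against a small master input held in memory.

    Hypothetical illustration of a map-side (hash) join; not Informatica code.
    """
    # Build an in-memory lookup from the smaller (master) input.
    lookup: Dict[object, List[Row]] = {}
    for m in master:
        lookup.setdefault(m[key], []).append(m)

    # Stream the larger (detail) input. Each row is matched locally against
    # the lookup, so no shuffle of detail rows by join key is required.
    for d in detail:
        for m in lookup.get(d[key], []):
            yield {**m, **d}


# Hypothetical sample data: a small master table and a larger detail table.
master_rows = [
    {"cust_id": 1, "name": "Ada"},
    {"cust_id": 2, "name": "Lin"},
]
detail_rows = [
    {"cust_id": 1, "amount": 30},
    {"cust_id": 2, "amount": 45},
    {"cust_id": 1, "amount": 12},
    {"cust_id": 3, "amount": 99},  # no matching master row; dropped by the inner join
]

joined = list(map_side_join(master_rows, detail_rows, "cust_id"))
for row in joined:
    print(row)
```

The sketch also suggests why the optimization applies only when the master table is smaller than the detail table: the master side must be held in memory wherever detail rows are processed, so the cost of keeping that lookup grows with the size of the master input.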


Updated July 10, 2020