Table of Contents


  1. Preface
  2. Part 1: Hadoop Integration
  3. Part 2: Databricks Integration
  4. Appendix A: Managing Distribution Packages
  5. Appendix B: Connections Reference

Cluster Integration Overview

You can integrate the Informatica domain with Hadoop clusters through Data Engineering Integration.
The Data Integration Service automatically installs Hadoop binaries to integrate the Informatica domain with the Hadoop environment.
The integration requires Informatica connection objects and cluster configurations. A cluster configuration is a domain object that contains configuration parameters that you import from the Hadoop cluster. You then associate the cluster configuration with connections to access the Hadoop environment.
Perform the following tasks to integrate the Informatica domain with the Hadoop environment:
  1. Install or upgrade to the current Informatica version.
  2. Perform pre-import tasks, such as verifying system requirements and user permissions.
  3. Import the cluster configuration into the domain. The cluster configuration contains properties from the *-site.xml files on the cluster.
  4. Create a Hadoop connection and other connections to run mappings within the Hadoop environment.
  5. Perform post-import tasks specific to the Hadoop distribution that you integrate with.
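Step 3 imports cluster properties from the *-site.xml files, which follow Hadoop's standard name/value XML layout. As a hypothetical illustration of the kind of data that import reads (the parser and sample values below are not Informatica APIs, only a sketch of the file format):

```python
import xml.etree.ElementTree as ET

def parse_site_xml(xml_text: str) -> dict:
    """Collect the <property> name/value pairs from a Hadoop *-site.xml file.

    Illustrative only: a cluster configuration import reads properties
    of this shape from files such as core-site.xml and hdfs-site.xml.
    """
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name] = prop.findtext("value")
    return props

# Example core-site.xml fragment with an illustrative host name.
sample = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>"""

print(parse_site_xml(sample)["fs.defaultFS"])
```

The same flat name/value structure appears in all of the *-site.xml files, which is why a single import step can gather configuration from several of them.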
When you run a mapping, the Data Integration Service checks for the binary files on the cluster. If the files do not exist or are not synchronized, the Data Integration Service prepares them for transfer and transfers them to the distributed cache through the Informatica Hadoop staging directory on HDFS. By default, the staging directory is /tmp. This transfer process eliminates the need to install distribution packages on the Hadoop cluster.
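The check described above amounts to a simple decision: transfer the binaries only when they are absent on the cluster or differ from the local copies. A minimal sketch of that decision, assuming a checksum comparison (the function names and use of SHA-256 are illustrative assumptions, not Informatica's actual implementation):

```python
import hashlib
from typing import Optional

def checksum(data: bytes) -> str:
    # Hypothetical fingerprint of a binary payload; the real service's
    # synchronization mechanism is not documented here.
    return hashlib.sha256(data).hexdigest()

def needs_transfer(local_sum: str, remote_sum: Optional[str]) -> bool:
    """Decide whether binaries must be staged to HDFS.

    remote_sum is None when the binaries do not exist on the cluster;
    a mismatch means they exist but are not synchronized.
    """
    if remote_sum is None:
        return True
    return local_sum != remote_sum

local = checksum(b"hadoop-binaries")
print(needs_transfer(local, None))                     # missing on cluster
print(needs_transfer(local, local))                    # already in sync
print(needs_transfer(local, checksum(b"stale-copy")))  # out of sync
```

Only when the decision is True would the files move through the staging directory (by default /tmp) into the distributed cache.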

