Configure the Hadoop run-time environment in the Developer tool to optimize mapping performance and to process data sets larger than 10 terabytes. In the Hadoop environment, the Data Integration Service pushes the processing to nodes on a Hadoop cluster. When you select the Hadoop environment, you can also select the engine that pushes the mapping logic to the Hadoop cluster.
You can run standalone mappings and mappings that are part of a workflow in the Hadoop environment.
Based on the mapping logic, the Hadoop environment can use the following engines to push processing to nodes on a Hadoop cluster:
Informatica Blaze engine. An Informatica proprietary engine for distributed processing on Hadoop.
Spark engine. A high-performance engine for batch processing that can run on a Hadoop cluster or on a Spark standalone mode cluster.
Hive engine. A batch processing engine that uses Hadoop technology such as MapReduce or Tez.
You can select which engines the Data Integration Service can use. Informatica recommends that you select all engines. When you select more than one engine, the Data Integration Service determines during validation which of the selected engines is best suited to run the mapping.
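The selection step can be pictured with a small sketch. It is an illustration only and not Informatica's actual algorithm: the preference order, the capability sets, the feature names, and the names ENGINE_PREFERENCE, SUPPORTED, and choose_engine are all hypothetical.

```python
from typing import Optional, Set

# Hypothetical preference order; the text above does not state one.
ENGINE_PREFERENCE = ["blaze", "spark", "hive"]

# Hypothetical capability sets: placeholder features each engine can run.
SUPPORTED = {
    "blaze": {"feature_a", "feature_b"},
    "spark": {"feature_a", "feature_b", "feature_c"},
    "hive": {"feature_a"},
}


def choose_engine(selected: Set[str], required: Set[str]) -> Optional[str]:
    """Return the first selected engine that supports every required feature."""
    for engine in ENGINE_PREFERENCE:
        if engine in selected and required <= SUPPORTED[engine]:
            return engine
    return None  # no selected engine can run the mapping


print(choose_engine({"blaze", "spark", "hive"}, {"feature_a", "feature_c"}))
```

Under these assumptions, selecting every engine gives the selection step the most candidates to choose from, which is consistent with the recommendation to select all engines.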
When you run a mapping in the Hadoop environment, you must configure a Hadoop connection for the mapping. When you edit the Hadoop connection, you can set the run-time properties for the Hadoop environment and the properties for the engine that runs the mapping.
You can view the execution plan for a mapping that runs in the Hadoop environment. The execution plan is specific to the engine that the Data Integration Service selects to run the mapping.
You can monitor Hive queries and Hadoop jobs in the Monitoring tool. You can also monitor the jobs on a Hadoop cluster with the YARN Web User Interface or the Blaze Job Monitor web application.
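Outside the tools named above, the same YARN job information can usually be read from the ResourceManager REST API. The following is a minimal sketch, assuming the ResourceManager listens on the default web port 8088. The host name rm.example.com, the helper names apps_url and summarize_apps, and the sample values are hypothetical, and the sketch parses a sample response instead of calling a live cluster.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode


def apps_url(rm_host: str, state: str = "RUNNING", port: int = 8088) -> str:
    """Build the ResourceManager REST URL that lists applications in a state."""
    query = urlencode({"states": state})
    return f"http://{rm_host}:{port}/ws/v1/cluster/apps?{query}"


def summarize_apps(payload: str) -> List[Dict[str, object]]:
    """Extract id, name, state, and progress from a cluster apps JSON response."""
    body = json.loads(payload)
    # "apps" is null when no application matches the query.
    apps = (body.get("apps") or {}).get("app", [])
    return [
        {"id": a["id"], "name": a["name"], "state": a["state"], "progress": a["progress"]}
        for a in apps
    ]


# Sample response trimmed to the fields used above; the values are made up.
sample = (
    '{"apps": {"app": [{"id": "application_1_0001", "name": "demo_mapping", '
    '"state": "RUNNING", "progress": 42.0}]}}'
)
print(apps_url("rm.example.com"))
print(summarize_apps(sample))
```

To query a real cluster, fetch the URL returned by apps_url with an HTTP client and pass the response body to summarize_apps. Secured clusters typically require additional authentication that this sketch does not cover.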
The Data Integration Service logs messages from the DTM, the Blaze engine, the Spark engine, and the Hive engine in the run-time log files.