Table of Contents

Search

  1. Preface
  2. Introduction to Data Engineering Administration
  3. Authentication
  4. Running Mappings on a Cluster with Kerberos Authentication
  5. Authorization
  6. Cluster Configuration
  7. Cloud Provisioning Configuration
  8. Data Integration Service Processing
  9. Appendix A: Connections Reference
  10. Appendix B: Monitoring REST API

Data Engineering Administrator Guide

Data Engineering Administrator Guide

Common Properties

Common Properties

The following table describes the common connection properties that you configure for the Hadoop connection:
Property
Description
Impersonation User Name
Required if the Hadoop cluster uses Kerberos authentication. Hadoop impersonation user. The user name that the Data Integration Service impersonates to run mappings in the Hadoop environment.
Data Engineering Integration supports operating system profiles on all Hadoop distributions. In the Hadoop run-time environment, the Data Integration Service pushes the processing to the Hadoop cluster and the run-time engines run mappings with the operating system profile specified Hadoop impersonation properties.
Temporary Table Compression Codec
Hadoop compression library for a compression codec class name.
The Spark engine does not support compression settings for temporary tables. When you run mappings on the Spark engine, the Spark engine stores temporary tables in an uncompressed file format.
Codec Class Name
Codec class name that enables data compression and improves performance on temporary staging tables.
Hive Staging Database Name
Namespace for Hive staging tables. Use the name
default
for tables that do not have a specified database name.
If you do not configure a namespace, the Data Integration Service uses the Hive database name in the Hive target connection to create staging tables.
When you run a mapping in the native environment to write data to Hive, you must configure the Hive staging database name in the Hive connection. The Data Integration Service ignores the value you configure in the Hadoop connection.
Environment SQL
SQL commands to set the Hadoop environment. The Data Integration Service executes the environment SQL at the beginning of each Hive script generated by a HiveServer2 job.
The following rules and guidelines apply to the usage of environment SQL:
  • You can use environment SQL to define Hadoop or Hive parameters that you want to use in the PreSQL commands or in custom queries.
  • If you use multiple values for the Environment SQL property, ensure that there is no space between the values.
Engine Type
The Data Integration Service uses HiveServer2 to process portions of some jobs by running HiveServer2 tasks on the Spark engine. When you import the cluster configuration through the Administrator tool, you can choose to create connections. The engine type property is populated by default based on the distribution.
When you manually create a connection, you must configure the engine type.
You can specify the engine type based on the following Hadoop distributions:
  • Amazon EMR. Tez
  • Azure HDI. Tez
  • Cloudera CDH. MRv2
  • Cloudera CDP. Tez
  • Dataproc. MRv2
  • Hortonworks HDP. Tez
  • MapR. MRv2
Advanced Properties
List of advanced properties that are unique to the Hadoop environment. The properties are common to the Blaze and Spark engines. The advanced properties include a list of default properties.
You can configure run-time properties for the Hadoop environment in the Data Integration Service, the Hadoop connection, and in the mapping. You can override a property configured at a high level by setting the value at a lower level. For example, if you configure a property in the Data Integration Service custom properties, you can override it in the Hadoop connection or in the mapping. The Data Integration Service processes property overrides based on the following priorities:
  1. Mapping custom properties set using
    infacmd ms runMapping
    with the
    -cp
    option
  2. Mapping run-time properties for the Hadoop environment
  3. Hadoop connection advanced properties for run-time engines
  4. Hadoop connection advanced general properties, environment variables, and classpaths
  5. Data Integration Service custom properties
When a mapping uses Hive Server 2 to run a job or parts of a job, you cannot override properties that are configured on the cluster level in preSQL or post-SQL queries or SQL override statements.
Workaround: Instead of attempting to use the cluster configuration on the domain to override cluster properties, pass the override settings to the JDBC URL. For example:
beeline -u "jdbc:hive2://<domain host>:<port_number>/tpch_text_100" --hiveconf hive.execution.engine=tez
Informatica does not recommend changing these property values before you consult with third-party documentation, Informatica documentation, or Informatica Global Customer Support. If you change a value without knowledge of the property, you might experience performance degradation or other unexpected results.

0 COMMENTS

We’d like to hear from you!