User Guide

10.4.0

Spark Engine Monitoring

You can monitor statistics and view log events for a Spark engine mapping job in the Monitor tab of the Administrator tool. You can also monitor mapping jobs for the Spark engine in the YARN web user interface.
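
As a complement to the YARN web user interface, the same application list can be read programmatically from the ResourceManager REST API (the Cluster Applications endpoint, `/ws/v1/cluster/apps`). The following is a minimal sketch, not part of the Informatica product: the host name, port, and sample response are hypothetical, and the helper names are illustrative only.

```python
import json
from urllib.parse import urlencode


def yarn_apps_url(rm_base_url, states=("RUNNING",), app_type="SPARK"):
    """Build a ResourceManager REST URL that lists applications by state and type."""
    query = urlencode({"states": ",".join(states), "applicationTypes": app_type})
    return f"{rm_base_url.rstrip('/')}/ws/v1/cluster/apps?{query}"


def summarize_apps(payload):
    """Return (id, name, state) tuples from a Cluster Applications response body."""
    # YARN returns "apps": null when no application matches the filter.
    apps = (payload.get("apps") or {}).get("app") or []
    return [(app["id"], app["name"], app["state"]) for app in apps]


# Hypothetical ResourceManager address and a hand-written sample response.
url = yarn_apps_url("http://rm.example.com:8088")
sample = json.loads(
    '{"apps": {"app": [{"id": "application_1_0001", '
    '"name": "demo_mapping", "state": "RUNNING"}]}}'
)
print(url)
print(summarize_apps(sample))
```

To use the sketch against a real cluster, fetch the URL with an HTTP client that satisfies the cluster's authentication requirements and pass the decoded JSON body to `summarize_apps`.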

The following image shows the Monitor tab in the Administrator tool:

The Monitor tab is selected in the Administrator tool. The Execution Statistics view is selected, and the navigator shows Ad Hoc Jobs selected on the left. A list of jobs appears in the contents panel.

The Monitor tab has the following views:

Summary Statistics

Use the Summary Statistics view to see graphical summaries of object states and distribution across the Data Integration Services. You can also view graphs of the memory and CPU that the Data Integration Services used to run the objects.

Execution Statistics

Use the Execution Statistics view to monitor properties, run-time statistics, and run-time reports. In the Navigator, you can expand a Data Integration Service to monitor Ad Hoc Jobs or expand an application to monitor deployed mapping jobs or workflows.

When you select Ad Hoc Jobs, deployed mapping jobs, or workflows from an application in the Navigator of the Execution Statistics view, a list of jobs appears in the contents panel. The contents panel displays jobs that are in the queued, running, completed, failed, aborted, and cancelled states. The Data Integration Service submits jobs in the queued state to the cluster when enough resources are available.

The contents panel groups related jobs based on the job type. You can expand a job type to view the related jobs under it.

Access the following views in the Execution Statistics view:

Properties: The Properties view shows the general properties of the selected job, such as the name, job type, user who started the job, and start time.
Spark Execution Plan: When you view the Spark execution plan for a mapping, the Data Integration Service translates the mapping to a Scala program and an optional set of commands. The execution plan shows the commands and the Scala program code.
Summary Statistics: The Summary Statistics view appears in the details panel when you select a mapping job in the contents panel. The Summary Statistics view displays the following throughput statistics for the job:

Pre Job Task. The name of each job task that reads source data and stages the row data to a temporary table before the Spark job runs. You can also view the bytes and average bytes processed for each second. If you enable recovery for a Sqoop mapping, the pre job task statistics do not appear.
Source. The name of the source.
Target. The name of the target.
Rows. For source, the number of rows read by the Spark application. For target, the sum of rows written to the target and rejected rows.
Post Job Task. The name of each job task that writes target data from the staged tables. You can also view the bytes and average bytes processed for each second.

When a mapping contains a Union transformation with multiple upstream sources, the sources appear in a comma-separated list in a single row under Sources.

In a Hive mapping with an Update Strategy transformation containing the DD_UPDATE condition, the target contains only the temporary tables after the Spark job runs. The result of the mapping job statistics appears in the post job task and indicates twice the number of records updated.
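
The row accounting described above can be made concrete with a small worked example. This is only an illustrative sketch of the arithmetic as the text states it; the function names and the sample counts are hypothetical and do not correspond to any Informatica API.

```python
def target_rows_statistic(rows_written, rows_rejected):
    """Rows value shown for a target: rows written plus rejected rows."""
    return rows_written + rows_rejected


def dd_update_post_job_count(records_updated):
    """Count shown in the post job task for a Hive mapping that uses DD_UPDATE.

    Per the description above, the statistic indicates twice the number of
    records updated.
    """
    return 2 * records_updated


# Hypothetical run: 950 rows written, 50 rejected, 400 records updated.
print(target_rows_statistic(950, 50))
print(dd_update_post_job_count(400))
```

When you reconcile monitoring output with source-system counts, halve the post job task figure for DD_UPDATE mappings before comparing.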

The following image shows the Summary Statistics view in the details panel for a mapping run on the Spark engine:

You can also view the Spark run stages information in the details panel of the Summary Statistics view, in the Execution Statistics view of the Monitor tab. It appears as a list after the sources and before the targets.

Spark Run Stages displays the absolute counts and throughput of rows and bytes related to the Spark application stage statistics. Rows refer to the number of rows that the stage writes, and bytes refer to the bytes broadcast in the stage.

The following image displays the Spark Run Stages:

For example, the Spark Run Stages column contains the Spark application stage information, starting with stage_<ID>. In the example, Stage_0 shows the statistics related to the Spark run stage with ID=0 in the Spark application.
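
If you copy stage statistics out of the Monitor tab for your own reporting, the stage label and throughput can be handled as in the following sketch. It assumes only the stage_<ID> naming shown above; the helper names, the case-insensitive matching, and the sample numbers are illustrative assumptions rather than product behavior.

```python
import re

# Matches labels such as "stage_0" or "Stage_12"; case-insensitive because
# the text above shows both spellings.
STAGE_LABEL = re.compile(r"^stage_(\d+)$", re.IGNORECASE)


def stage_id(label):
    """Extract the numeric stage ID from a Spark run stage label."""
    match = STAGE_LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not a stage label: {label!r}")
    return int(match.group(1))


def throughput(count, elapsed_seconds):
    """Average rows or bytes processed per second for one stage."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return count / elapsed_seconds


# Hypothetical figures: a stage that wrote 1000 rows in 4 seconds.
print(stage_id("Stage_0"))
print(throughput(1000, 4))
```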

When the Spark engine reads source data that includes a self-join and verbose data is enabled, the optimized mapping from the Spark application does not contain any information on the second instance of the same source in the Spark engine logs.

When you read data from a temporary table and the Hive query of the customized data object causes the data to be shuffled, the Spark engine log shows the filtered source statistics instead of the statistics for reading from the temporary source table.

When you run a mapping with Spark monitoring enabled, performance varies based on the mapping complexity. With monitoring enabled, processing can take up to three times longer than usual. By default, monitoring is disabled.

Detailed Statistics: The Detailed Statistics view appears in the details panel when you select a mapping job in the contents panel. The Detailed Statistics view displays a graph of the row count for the job run.

The following image shows the Detailed Statistics view in the details panel for a mapping run on the Spark engine: