You can monitor statistics and view log events for a Spark engine mapping job in the Monitor tab of the Administrator tool. You can also monitor mapping jobs for the Spark engine in the YARN web user interface.
The following image shows the Monitor tab in the Administrator tool:
The Monitor tab has the following views:
Summary Statistics
Use the Summary Statistics view to see graphical summaries of object states and their distribution across the Data Integration Services. You can also view graphs of the memory and CPU that the Data Integration Services used to run the objects.
Execution Statistics
Use the Execution Statistics view to monitor properties, run-time statistics, and run-time reports. In the Navigator, you can expand a Data Integration Service to monitor Ad Hoc Jobs, or expand an application to monitor deployed mapping jobs or workflows.
When you select Ad Hoc Jobs, deployed mapping jobs, or workflows from an application in the Navigator of the Execution Statistics view, a list of jobs appears in the contents panel. The contents panel displays jobs that are in the queued, running, completed, failed, aborted, or cancelled state. The Data Integration Service submits jobs in the queued state to the cluster when enough resources are available.
The contents panel groups related jobs based on the job type. You can expand a job type to view the related jobs under it.
Access the following views in the Execution Statistics view:
Properties
The Properties view shows general properties of the selected job, such as the name, job type, user who started the job, and start time.
Spark Execution Plan
When you view the Spark execution plan for a mapping, the Data Integration Service translates the mapping to a Scala program and an optional set of commands. The execution plan shows the commands and the Scala program code.
Summary Statistics
The Summary Statistics view appears in the details panel when you select a mapping job in the contents panel. The Summary Statistics view displays the following throughput statistics for the job:
Pre Job Task. The name of each job task that reads source data and stages the row data to a temporary table before the Spark job runs. You can also view the bytes processed and the average bytes processed per second.
Source. The name of the source.
Target. The name of the target.
Rows. For source, the number of rows read by the Spark application. For target, the sum of rows written to the target and rejected rows.
Post Job Task. The name of each job task that writes target data from the staged tables. You can also view the bytes processed and the average bytes processed per second.
When a mapping contains a Union transformation with multiple upstream sources, the sources appear as a comma-separated list in a single row under Sources.
In a Hive mapping with an Update Strategy transformation containing the DD_UPDATE condition, the target contains only the temporary tables after the Spark job runs. The result of the mapping job statistics appears in the post job task and indicates twice the number of records updated.
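The relationships between the statistics described above come down to simple arithmetic. The following is a minimal sketch, not part of the Administrator tool: the function and parameter names are hypothetical, and it assumes that the average bytes per second is the total bytes divided by the elapsed time of the task.

```python
def target_rows(rows_written: int, rows_rejected: int) -> int:
    """Rows value for a target: rows written to the target plus rejected rows."""
    return rows_written + rows_rejected


def average_bytes_per_second(total_bytes: int, elapsed_seconds: float) -> float:
    """Assumed definition: total bytes processed divided by elapsed seconds."""
    if elapsed_seconds <= 0:
        return 0.0  # avoid division by zero for tasks with no measurable duration
    return total_bytes / elapsed_seconds


def updated_records(reported_by_post_job_task: int) -> int:
    """For a Hive mapping with DD_UPDATE, the post job task indicates twice
    the number of records updated, so halve the reported value."""
    return reported_by_post_job_task // 2
```

For instance, a target with 980 written rows and 20 rejected rows would show 1000 in Rows, and a post job task that reports 200 records for a DD_UPDATE mapping corresponds to 100 updated records.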
The following image shows the Summary Statistics view in the details panel for a mapping run on the Spark engine:
You can also view Spark run stage information in the details panel of the Summary Statistics view, under Execution Statistics on the Monitor tab. The stages appear as a list after the sources and before the targets.
Spark Run Stages displays the absolute counts and throughput of rows and bytes for the Spark application stage statistics. Rows refers to the number of rows that the stage writes, and bytes refers to the bytes broadcast in the stage.
The following image displays the Spark Run Stages:
For example, the Spark Run Stages column contains the Spark application staged information starting with stage_<ID>. In the example, Stage_0 shows the statistics related to the Spark run stage with ID=0 in the Spark application.
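If you copy stage labels out of the Spark Run Stages column and want to match them to Spark stage IDs, the naming pattern above can be parsed as follows. This is a minimal sketch: the helper name `stage_id` is hypothetical, and it assumes that labels follow the `stage_<ID>` pattern with a numeric ID and that capitalization may vary, as in `Stage_0`.

```python
import re
from typing import Optional

# Assumed label format: "stage_<ID>" with a numeric ID, matched case-insensitively.
_STAGE_LABEL = re.compile(r"^stage_(\d+)$", re.IGNORECASE)


def stage_id(label: str) -> Optional[int]:
    """Return the numeric Spark stage ID from a label, or None if it does not match."""
    match = _STAGE_LABEL.match(label.strip())
    return int(match.group(1)) if match else None
```

With this helper, `stage_id("Stage_0")` yields the ID 0, while a label that is not a stage, such as a source or target name, yields `None`.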
Consider a scenario in which the Spark engine reads source data that includes a self-join with verbose data enabled. In this case, the optimized mapping from the Spark application does not contain any information about the second instance of the same source in the Spark engine logs.
Consider a scenario in which you read data from the temporary table and the Hive query of the customized data object causes the data to be shuffled. In this case, the Spark engine log shows the filtered source statistics instead of the statistics for reading from the temporary source table.
When you run a mapping with Spark monitoring enabled, performance varies based on the mapping complexity. With monitoring enabled, processing can take up to three times longer than usual. By default, monitoring is disabled.
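To plan for the overhead, you can estimate an upper bound on the run time. The following is a rough sketch with a hypothetical function name; it assumes that "up to three times longer" means at most three times the usual processing time, and the actual overhead depends on the mapping complexity.

```python
# Assumption: worst case is three times the usual processing time.
MAX_MONITORING_FACTOR = 3.0


def worst_case_runtime(baseline_seconds: float, monitoring_enabled: bool) -> float:
    """Upper-bound estimate of the run time, given the usual (unmonitored) run time."""
    if monitoring_enabled:
        return baseline_seconds * MAX_MONITORING_FACTOR
    return baseline_seconds
```

For a mapping that usually takes 10 minutes (600 seconds), this estimate allows for up to 30 minutes (1800 seconds) with monitoring enabled.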
Detailed Statistics
The Detailed Statistics view appears in the details panel when you select a mapping job in the contents panel. The Detailed Statistics view displays a graph of the row count for the job run.
The following image shows the Detailed Statistics view in the details panel for a mapping run on the Spark engine: