Performance Tuning and Sizing Guidelines for Informatica® Big Data Management 10.2.2


Mass Ingestion Service

Mass ingestion is a big data solution that you can use to replicate or ingest data from different relational sources to a data lake or a Hadoop cluster.

To improve job performance, consider the best practices in the following areas:

Data Integration Service deployment: Create a Data Integration Service that is dedicated to mass ingestion jobs. Deploy mass ingestion specifications to the dedicated service.
Sqoop concurrency: The Sqoop pool size determines the number of deployed Sqoop jobs that can run concurrently in the Hadoop environment.

By default, the Sqoop pool size is 100. To disable the Hadoop batch execution pool, set the Maximum Hadoop Batch Execution Pool Size property to -1. The Data Integration Service then controls Sqoop concurrency itself and treats each Sqoop job equally.

The following table describes the Maximum Hadoop Batch Execution Pool Size property:

Property: Maximum Hadoop Batch Execution Pool Size
Description: Maximum number of deployed jobs that can run concurrently in the Hadoop environment. The Data Integration Service moves Hadoop mapping jobs from the queue to the Hadoop job pool when enough resources are available. Default: 100.
Recommended value: -1

See the "Best Practices for Highly Concurrent Workloads" section of the Data Integration Service topic.
Relational database concurrency: To allow concurrent Sqoop jobs to establish connections to the database, ensure that the database supports concurrent database connections.
Sqoop performance: Mass ingestion uses Sqoop jobs to ingest data from relational tables to Hive or HDFS targets on the Hadoop cluster. By default, each Sqoop job spawns 4 tasks. Each task establishes one connection to the relational database.

To reconfigure the number of tasks in a Sqoop job, configure the following Sqoop argument in the JDBC connection:

Argument: --num-mappers or -m
Description: Number of mappers (tasks) to run concurrently. Default is 4 when Sqoop jobs run on the Spark engine.

When you reconfigure the JDBC connection, the changes affect all Sqoop jobs that use the connection. For example, if you reconfigure the JDBC connection to increase the number of tasks to 10, then 50 concurrent Sqoop jobs spawn 500 tasks and require 500 database connections.
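The arithmetic in this example can be sketched as a small Python helper for capacity planning. The function name and parameters are hypothetical and are not part of any Informatica or Sqoop API. The sketch assumes that every concurrent job uses the same mapper count and that each mapper holds exactly one database connection, as described above.

```python
def required_db_connections(concurrent_jobs: int, mappers_per_job: int = 4) -> int:
    """Estimate peak database connections for concurrent Sqoop jobs.

    Assumption: each mapper (task) opens exactly one connection, and all
    jobs share the same JDBC connection settings.
    """
    if concurrent_jobs < 0:
        raise ValueError("concurrent_jobs must be zero or greater")
    if mappers_per_job < 1:
        raise ValueError("mappers_per_job must be at least 1")
    return concurrent_jobs * mappers_per_job


if __name__ == "__main__":
    # 50 concurrent jobs with 10 mappers each, as in the example above
    print(required_db_connections(50, 10))
    # 50 concurrent jobs with the default of 4 mappers each
    print(required_db_connections(50))
```

Comparing the result with the connection limit of the source database can indicate whether a higher mapper count is safe before you change the JDBC connection.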
Polling time: The polling time determines how often the Mass Ingestion tool updates the status of ingestion jobs.

By default, the polling time is 30 seconds. You can decrease the polling time to increase the number of requests and refresh the ingestion job status more frequently.

To configure the polling time, set the following custom property on the Mass Ingestion Service:
POLLING_TIME=<time in seconds>
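To get a rough sense of the trade-off, the following Python sketch estimates how many status refreshes a given polling time produces per hour and formats the custom property line. It assumes one status request per polling interval, which is a simplification and not a documented guarantee. The function names are hypothetical.

```python
def status_requests_per_hour(polling_time_seconds: int = 30) -> float:
    """Estimate status refreshes per hour.

    Assumption: the tool issues one status request per polling interval.
    """
    if polling_time_seconds <= 0:
        raise ValueError("polling_time_seconds must be positive")
    return 3600 / polling_time_seconds


def polling_property(polling_time_seconds: int) -> str:
    """Format the custom property in the NAME=value form shown above."""
    if polling_time_seconds <= 0:
        raise ValueError("polling_time_seconds must be positive")
    return f"POLLING_TIME={polling_time_seconds}"


if __name__ == "__main__":
    for seconds in (30, 10):
        print(polling_property(seconds), status_requests_per_hour(seconds))
```

Under this assumption, reducing the polling time from 30 seconds to 10 seconds triples the number of status requests.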
Mass ingestion specifications: When you create a mass ingestion specification, ingest no more than 2,000 relational tables in the specification.
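If a source contains more than 2,000 tables, one option is to divide the table list across several specifications. The following Python sketch shows that grouping step only. The function name and the table names are hypothetical, and the sketch does not call any Informatica API to create the specifications.

```python
from typing import List, Sequence

# Upper bound on tables per specification, taken from the guideline above.
MAX_TABLES_PER_SPEC = 2000


def split_into_specifications(
    tables: Sequence[str], limit: int = MAX_TABLES_PER_SPEC
) -> List[List[str]]:
    """Group table names into batches of at most `limit` tables each."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [list(tables[i:i + limit]) for i in range(0, len(tables), limit)]


if __name__ == "__main__":
    # Hypothetical table names used only for illustration
    all_tables = [f"table_{n}" for n in range(4500)]
    batches = split_into_specifications(all_tables)
    print([len(batch) for batch in batches])
```

Each resulting batch can then be used as the table list for a separate mass ingestion specification.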
