Profiling and Discovery Sizing Guidelines

Back Next

Data Integration Service Parameters

You need to configure the Temporary Directories and Maximum Execution Pool Size parameters for the Data Integration Service. You can configure many parameters, such as Reserved Profile Threads and Maximum DB Connections, that apply to the Profiling Service Module. Before you use the parameter recommendations, verify that you have identified a node or grid and the requirement is to configure the node or grid optimally to run profiles.

Profiling Service Module and Performance

When you configure the Data Integration Service parameters specific to Profiling Service Module, you must choose whether to optimize the Profiling Service Module for the average usage or peak usage.

If you optimize the Profiling Service Module for average usage, each profile job gets more resources. When the number of profile jobs exceeds the average threshold, the persistent queue queues up the remaining profile jobs until one or more of the jobs that run finish. When you optimize the Profiling Service Module for peak usage, more profile jobs run concurrently with reduced amount of resources. Therefore, the profile jobs might take longer time to run. The peak usage configuration increases the throughput. The Data Integration Service runs profile jobs with higher priority before the lower priority jobs irrespective of the Profiling Service Module optimization.

Temporary Directories

Location of temporary directories for the Data Integration Service process on the node. This parameter has an impact on the Data Integration Service machine.

Reserved Profiling Threads

Number of threads of the Maximum Execution Pool Size parameter that are for priority requests. This parameter has an impact on the Profiling Service Module. The reserved profiling threads are the threads that the Profiling Service Module keeps in reserve to perform drill-down tasks and profile export tasks. The Profiling Service Module uses these reserved threads when it uses all the other threads for profile jobs. If there are no profiling jobs to run, the Profiling Service Module does not use the reserved threads.

The Profiling Service Module uses the nonprofile threads to perform a data preview job because the data preview job is a common service for all tools. The Mapping Service Module runs the service. The Data Integration Service considers the preview jobs as part of the total number of threads and not the Profiling Service Module.

Maximum Concurrent Profile Jobs

The maximum number of concurrent profile threads to run a profile on flat files. If you do not set this parameter, the Profiling Service Module determines the best number based on the set of running profile jobs and environment factors. This parameter has an impact on the Profiling Service Module. You can consider the default value of 5 for the Maximum Concurrent Profile Jobs parameter as the minimum value. To increase performance, you can increase the parameter value to match the number of CPU cores in the Data Integration Service machine. If you use a grid configuration, you can aggregate the number of cores for all the nodes in the grid to determine the total number of CPU cores.

Maximum DB Connections

The Maximum DB Connections parameter controls the number of parallel queries across all profiling jobs for a relational source. You can set the parameter to a value that the source database with the lowest concurrent profile run requests. The Profiling Service Module then does not impact the performance of any of the databases that you run a profile on. This parameter has an impact on the relational database sources that you run a profile on.

In the Database Server column of the following worksheet, enter the names of the databases. You can then enter the number of cores for each database server in the Cores column. The Max % Utilization is the percentage of the maximum use of the CPU cores in the database server. If the database has multiple cores, such as 64 or 128, you can set the Max % Utilization value to 90%. Otherwise, you can set the value to 80%. A guideline to increase performance is to verify that the temporary tablespace resides on a separate hard drive.

Enter the percentage values for each database server in the Max % Utilization column. Multiply the values in the Cores and Max % Utilization columns for each row and update the Cores x Max % Utilization column.

Use the following worksheet to record the values:

Database Server	Cores	Max % Utilization	Cores x Max % Utilization

Set the Maximum DB Connections parameter to the minimum value of all the values in the Cores x Max % Utilization column.

Maximum Execution Pool Size

The Maximum Execution Pool Size parameter determines the maximum number of requests that the Data Integration Service can run concurrently. Requests include data previews, mappings, and profiling jobs. This parameter has an impact on the Data Integration Service.

To calculate the total number of DTM threads for the Data Integration Service, consider all the threads for the Profiling Service Module and mappings from the other plugins and clients.

After you implement an initial configuration, you must monitor the Data Integration Service instance nodes. You need to verify that the Data Integration Service performs as expected and that you do not need to make additional adjustments. You must also perform long-term monitoring to verify that the requirements do not change over time. If the requirements change, you might need to make additional adjustments to the settings. You might also need to change other Profiling Service Module parameters that control how profiling jobs run on the Data Integration Service.

Use the following worksheet to calculate the total number of threads for the Data Integration Service. Enter the values in the Estimated Maximum Threads column for each row.

Use the following worksheet to record the values:

Activity	Estimated Maximum Threads
Profile run
Preview
Unplanned mappings
Deployed mappings
SQL Endpoint

Calculation

To calculate the Maximum Execution Pool Size parameter value, add the values in all the rows.

Maximum Profile Execution Pool Size

The Maximum Profile Execution Pool Size parameter determines the total number DTM threads that the Data Integration Service uses to run profiles. This parameter has an impact on the Profiling Service Module. You can calculate the number of DTM threads allocated to the Profiling Service Module based on the expected mix of profile jobs and the number of CPU cores.

To calculate the expected mix pf profile jobs, divide the mix of profile jobs into two groups before you calculate the number of DTM threads. The first group can include the profile jobs that the Data Integration Service cannot transfer to the relational source. The second group can include the profile jobs that the Data Integration Service can transfer to the relational source. You can use the first group proportion to compute the number of threads for the jobs that the system cannot transfer to the relational sources. You can then add this number to the Maximum DB Connections value to compute the final estimate for the number of DTM threads. Each profiling type uses a different number of CPU cores. Use different weights for different profile job types.

In the following worksheet, enter the values in the A and B columns as required. Multiply the values in the A and B columns for each row and then update the A x B column.

Use the following worksheet to record the values for profile operations that the Data Integration Service cannot transfer to the relational sources:

Profile Operation	Estimated Number for Each 100 Runs (A)	Factor (B)	A x B
Scorecard on a relational source Column profile on a relational source
Scorecard on a flat file source Column profile on a flat file Source		0.4
Data domain discovery		0.5
Key discovery		1.0
Functional dependency discovery		1.0
Overlap discovery		1.0
Foreign key discovery		1.0

Calculation

Add all the values in the A x B column and divide the total value by 100. You can then multiply the result with the number of CPU cores for the nodes in the Data Integration Service machine and the recommended load factor of 2.5. The final value is the number of DTM threads for profile jobs that the Data Integration Service cannot transfer to the relational source.

Final Calculation

The Max DB Connections parameter impacts the profile jobs that the Data Integration Service can transfer to the relational sources. To calculate the Maximum Profile Execution Pool Size value, add the following values:

The number of DTM threads for profile jobs that the Data Integration Service cannot transfer to the relational source

The Max DB Connections parameter value

The Reserved Profiling Threads parameter value

Maximum Concurrent Profile Threads

The Maximum Concurrent Profile Threads parameter determines the number of mappings that run in parallel when you run a column profile on a flat file data source or relational data source. This parameter has an impact on the profiling warehouse.

You can retain the default value of 1 for the parameter because each column profile job that runs on a flat file data source consumes approximately 2.3 CPU cores.

You can consider increasing the parameter value in the following scenarios:

There are not many column profile jobs on flat file sources and the Data Integration Service node has many CPU cores.

If you run the column profile jobs on a grid with many nodes that has an aggregate CPU core count and CPU usage.

Each column profile needs to run as fast as possible when the resources are available.

In the Value column of the following worksheet, enter the number of CPU cores that the Data Integration Service can use to run a column profile. You can then enter a value for the expected level of concurrency.

Use the following worksheet to record the values:

Metric	Value
Number of CPU cores available for column profiling
Expected level of concurrency that you want to configure

Calculation

To calculate the number of CPU cores available for each column profile job, divide the number of CPU cores available for column profiling value with the level of concurrency value. To calculate the Maximum Concurrent Profile Threads value, divide the number of CPU cores available for each column profile job with 2.3 and round off the value. The number of CPU cores that the Data Integration Service uses for each column profiling thread is 2.3.

Rename Saved Search

Table of Contents

Profiling and Discovery Sizing Guidelines

Profiling and Discovery Sizing Guidelines

Data Integration Service Parameters

Data Integration Service Parameters

Profiling Service Module and Performance

Temporary Directories

Reserved Profiling Threads

Maximum Concurrent Profile Jobs

Maximum DB Connections

Maximum Execution Pool Size

Maximum Profile Execution Pool Size

Maximum Concurrent Profile Threads