Profiling and Discovery Sizing Guidelines


Profiling Service Module Resources

The Profiling Service Module is a component of the Data Integration Service that manages requests to run profiles and generate scorecards. You can plan system performance more effectively if you understand the architecture of the Profiling Service Module and the performance characteristics of the data sources that it reads.

Supported Data Sources: The Profiling Service Module can access all supported data sources. Each category of data source has distinct performance characteristics. Understanding the differences between the categories helps you plan the profile deployment and troubleshoot performance.

The following data sources have different performance characteristics when you run profiles:

Data Source
Description

Flat file source

The Profiling Service Module reads each row in a flat file source. The Profiling Service Module can construct rows by reading bytes from a flat file as required.

When you run a profile on flat file data sources, the Profiling Service Module runs all the processing logic in the mapping, including sorting and buffering.

Relational source

Relational data sources contain an SQL query engine that you can use to view the data in a front-end application. The Profiling Service Module shares the processing logic with the relational database for some of the profile jobs.

If the Profiling Service Module and relational source are in two different machines, the Data Integration Service distributes the processing logic across the resources of the two machines.

You can optimize the relational source for the profile queries to improve performance.

Semi-structured source
Avro, JSON, Parquet, and XML formats are semi-structured data sources. You can create flat file data objects for JSON or XML data sources. You can create complex file data objects for Avro, JSON, Parquet, and XML data sources in Hadoop Distributed File System (HDFS).

Mainframe source

If the mainframe source is nonrelational, such as IMS or VSAM, the Profiling Service Module processes the source as a flat file.

Avoid sharing the SQL query processing with IBM DB2 sources, because mainframe access can result in additional charges or license fees.

The Profiling Service Module considers all relational mainframe sources as special flat files and performs all the processing logic. This method reduces the number of I/O operations on the mainframe source.

Other sources

The Profiling Service Module considers social media, PowerExchange, logical data object, and mapping transformation data sources as flat files.
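
The table above can be summarized as a lookup from data source category to the place where the profile processing logic runs. The following is a minimal sketch, not product code: the category labels and the `processing_model` function are hypothetical names chosen for illustration. The guide does not state a processing model for semi-structured sources, so the sketch reports that case as not stated.

```python
from enum import Enum


class ProcessingModel(Enum):
    PROFILING_SERVICE_ONLY = "Profiling Service Module runs all processing logic"
    SHARED_WITH_DATABASE = "processing logic shared with the relational database"
    NOT_STATED = "not stated in this guide"


# Hypothetical category labels. Each of these is described in the table
# as a flat file or as a source that is treated like a flat file.
_TREATED_AS_FLAT_FILE = {
    "flat_file",
    "mainframe_nonrelational",   # for example IMS or VSAM
    "mainframe_relational",      # treated as a special flat file
    "social_media",
    "powerexchange",
    "logical_data_object",
    "mapping_transformation",
}


def processing_model(source_category: str) -> ProcessingModel:
    """Return where the processing logic runs for a source category."""
    if source_category in _TREATED_AS_FLAT_FILE:
        return ProcessingModel.PROFILING_SERVICE_ONLY
    if source_category == "relational":
        # The guide says the logic is shared for some profile jobs, not all.
        return ProcessingModel.SHARED_WITH_DATABASE
    return ProcessingModel.NOT_STATED


print(processing_model("mainframe_relational").value)
```

A lookup like this can help when you estimate CPU needs, because sources that are treated as flat files place the full sorting and buffering load on the Data Integration Service machine.
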
Data Integration Service Resources: The Data Integration Service runs the Profiling Service Module, and it has base memory and variable memory requirements. The variable memory requirements are based on the number of parallel mappings.

The memory requirements are as follows:

Type
Requirements

Base Memory
The amount of memory required to run the Java Virtual Machine that the Data Integration Service uses, which is approximately 640 MB.

Variable Memory

The amount of memory required to run each Data Transformation Manager thread.

One Data Transformation Manager thread is required to run each mapping that computes a part of a profile job. This overhead depends on the Maximum Execution Pool Size property in the service properties. At the default value of 10, the overhead is approximately 1,000 MB.

A mapping requires additional memory to read address or identity reference data. A profile that reads the output of an address validation rule might require an additional 1 GB of memory to read and cache the address validation reference data.
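
The figures in the table can be combined into a rough memory estimate. The sketch below rests on assumptions the guide does not confirm: that the variable overhead scales linearly with the Maximum Execution Pool Size (the guide gives only the 1,000 MB figure for the default of 10), that 1 GB equals 1,024 MB, and that each mapping that reads address validation reference data caches its own copy. The function and constant names are hypothetical.

```python
BASE_MEMORY_MB = 640              # approximate JVM base memory
DEFAULT_POOL_SIZE = 10            # default Maximum Execution Pool Size
DEFAULT_POOL_OVERHEAD_MB = 1000   # approximate overhead at the default pool size
ADDRESS_REFERENCE_DATA_MB = 1024  # "additional 1 GB", taken as 1,024 MB


def estimate_memory_mb(pool_size: int = DEFAULT_POOL_SIZE,
                       address_validation_mappings: int = 0) -> float:
    """Rough memory estimate in MB for the Data Integration Service."""
    if pool_size < 1 or address_validation_mappings < 0:
        raise ValueError("pool_size must be >= 1 and mapping count >= 0")
    # ASSUMPTION: overhead grows linearly with the pool size.
    variable = DEFAULT_POOL_OVERHEAD_MB * pool_size / DEFAULT_POOL_SIZE
    # ASSUMPTION: each such mapping caches the reference data separately.
    reference = ADDRESS_REFERENCE_DATA_MB * address_validation_mappings
    return BASE_MEMORY_MB + variable + reference


print(estimate_memory_mb())        # default settings: 1640.0
print(estimate_memory_mb(20, 1))   # larger pool, one address validation mapping: 3664.0
```

Treat the result as a starting point for capacity planning and validate it against observed memory use in your environment.
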
Profiling Service Module Resources: The Profiling Service Module uses fewer resources to run a profile on a relational data source than on a flat file data source.

The following are the CPU, memory, disk, and operating system requirements for the Profiling Service Module:

CPU

The Profiling Service Module uses less than 1 CPU.

Consider the following CPU requirements for different profile types:

Column profiles. Depends on the data source type:
    Relational systems. Requires less than one CPU for each Data Transformation Manager thread.
    Flat files. Requires approximately 2.3 CPUs for each Data Transformation Manager thread.
Key and functional dependency discovery. Requires one CPU for each Data Transformation Manager thread.
Join, foreign key, and overlap discovery. Requires two CPUs for each Data Transformation Manager thread.

When you calculate the number of CPUs required for Data Transformation Manager operations, round the total number up to the nearest integer. Disk space is a one-time cost when the Data Integration Service is installed. CPU overhead is minimal when the Data Integration Service is not running jobs.

Memory
No additional memory is required beyond the minimum needed to run the mapping.

Disk
No disk space is required.

Operating System

Use a 64-bit operating system, if possible, because a 64-bit system can handle memory sizes greater than 4 GB.

A 32-bit system works if the profiling parameters fit within the memory limitations of the system.
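
The per-thread CPU figures listed under CPU can be turned into a small calculation that sums the requirement for each Data Transformation Manager thread and rounds the total up to the nearest integer. This is a minimal sketch: the dictionary keys and the `estimate_cpus` function are hypothetical names, and because the guide says only "less than one CPU" for relational column profiles, the sketch uses 1.0 as a conservative upper bound.

```python
import math

# CPUs for each Data Transformation Manager thread, by profile type.
CPUS_PER_DTM_THREAD = {
    "column_profile_relational": 1.0,   # ASSUMPTION: upper bound for "less than one"
    "column_profile_flat_file": 2.3,
    "key_or_functional_dependency_discovery": 1.0,
    "join_foreign_key_or_overlap_discovery": 2.0,
}


def estimate_cpus(threads_by_profile_type: dict) -> int:
    """Sum the CPU needs of all threads and round up to a whole CPU."""
    total = sum(CPUS_PER_DTM_THREAD[profile_type] * threads
                for profile_type, threads in threads_by_profile_type.items())
    # Round away floating-point noise before taking the ceiling.
    return math.ceil(round(total, 6))


# Three flat file column profile threads plus one join discovery thread:
# 3 * 2.3 + 1 * 2.0 = 8.9, which rounds up to 9.
print(estimate_cpus({
    "column_profile_flat_file": 3,
    "join_foreign_key_or_overlap_discovery": 1,
}))
```

Rounding happens once on the total rather than per thread, which matches the instruction to round the total number up and avoids overestimating when many threads run.
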

