Profiling and Discovery Sizing Guidelines


Profile Deployment

When you deploy profiles, plan the resources for both the development environment and the production environment. The Profiling Service Module has a set of parameters that control the performance of a profiling job. You must configure these parameters for each deployment.

When you plan a profile deployment, you need to consider the profile job type, response time, user type, and data sources.

The following categories determine the system resource recommendations:

Resource guidelines for the Profiling Service Module and the Data Integration Service, including memory, disk space, and CPU usage.

Resource guidelines for column profiling, key discovery, functional dependency discovery, foreign key discovery, and overlap discovery based on the data source types and hardware capacity.

Profile Job Type

You can have multiple profile jobs when you run a profile on a data source. Each profile operation uses a different combination of resources. The mix of profile jobs determines the resource requirements. You need to balance the performance goals and resource costs effectively to optimize the deployment.

The following table summarizes the relative use of resources by each profile job type and data source:

Profile Operation	Data Source Type	CPU	Memory	Disk Space	RDBMS	Profiling Warehouse
Column Profile	Flat File	Medium	Low	Medium	None	Medium
Column Profile	Relational	Low	Low	None	High	Medium
Data Domain Discovery	Flat File	High	Low	Medium	None	Low
Data Domain Discovery	Relational	Medium	Low	None	High	Low
Key Discovery	-	Low	High	High	None	Low
Functional Dependency Discovery	-	Low	High	High	None	Low
Overlap Discovery	-	High	Low	None	None	Low
Foreign Key Discovery	-	High	Low	None	None	Low
Enterprise Discovery	Flat File	High	High	High	None	High
Enterprise Discovery	Relational	High	High	High	High	High
Reporting or Viewing Results	-	Low	None	None	None	Low
Drilldown	Flat File	Low	None	None	None	None
Drilldown	Relational	Low	None	None	Low	None
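One way to read the table for a mix of jobs is to take, for each resource, the highest relative level that any job in the mix needs. The following is a minimal sketch of that reading, not a sizing formula from the product. The names `USAGE` and `peak_usage` are hypothetical, and a source type of `None` stands for the "-" entries in the table.

```python
from typing import Dict, Iterable, Optional, Tuple

LEVELS = ("None", "Low", "Medium", "High")  # ordered from lowest to highest
RESOURCES = ("CPU", "Memory", "Disk Space", "RDBMS", "Profiling Warehouse")

# Rows copied from the table above; a source type of None stands for "-".
USAGE: Dict[Tuple[str, Optional[str]], Tuple[str, ...]] = {
    ("Column Profile", "Flat File"): ("Medium", "Low", "Medium", "None", "Medium"),
    ("Column Profile", "Relational"): ("Low", "Low", "None", "High", "Medium"),
    ("Data Domain Discovery", "Flat File"): ("High", "Low", "Medium", "None", "Low"),
    ("Data Domain Discovery", "Relational"): ("Medium", "Low", "None", "High", "Low"),
    ("Key Discovery", None): ("Low", "High", "High", "None", "Low"),
    ("Functional Dependency Discovery", None): ("Low", "High", "High", "None", "Low"),
    ("Overlap Discovery", None): ("High", "Low", "None", "None", "Low"),
    ("Foreign Key Discovery", None): ("High", "Low", "None", "None", "Low"),
    ("Enterprise Discovery", "Flat File"): ("High", "High", "High", "None", "High"),
    ("Enterprise Discovery", "Relational"): ("High", "High", "High", "High", "High"),
    ("Reporting or Viewing Results", None): ("Low", "None", "None", "None", "Low"),
    ("Drilldown", "Flat File"): ("Low", "None", "None", "None", "None"),
    ("Drilldown", "Relational"): ("Low", "None", "None", "Low", "None"),
}


def peak_usage(jobs: Iterable[Tuple[str, Optional[str]]]) -> Dict[str, str]:
    """Return, per resource, the highest relative level among the given jobs."""
    peak = {resource: "None" for resource in RESOURCES}
    for operation, source_type in jobs:
        key = (operation, source_type)
        # Operations listed with "-" use the same row for every source type.
        row = USAGE[key] if key in USAGE else USAGE[(operation, None)]
        for resource, level in zip(RESOURCES, row):
            if LEVELS.index(level) > LEVELS.index(peak[resource]):
                peak[resource] = level
    return peak


if __name__ == "__main__":
    mix = [("Column Profile", "Relational"), ("Key Discovery", None)]
    for resource, level in peak_usage(mix).items():
        print(f"{resource}: {level}")
```

The peak level is only a rough indicator. It does not account for how many jobs of each type run concurrently.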

Response Time

The speed of a profile job run depends on the type of the profile job and resource types that the profile job uses. Most of the algorithms benefit from faster CPUs and memory because the operating system can use memory in different ways including caching data.

If the profile job uses multithreaded algorithms, you can add CPU cores to improve the response time. Some algorithms perform better with faster or additional temporary disk space.

The network speed is critical when the Data Integration Service queries or writes data to the profiling warehouse on another machine. The network speed is also important when the Data Integration Service running on one machine pushes queries to the RDBMS on another machine.

The following table summarizes, for each profile job type, the Data Integration Service resource types that improve response time when you add more or better resources of that type:

Profile Job Type	Faster CPU	Cores	Memory	Disk	Network
Column Profile	Yes	Yes	No	Yes	Yes
Data Domain Discovery	Yes	Yes	No	Yes	Yes
Key Discovery	Yes	No	Yes	Yes	No
Functional Dependency Discovery	Yes	No	Yes	Yes	No
Overlap Discovery	Yes	Yes	No	No	No
Foreign Key Discovery	Yes	Yes	No	No	No
Enterprise Discovery	Yes	Yes	Yes	Yes	Yes
Reporting or Viewing Results	Yes	No	No	No	Yes
Drilldown	Yes	No	No	No	Yes
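The table can also be used to rank upgrade options for a given workload. The following sketch counts, for each resource type, how many of the job types in a workload are marked "Yes" in the table. It is an illustration only, and the names `BENEFITS` and `upgrade_counts` are hypothetical.

```python
from typing import Dict, Iterable, Set

RESOURCE_TYPES = ("Faster CPU", "Cores", "Memory", "Disk", "Network")

# Resource types marked "Yes" in the table above, per profile job type.
BENEFITS: Dict[str, Set[str]] = {
    "Column Profile": {"Faster CPU", "Cores", "Disk", "Network"},
    "Data Domain Discovery": {"Faster CPU", "Cores", "Disk", "Network"},
    "Key Discovery": {"Faster CPU", "Memory", "Disk"},
    "Functional Dependency Discovery": {"Faster CPU", "Memory", "Disk"},
    "Overlap Discovery": {"Faster CPU", "Cores"},
    "Foreign Key Discovery": {"Faster CPU", "Cores"},
    "Enterprise Discovery": {"Faster CPU", "Cores", "Memory", "Disk", "Network"},
    "Reporting or Viewing Results": {"Faster CPU", "Network"},
    "Drilldown": {"Faster CPU", "Network"},
}


def upgrade_counts(job_types: Iterable[str]) -> Dict[str, int]:
    """Count how many of the given job types benefit from each resource type."""
    counts = {resource: 0 for resource in RESOURCE_TYPES}
    for job_type in job_types:
        for resource in BENEFITS[job_type]:
            counts[resource] += 1
    return counts


if __name__ == "__main__":
    workload = ["Key Discovery", "Overlap Discovery"]
    for resource, count in upgrade_counts(workload).items():
        print(f"{resource}: helps {count} of {len(workload)} job types")
```

A count treats every job type as equally important. Weight the result by how often each job type runs if the workload is uneven.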

User Types

The profile workload, including system-generated profile jobs such as periodic scorecard runs, depends on the number and type of users. As the number of users increases, more profile jobs run concurrently. The number of concurrent jobs indicates the average range of profiling jobs of each profile type that can run successfully with the specified number of cores and amount of memory. The types of profiling jobs that you need to plan for depend on the type of user and the available resources.

Each user type might generate the following profile jobs:

Informatica Analyst user: Submits profile jobs, such as profile run, scorecard run, and drill-down jobs.
Informatica Developer user: Runs all the profile job types including enterprise discovery. In the Developer tool, the profile job type depends on the project.
infacmd command line utility user: Schedules scorecard runs, but can also run all profile job types.

Pushdown Optimization for Data Sources

The effective use of the computing resource allocation depends on the data source type. When you run a profile on a relational source, the Profiling Service Module can transfer some of the profiling logic to the data source. The source system must be able to accommodate the additional workload. When you run a profile on a non-relational data source, the Profiling Service Module must compute the profiling job in the Data Integration Service. In that case, you can allocate all the computing resources to the system that runs the Informatica application. The pushdown of the processing logic also depends on the rule type and profile type.

The following guidelines determine the pushdown optimization for column profiles and rules:

Pushdown optimization applies only to physical data sources.

Pushdown optimization applies only to the following rules:

Rules containing a single expression transformation or internal expression rule with a single Boolean output port type.

Reusable validation rules that contain a single validation expression transformation.

Rules created in the Analyst tool.

Pushdown optimization does not apply to the following data objects:

Logical data object and mapping specification

Pushdown optimization does not apply to the profiling logic. However, the Data Integration Service machine can optimize the logical data object and mapping specification mappings and push down parts of the mappings before the Data Integration Service applies the profiling logic.

Mapping specification

Flat file

Mainframe source

Pushdown optimization does not apply to the following rules:

Rules with multiple transformations.

Rules with a single, non-Boolean output port.

Reusable rules.

Rules that contain the IIF(), Ltrim(), or Rtrim() function.

Pushdown optimization does not apply to columns with the Date data type.
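The exclusions above can be sketched as a simple check that lists the reasons a rule on a given source would not be pushed down. This is an illustration of the guidelines only, not how the Profiling Service Module decides. `RuleInfo`, its fields, and `pushdown_blockers` are hypothetical names. The sketch also assumes that a reusable rule is exempt from the "Reusable rules" exclusion when it contains a single validation expression transformation, which is one reading of the two lists above.

```python
from dataclasses import dataclass, field
from typing import List

# Data objects listed above as not eligible for pushdown optimization.
NON_PUSHDOWN_SOURCES = {
    "logical data object",
    "mapping specification",
    "flat file",
    "mainframe source",
}
BLOCKED_FUNCTIONS = {"IIF", "LTRIM", "RTRIM"}


@dataclass
class RuleInfo:
    """Hypothetical description of a rule, for illustration only."""
    transformation_count: int = 1
    output_port_types: List[str] = field(default_factory=lambda: ["boolean"])
    reusable: bool = False
    single_validation_expression: bool = False  # assumed exemption for reusable rules
    functions: List[str] = field(default_factory=list)


def pushdown_blockers(source_type: str, rule: RuleInfo, column_type: str = "") -> List[str]:
    """Return the guideline exclusions that apply; an empty list means none apply."""
    reasons = []
    if source_type.lower() in NON_PUSHDOWN_SOURCES:
        reasons.append(f"data object type '{source_type}' is excluded")
    if rule.transformation_count > 1:
        reasons.append("rule has multiple transformations")
    ports = rule.output_port_types
    if len(ports) == 1 and ports[0].lower() != "boolean":
        reasons.append("rule has a single, non-Boolean output port")
    if rule.reusable and not rule.single_validation_expression:
        reasons.append("rule is reusable")
    used = {name.upper() for name in rule.functions} & BLOCKED_FUNCTIONS
    if used:
        reasons.append("rule uses " + ", ".join(sorted(used)))
    if column_type.lower() == "date":
        reasons.append("column has the Date data type")
    return reasons


if __name__ == "__main__":
    rule = RuleInfo(functions=["ltrim"])
    print(pushdown_blockers("Flat File", rule))
```

Returning reasons instead of a single Boolean makes it easier to see which guideline prevents pushdown for a particular rule.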

The Profiling Service Module pushes the value frequency computation and rule logic to the data source for column profiles, data domain discovery profiles, and enterprise discovery profiles. The Profiling Service Module pushes the filter logic to the data source for key discovery and functional dependency discovery on a single table, and for overlap discovery and foreign key discovery on multiple tables.

If a column profile run does not push down the value frequencies, the Data Integration Service does not push down the rules.

The following table summarizes the resource allocation between the Profiling Service Module and data source system based on the pushdown of the processing logic:

Profile Job Type	Pushdown	Database	Profiling Service Module
Column Profile	Yes	Medium	Medium
Column Profile	No	None	High
Data Domain Discovery	Yes	Medium	Medium
Data Domain Discovery	No	None	High
Key Discovery	Yes	None	High
Key Discovery	No	None	High
Functional Dependency Discovery	Yes	None	High
Functional Dependency Discovery	No	None	High
Overlap Discovery	Yes	None	High
Overlap Discovery	No	None	High
Foreign Key Discovery	Yes	None	High
Foreign Key Discovery	No	None	High
Enterprise Discovery	Yes	Medium	Medium
Enterprise Discovery	No	None	High
Reporting or Viewing Results	Yes	None	Medium
Reporting or Viewing Results	No	None	High
Drilldown	Yes	Medium	Low
Drilldown	No	None	High
