Profiling and Discovery Sizing Guidelines


Hardware Guidelines for Enterprise Discovery

The Profiling Service Module processes a sample of each data source to infer keys and functional dependencies. The bandwidth requirement for flat files and relational databases is comparatively low because the data size is usually small.
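As a rough illustration of what inferring keys and functional dependencies from a sample can mean, the sketch below checks uniqueness and value-to-value consistency on a handful of rows. It is not the Profiling Service Module's algorithm; the function names `candidate_keys` and `holds_dependency` and the sample data are hypothetical.

```python
from itertools import combinations


def candidate_keys(rows, columns, max_width=2):
    """Return minimal column combinations whose values are unique in the sample."""
    found = []
    for width in range(1, max_width + 1):
        for combo in combinations(columns, width):
            # A superset of a key that was already found is not minimal, so skip it.
            if any(set(key) <= set(combo) for key in found):
                continue
            distinct = {tuple(row[c] for c in combo) for row in rows}
            if len(distinct) == len(rows):
                found.append(combo)
    return found


def holds_dependency(rows, lhs, rhs):
    """Check whether column lhs determines column rhs in every sampled row."""
    mapping = {}
    for row in rows:
        # setdefault stores the first rhs value seen for each lhs value.
        if mapping.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


# Hypothetical sample rows, for illustration only.
sample = [
    {"order_id": 1, "customer": "A", "region": "EU"},
    {"order_id": 2, "customer": "B", "region": "US"},
    {"order_id": 3, "customer": "A", "region": "EU"},
]
keys = candidate_keys(sample, ["order_id", "customer", "region"])
```

A result that holds on a sample is only a candidate: a column that is unique in the sampled rows can still contain duplicates in the full source.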

Enterprise discovery is a process that systematically runs profiles in multiple ways across multiple sources. It includes the following profile types: column profile, data domain discovery, primary key discovery, and foreign key discovery. Enterprise discovery first runs a profile on each data source and then runs foreign key discovery across all sources. The resources that it requires depend on the mix of column profiles, data domain discovery, and primary key discovery that run together, which in turn depends on the value you configure for the Max Concurrent Profiling Jobs parameter.
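The two-phase flow described above can be sketched as a bounded worker pool followed by a single cross-source step. This is a conceptual model only, not how the Data Integration Service schedules jobs; `profile_source`, `foreign_key_discovery`, and the constant `MAX_CONCURRENT_PROFILING_JOBS` are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the configured Max Concurrent Profiling Jobs value (assumed here).
MAX_CONCURRENT_PROFILING_JOBS = 2


def profile_source(name):
    """Placeholder for the per-source profiles run in the first phase."""
    return {"source": name, "profiles": ["column", "data_domain", "primary_key"]}


def foreign_key_discovery(per_source_results):
    """Placeholder for the second phase, which needs results from every source."""
    return sorted(result["source"] for result in per_source_results)


def run_enterprise_discovery(sources):
    # Phase 1: at most MAX_CONCURRENT_PROFILING_JOBS per-source jobs run at once.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_PROFILING_JOBS) as pool:
        per_source = list(pool.map(profile_source, sources))
    # Phase 2: starts only after every per-source profile has finished.
    return per_source, foreign_key_discovery(per_source)


per_source, fk_scope = run_enterprise_discovery(["orders", "customers", "items"])
```

In this model, raising the concurrency limit lets more per-source jobs overlap, so peak CPU, memory, and temporary disk demand grows with the limit and with the profile types that happen to run together.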

The primary use case for enterprise discovery is to run a profile on relational sources. You might also need to consider a secondary use case for flat file sources and other non-relational sources.

Relational Sources: When multiple data sources in enterprise discovery are relational sources, the Profiling Service Module pushes some parts of the column profile computation and data domain computation to the database. The Profiling Service Module itself performs all the computation for primary key discovery and foreign key discovery.
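To make the idea of pushing column profile computation to the database concrete, the sketch below computes basic column statistics with a single aggregate query, so only one summary row leaves the database. It uses an in-memory SQLite database purely for illustration; the SQL that the Profiling Service Module actually generates is not shown in this document, and the table, column, and function names are hypothetical.

```python
import sqlite3

# Hypothetical source table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "DE"), (2, "US"), (3, "DE"), (4, None)],
)


def pushdown_column_profile(conn, table, column):
    """Compute basic column statistics inside the database with one query."""
    # Identifiers are interpolated directly, so pass trusted names only.
    sql = (
        f'SELECT COUNT(*), COUNT("{column}"), COUNT(DISTINCT "{column}"), '
        f'MIN("{column}"), MAX("{column}") FROM "{table}"'
    )
    total, non_null, distinct, lowest, highest = conn.execute(sql).fetchone()
    return {
        "rows": total,
        "nulls": total - non_null,  # COUNT(column) ignores NULL values
        "distinct": distinct,
        "min": lowest,
        "max": highest,
    }


profile = pushdown_column_profile(conn, "customers", "country")
```

Because the aggregation runs where the data lives, the machine that issues the query does not need to buffer the rows, which is consistent with the minimal memory requirement noted for these profile types.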

Consider the following hardware configuration for the relational sources:

Component	Requirement
CPU	Data domain discovery performs better if the Data Integration Service machine has multiple CPU cores or processing threads. If primary key discovery produces large intermediate results, it performs better on a Data Integration Service machine that has a faster CPU and faster disk access.
Memory	Memory requirements for profile jobs other than primary key discovery are minimal because these jobs do not buffer data. Primary key discovery requires some memory on the Data Integration Service machine. You can add memory to speed up the pushdown operations for column profiles and data domain discovery.
Disk	If primary key discovery produces large intermediate profile results, the profile jobs use some temporary disk space. Primary key profile jobs perform better if the Data Integration Service machine has fast access to two or more physical disks. Column profiles and data domain discovery do not use temporary disk space.

Non-relational Sources: When you run enterprise discovery on non-relational sources, the Profiling Service Module performs all the profile computations. The CPU, memory, and temporary disk requirements depend on the specific profile function that enterprise discovery runs. Primary key discovery and foreign key discovery consume temporary disk space.

