Profiling and Discovery Sizing Guidelines

High Volume Sources for Data Domain Discovery

High volume sources contain more than 100,000 rows. To reduce aggregate computation in the data domain rules, the Profiling Service Module generates value frequencies for each column and runs the frequencies through the data domain rules. When you perform data domain discovery on relational data sources, the Profiling Service Module pushes this computation to the database. When you run data domain discovery on flat files, the Profiling Service Module performs the computation itself.
Flat Files
The Profiling Service Module generates one mapping that contains two columns from the data source and up to five data domains. This ratio balances the number of columns analyzed in each mapping against the processing required to run the data domain rules on those columns. By default, the Profiling Service Module runs these mappings sequentially. To run the mappings in parallel, set the Maximum Concurrent Profiling Threads parameter to a higher value.
Consider the following hardware requirements for flat files:
CPU
The CPU usage is about 2.3 CPUs for each mapping, which is the same as for a column profile. The CPU usage can increase based on the complexity of the data domains.
Memory for Mapping
Each mapping requires at least 132 MB: a base of 128 MB for the data domain discovery mapping, plus 2 MB of buffer memory for each of the two columns in the mapping. If these buffers overflow, data domain discovery uses a secondary buffer of 64 MB, which brings the total to 196 MB. The requirement can be higher because the transformations in the data domain rules might need additional memory.
Disk
A data domain discovery mapping might need disk space to perform profiling computations.
The following formula calculates the disk space for a single mapping:
Number of columns for each mapping X maximum number of rows X ((2 bytes for each character X maximum string size in characters) + frequency bytes)
where
  • Number of columns for each mapping is 2.
  • Maximum number of rows is the maximum number of rows in any flat file.
  • Two bytes for each character is the typical number of bytes for a single Unicode character.
  • Maximum string size in characters is the maximum number of characters in any column in any flat file, or 255, whichever is less.
  • Frequency bytes is 4. Frequency bytes store the frequency calculation during the analysis.
Perform the calculation and distribute the disk space across one or more physical disks. Use one disk for each mapping, up to a maximum of four disks.
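As a rough sizing aid, the memory and disk formulas above can be sketched as a short Python calculation. The constants come from the guidelines; the row count and string size in the example are hypothetical inputs, not recommendations.

```python
# Sizing sketch for flat-file data domain discovery mappings.
# Constants from the guidelines above.
BYTES_PER_CHAR = 2        # typical size of one Unicode character
FREQUENCY_BYTES = 4       # stores the frequency calculation
COLUMNS_PER_MAPPING = 2   # each generated mapping contains two columns
MAX_STRING_CHARS = 255    # string sizes are capped at 255 characters

def flat_file_disk_bytes(max_rows, max_string_chars):
    """Disk space, in bytes, for a single data domain discovery mapping."""
    string_chars = min(max_string_chars, MAX_STRING_CHARS)
    return COLUMNS_PER_MAPPING * max_rows * (
        BYTES_PER_CHAR * string_chars + FREQUENCY_BYTES
    )

def flat_file_memory_mb(buffers_overflow=False):
    """Memory, in MB, for a single mapping: 128 MB base plus 2 MB of
    buffer memory per column, plus a 64 MB secondary buffer on overflow."""
    total = 128 + COLUMNS_PER_MAPPING * 2
    if buffers_overflow:
        total += 64
    return total

# Hypothetical example: 5 million rows, longest string column 100 characters.
print(flat_file_disk_bytes(5_000_000, 100) / 1024**3)  # disk per mapping, in GB
print(flat_file_memory_mb(buffers_overflow=True))      # 196 MB per mapping
```

Multiply the per-mapping figures by the number of mappings that run concurrently when you size the host.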
Relational Sources
The Profiling Service Module usually generates one mapping that contains a single column from the relational table and up to five data domains. Sometimes, the Profiling Service Module might generate multiple mappings for the same column.
Each mapping pushes the value frequency query to the database to minimize the number of redundant values. This pushdown method avoids duplicate computation in the data domain rules. The Profiling Service Module runs five of these mappings in parallel. The Maximum DB Connections parameter controls the number of parallel mappings.
Consider the following hardware requirements for relational sources:
CPU
The CPU usage is about 1 CPU for each mapping. The CPU usage can increase based on the complexity of the data domains.
Memory for Mapping
As with low volume processing, each mapping requires at least 128 MB, not counting the data domain rules. The memory requirement for each data domain rule can be higher because the transformations in the data domain rules might need additional memory.
Disk
A data domain discovery mapping might need disk space for temporary tablespace on the database machine to perform profiling computations.
The following formula calculates the temporary tablespace for a single mapping:
maximum number of rows X (maximum column size + frequency bytes)
where
  • maximum number of rows is the maximum number of rows in any table.
  • maximum column size is the size, in bytes, of the largest table column that you can run a profile on. Columns with very large data types that you cannot profile, such as CLOB, do not count. The column size must take the character encoding, such as Unicode or ASCII, into account.
  • frequency bytes is 4 or 8, the default size that the database uses for the result of COUNT(*). Frequency bytes store the frequency during the analysis.
Compute the disk space and assign the temporary tablespace across one or more physical disks. Use one disk for each mapping, up to a maximum of four disks.
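The temporary tablespace formula can be sketched the same way. The row count and column size in the example are hypothetical inputs.

```python
# Sizing sketch for the temporary tablespace that one relational
# data domain discovery mapping can consume on the database machine.

def temp_tablespace_bytes(max_rows, max_column_bytes, frequency_bytes=8):
    """maximum number of rows X (maximum column size + frequency bytes).
    frequency_bytes is 4 or 8, depending on the size the database
    uses for the result of COUNT(*)."""
    return max_rows * (max_column_bytes + frequency_bytes)

# Hypothetical example: 20 million rows, widest profiled column 400 bytes
# (for example, a 200-character column in a two-byte Unicode encoding).
print(temp_tablespace_bytes(20_000_000, 400) / 1024**3)  # tablespace, in GB
```

Because the Profiling Service Module runs several of these mappings in parallel, as controlled by the Maximum DB Connections parameter, multiply the result by the number of concurrent mappings when you size the temporary tablespace.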
