Profiling and Discovery Sizing Guidelines


Data Sizes and Profiling Functions Overview

The distribution and correlation of source data values usually determine how the profile algorithms perform. This section contains general guidelines, so specific profiling jobs might not conform to them.

Number of Rows: Linear scaling has an effect of "2X": the profile algorithm takes twice as long to run a profile if you double the number of rows in a data source. Non-linear scaling has an effect greater than "2X."
The following table summarizes how doubling the number of rows affects the run time of each profiling job type:

Profile Job Type	Effect	Description
Column profile	>= 2X	Sorting is the major component, so scaling is not exactly 2X. However, for certain use cases, other components bring it closer to linear scaling.
Data domain discovery	>= 2X	For flat file data sources, scaling is exactly 2X. However, as with a column profile, scaling might be more than 2X.
Key discovery	~ 2X	Usually, scaling is linear. However, scaling depends on the data and the complexity of relationships in the data.
Functional dependency discovery	~ 2X	Usually, scaling is linear. However, scaling depends on the data and the complexity of relationships in the data.
Overlap discovery	Step 1. 2X Step 2. Constant	The first step, computing the signatures, is directly proportional to the number of rows. The second step takes the same amount of time regardless of the number of rows.
Foreign key discovery	Step 1. 2X Step 2. Constant	The first step, computing the signatures, is directly proportional to the number of rows. The second step takes the same amount of time regardless of the number of rows.
Enterprise discovery	~ 2X	Enterprise discovery is a mixture of column profiling, data domain discovery, key discovery, and foreign key discovery. Enterprise discovery scales as the average of these functions.
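
The row-scaling effects above can be illustrated with a rough cost model. The following Python sketch is not the product's algorithm. It assumes that a sort-dominated job costs about n * log2(n) for n rows, and that a two-step job has a linear first step and a constant second step. The function names and the timing inputs are hypothetical.

```python
import math


def sort_cost(rows: int) -> float:
    """Assumed cost of a comparison sort over `rows` rows: n * log2(n)."""
    return rows * math.log2(rows)


def doubling_factor(rows: int) -> float:
    """How much the assumed sort cost grows when the row count doubles."""
    return sort_cost(2 * rows) / sort_cost(rows)


def two_step_time_after_doubling(step1: float, step2: float) -> float:
    """Assumed two-step job: step 1 is linear in rows, step 2 is constant."""
    return 2 * step1 + step2


if __name__ == "__main__":
    # The factor stays slightly above 2 and approaches 2 as the source grows.
    for rows in (1_000, 1_000_000, 1_000_000_000):
        print(rows, round(doubling_factor(rows), 3))

    # Hypothetical timings in minutes: 30 for signatures, 5 for comparison.
    print(two_step_time_after_doubling(30.0, 5.0))
```

Under these assumptions, a sort-dominated job lands slightly above 2X, which matches the ">= 2X" entries, while a two-step job lands below 2X overall because only the first step grows.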

Number of Columns: Linear scaling has an effect of "2X": the profile algorithm usually takes twice as long to run a profile if you double the number of columns in a data source. Non-linear scaling has an effect greater than "2X."
The following table summarizes how doubling the number of columns affects the run time of each profiling job type:

Profile Job Type	Effect	Description
Column profile	2X	Columns are independent of each other.
Data domain discovery	2X	Columns are independent of each other.
Key discovery	2 to the power of X	Sometimes, key discovery must compare all combinations of columns to find the keys. The effect is exponential in the number of columns.
Functional dependency discovery	2 to the power of X	Sometimes, functional dependency discovery must compare all combinations of columns to find the dependencies. The effect is exponential in the number of columns.
Overlap discovery	Step 1. 2X Step 2. X to the power of 2	The first step, computing the signatures, is directly proportional to the number of columns and scales linearly. The second step compares every column with every other column and scales as the square of the number of columns. The second step runs faster than the first step.
Foreign key discovery	Step 1. 2X Step 2. Constant	The first step, computing the signatures, is directly proportional to the number of columns and scales linearly. The second step compares every column with every other column and scales as the square of the number of columns. The second step runs faster than the first step.
Enterprise discovery	~ 2X	Enterprise discovery is a mixture of column profiling, data domain discovery, key discovery, and foreign key discovery. Enterprise discovery scales as the average of these functions.
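
To make the difference between exponential and quadratic column scaling concrete, the following Python sketch counts the worst-case work items for each. It is an illustration under stated assumptions, not the product's implementation: it assumes key discovery might examine every non-empty combination of columns, and that the comparison step of overlap discovery examines every unordered pair of columns. The column names are hypothetical.

```python
from itertools import combinations


def candidate_column_sets(columns):
    """All non-empty column combinations: 2**n - 1 sets for n columns."""
    return [
        combo
        for size in range(1, len(columns) + 1)
        for combo in combinations(columns, size)
    ]


def column_pairs(columns):
    """All unordered column pairs: n * (n - 1) / 2 pairs for n columns."""
    return list(combinations(columns, 2))


if __name__ == "__main__":
    four = ["a", "b", "c", "d"]
    eight = four + ["e", "f", "g", "h"]

    for cols in (four, eight):
        print(
            len(cols),
            len(candidate_column_sets(cols)),
            len(column_pairs(cols)),
        )
```

Doubling from four to eight columns raises the combination count from 15 to 255, while the pair count rises only from 6 to 28. This is why the worst case for key discovery and functional dependency discovery grows much faster than the pairwise comparison step.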

