Profiling and Discovery Sizing Guidelines

Profiling Warehouse Guidelines for Column Profiling
The profiling warehouse stores profiling results. More than one Profiling Service Module may point to the same profiling warehouse. The main resource for the profiling warehouse is disk space.
Column profiling stores the following types of results in the profiling warehouse:
  • Statistical and bookkeeping data
  • Value frequencies
  • Staged data
Statistical and Bookkeeping Data Guidelines
Each column contains a set of statistics, such as the minimum and maximum values. It also contains a set of tables that store bookkeeping data, such as profile ID. These take up very little space and you can exclude them from disk space calculations. Consider the disk requirement to be effectively zero.
Value Frequency Calculation Guidelines
Value frequencies are a key element in profile results. Value frequencies list the unique values in a column along with a count of the occurrences of each value.
Columns with low cardinality have few values. Columns with high cardinality can have millions of values. The Profiling Service Module limits the number of unique values it identifies to 16,000 by default. You can change this value.
Use this formula to calculate disk size requirements:
number of columns X number of values X (average value size + 64)
where
  • Number of columns is the sum of columns and virtual columns in the profile run.
  • Number of values is the number of unique values per column, capped at 16,000 by default. If you know the average number of unique values in each column, use that value instead.
  • Average value size includes Unicode encoding of characters.
  • The constant 64 bytes accounts for 8 bytes for the frequency and 56 bytes for the key.
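As a sketch, the formula can be expressed as a small function. The column count, value count, and average value size below are hypothetical inputs for illustration, not figures from this document:

```python
def value_frequency_bytes(num_columns, num_values, avg_value_size):
    """Estimate disk space for value frequency results, in bytes.

    num_columns: columns plus virtual columns in the profile run.
    num_values: unique values per column (16,000 by default).
    avg_value_size: average value size in bytes, including Unicode encoding.
    The constant 64 covers 8 bytes for the frequency and 56 for the key.
    """
    return num_columns * num_values * (avg_value_size + 64)

# Hypothetical run: 40 columns, the default 16,000 values, 14-byte values.
print(value_frequency_bytes(40, 16_000, 14))  # bytes
```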
Staged Data Guidelines
Staged data, or cached data, is a copy of the source data that the Data Integration Service uses for drill-down operations. Staged data might use a large amount of disk space, depending on the type of data source.
Use the following formula to calculate the disk size requirements for staged data:
number of rows X number of columns X (average value size + 24)
The constant 24 bytes is the cache key size.
Sum the results of this calculation for all cached tables.
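The staged data formula can likewise be sketched as a function and summed over all cached tables. The table shapes and value sizes below are hypothetical, not figures from this document:

```python
def staged_data_bytes(num_rows, num_columns, avg_value_size):
    """Estimate disk space for one cached table, in bytes.

    The constant 24 is the cache key size in bytes.
    """
    return num_rows * num_columns * (avg_value_size + 24)

# Hypothetical cached tables: (rows, columns, average value size in bytes).
tables = [(100_000_000, 80, 17), (5_000_000, 20, 12)]
total_bytes = sum(staged_data_bytes(r, c, s) for r, c, s in tables)
print(total_bytes)
```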
The following table lists the required disk space for an 80-column table with 100 million rows and an equal mixture of high-cardinality and low-cardinality columns:

Type of Data            Disk Space
Value frequency data    50 MB
Cached data             327,826 MB
Total                   327,876 MB
The Data Integration Service stages the source data when you choose to cache data. If you do not cache data for drill-down, the disk space requirement is significantly lower. All profiles store the value frequencies.

Memory and CPU Needs

The profiling warehouse does not have significant memory requirements.
Memory
The queries run by the Profiling Service Module do not use significant amounts of memory. Use the manufacturer's recommendations based on the table sizes.
CPU
You can use the following CPU recommendations for the profiling warehouse:
  • 1 CPU for each concurrent profile job. This applies to each relational database or flat file profile job, not to each profile mapping.
  • 2 CPUs for each concurrent profile job if the data is cached.
