Profiling and Discovery Sizing Guidelines

Flat File and Mainframe Sources for Column Profiles

When you run a profile on a flat file, the Profiling Service Module divides the job into multiple mappings that infer the metadata for the columns and virtual columns. Each mapping can run serially, or two or more mappings can run in parallel. In addition, the Profiling Service Module generates another type of mapping to cache the source data. This mapping always runs in parallel with the column profile mappings because it takes longer than a column profile mapping.
When you run a profile on a mainframe data source, the Profiling Service Module groups as many columns as possible into a single mapping. This grouping minimizes the number of table scans on the data source. Mainframe data sources require more disk space than flat files to store the temporary computations.
You can compute the total resources required by column profiles after you consider the following requirements:
Column Profile Mapping Requirements
The CPU, memory, and disk space requirements for a column profile mapping are as follows:
Component
Requirements
CPU
Column profiling consumes approximately 2.3 CPUs for each mapping. When you calculate the number of CPUs you need, round up the total to the nearest integer.
Memory for Mappings
The Profiling Service Module uses two methods for profile mappings. First, it applies a method that requires approximately 2 MB of memory for each column. If the first method does not work, it uses the second method of sorting columns with a buffer of 64 MB.
For a mapping with the default five columns, the minimum memory required is 10 MB (2 MB X 5 columns). The maximum memory required is 72 MB: a 64 MB buffer for one high-cardinality column plus 8 MB (2 MB X 4 columns) for the remaining four low-cardinality columns.
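The memory arithmetic above can be checked with a short sketch. The function and parameter names are hypothetical and not part of any Informatica API. The sketch also assumes that every high-cardinality column falls back to its own 64 MB sort buffer, which the guideline states only for the single-column case.

```python
# Illustrative sketch only: the constants come from the guideline text,
# the names are hypothetical.
MB_PER_COLUMN = 2      # first method: about 2 MB per column
SORT_BUFFER_MB = 64    # second method: 64 MB sort buffer

def mapping_memory_mb(columns: int = 5, high_cardinality_columns: int = 0) -> int:
    """Estimate the memory in MB for one column profile mapping.

    Assumption: each high-cardinality column needs its own 64 MB buffer;
    all other columns need 2 MB each.
    """
    if not 0 <= high_cardinality_columns <= columns:
        raise ValueError("high_cardinality_columns must be between 0 and columns")
    low_cardinality_columns = columns - high_cardinality_columns
    return (low_cardinality_columns * MB_PER_COLUMN
            + high_cardinality_columns * SORT_BUFFER_MB)

print(mapping_memory_mb())                            # 10 (minimum)
print(mapping_memory_mb(high_cardinality_columns=1))  # 72 (maximum in the guideline)
```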
Memory for the Buffer Cache
The Profiling Service Module caches the flat file data as it reads the data. Profiling speed increases if the Profiling Service Module can cache all the file data.
The exception to using cache memory is when two or more mappings read a file concurrently. In this case, add 100 MB of memory. This enables the mappings to share the read operations and increase performance.
Disk
A profile mapping may need disk space to perform profiling computations. The following formula calculates the disk space for a single mapping:
2 X number of columns per mapping X maximum number of rows X ((2 bytes per character X maximum string size in characters) + frequency bytes)
where
  • 2 indicates two passes. Some analyses need two passes.
  • The default value for the number of columns for each mapping is 5.
  • Maximum number of rows is the maximum number of rows in any flat file.
  • 2 bytes per character is the typical number of bytes for a single Unicode character.
  • Maximum string size in characters is the maximum number of characters in any column in any flat file, or 255, whichever is less.
  • Frequency bytes value is 4. The frequency bytes store the frequency calculation during the analysis.
Perform the above calculation and allocate the disk space to one or more physical disks. Use one disk for each mapping, and use a maximum of four disks.
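As a worked example, the disk space formula can be written as a small function. This is a minimal sketch: the function name, parameter names, and sample row counts are hypothetical, and only the constants come from the formula above.

```python
# Illustrative sketch only: constants are taken from the disk space formula,
# the names and sample inputs are hypothetical.
PASSES = 2               # some analyses need two passes
BYTES_PER_CHAR = 2       # typical size of one Unicode character
FREQUENCY_BYTES = 4      # stores the frequency calculation
MAX_STRING_CHARS = 255   # string size is capped at 255 characters

def mapping_disk_bytes(max_rows: int, max_string_chars: int,
                       columns_per_mapping: int = 5) -> int:
    """Estimate the disk space in bytes for a single profile mapping."""
    string_chars = min(max_string_chars, MAX_STRING_CHARS)
    bytes_per_value = BYTES_PER_CHAR * string_chars + FREQUENCY_BYTES
    return PASSES * columns_per_mapping * max_rows * bytes_per_value

# 1,000,000 rows, longest column value 100 characters, default 5 columns:
print(mapping_disk_bytes(1_000_000, 100))  # 2040000000 bytes, about 2 GB

# A 300-character column is capped at 255 characters:
print(mapping_disk_bytes(1_000_000, 300))  # 5140000000 bytes
```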
Operating System
Use a 64-bit operating system to accommodate memory sizes greater than 4 GB. A 32-bit system works if the profiling parameter fits within the memory limitations of the system.
These guidelines cover the optimal flat file profiling case, which uses five columns for each mapping. In some cases, the Profiling Service Module must run the profile for more than five columns in one mapping, for example, when it runs a profile on mainframe data and the financial cost of accessing the data can be high.
Profile Cache Mapping Requirements
A profile cache mapping caches data to the profiling warehouse and has different resource requirements than a column profile mapping.
The CPU, memory, and disk space requirements for a profile cache mapping are as follows:
| Component | Requirements |
|-----------|--------------|
| CPU | The cache mapping requires approximately 1.5 CPUs. |
| Memory | The cache mapping requires no additional memory beyond the Data Transformation Manager thread memory. |
| Disk | The cache mapping requires no disk space. |
Aggregate Profile Mapping Resources
To compute the total resources required by profiling, add the profile mapping requirements to the cache mapping requirements.
Use the following formula to determine the total profiling resources:
(number of concurrent profile mappings X resources for each mapping) + cache mapping resources
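The aggregate formula can be sketched as follows. The function name and the per-mapping inputs are hypothetical. The sketch assumes that the CPU total is rounded up once, after the cache mapping is added, and it leaves out buffer cache memory. CPU counts are held in tenths so that the rounding does not depend on floating-point behavior.

```python
# Illustrative sketch only: the names are hypothetical, the CPU figures
# come from the requirement tables in this section.
PROFILE_MAPPING_CPU_TENTHS = 23   # about 2.3 CPUs per column profile mapping
CACHE_MAPPING_CPU_TENTHS = 15     # about 1.5 CPUs for the cache mapping

def total_profiling_resources(concurrent_mappings: int,
                              memory_mb_per_mapping: int,
                              disk_bytes_per_mapping: int) -> dict:
    """Apply: (concurrent mappings X resources per mapping) + cache mapping."""
    cpu_tenths = (concurrent_mappings * PROFILE_MAPPING_CPU_TENTHS
                  + CACHE_MAPPING_CPU_TENTHS)
    return {
        # Round the CPU total up to the next whole CPU.
        "cpus": (cpu_tenths + 9) // 10,
        # The cache mapping adds no memory beyond DTM thread memory.
        "memory_mb": concurrent_mappings * memory_mb_per_mapping,
        # The cache mapping requires no disk space.
        "disk_bytes": concurrent_mappings * disk_bytes_per_mapping,
    }

# Two concurrent mappings, 72 MB and 2,040,000,000 bytes each:
print(total_profiling_resources(2, 72, 2_040_000_000))
# {'cpus': 7, 'memory_mb': 144, 'disk_bytes': 4080000000}
```

In this example the raw CPU total is (2 X 2.3) + 1.5 = 6.1, which rounds up to 7.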
