Advanced options

You can configure the advanced options to detect outliers, infer date and time values, and set other profile-related parameters.
The following table lists the advanced options that you can configure for a profile:
Option
Description
Maximum Number of Value Frequency Pairs
Number of column values with the highest frequencies that appear in the profile results. Default is 500.
For example, if you set the value to 100, only the top 100 values appear in the profile results.
If you do not want to save the value frequency information of a profile in the profiling warehouse, set the value to 0.
Maximum Number of Patterns
Number of patterns with the highest number of occurrences that appear in the profile results. The rest of the patterns appear under the Others category under Patterns in the Results area. Default is 10.
For example, if you set the value to 3, the top 3 patterns appear with their statistics, and the rest of the patterns are consolidated under the Others category.
Pattern Threshold Percentage
Maximum percentage of values used to derive a pattern in the profile results. Default is 5.
For example, when you set the value to 4, patterns that account for 4% or more of the values appear individually with their statistics, and the rest of the patterns are consolidated under the Others category.
Infer Date and Time
Infers the date and time for a column of date or time data type. Default is Yes.
Detect Outliers
Detects pattern and value frequency outliers in the source object. Default is Yes.
Minimum Number of Rows for Split Process per Column
If the source object contains more rows than the minimum number of rows that you enter here, Data Profiling uses one subtask for each source column when the profile is run. Default is 100,000,000.
Maximum Number of Columns per Mapping
Number of columns for each mapping when the number of source rows is fewer than the Minimum Number of Rows for Split Process per Column value. Default is 50.
Maximum Memory per Mapping*
Maximum amount of memory that you want to allocate for each mapping. Default is 512 MB.
Default Buffer Block Size
Size of buffer blocks used to move data blocks from sources to targets. Default is Auto.
Enter one of the following options:
  • Auto. Uses automatic memory settings. When you use Auto, configure Maximum Memory per Mapping.
  • A numeric value. Enter the numeric value that you want to use. The default unit of measure is bytes. Append KB, MB, or GB to the value to specify a different unit of measure. For example, 512MB.
DTM Buffer Size
Amount of memory allocated to the task from the DTM process. Default is Auto.
By default, a minimum of 12 MB is allocated to the buffer at run time.
Use one of the following options:
  • Auto. Uses automatic memory settings. When you use Auto, configure Maximum Memory per Mapping.
  • A numeric value. Enter the numeric value that you want to use. The default unit of measure is bytes. Append KB, MB, or GB to the value to specify a different unit of measure. For example, 512MB.
Line Sequential Buffer Length
Number of bytes that the task reads for each row in a flat file source. Default is 1024.
* The mapping is a type of subtask that Data Profiling creates and runs for a data profiling task to process the data concurrently.
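The interaction between Minimum Number of Rows for Split Process per Column and Maximum Number of Columns per Mapping can be pictured with a small sketch. This is an illustrative model only, not the product's implementation: the function name `plan_subtasks`, the chunking of columns into consecutive groups, and the handling of a row count exactly equal to the minimum are all assumptions.

```python
def plan_subtasks(row_count, columns,
                  min_rows_for_split=100_000_000,
                  max_columns_per_mapping=50):
    """Return a list of column groups, one group per mapping subtask.

    Hypothetical model of the two options described in the table:
    - more rows than the minimum: one subtask for each source column
    - otherwise: columns are grouped, at most max_columns_per_mapping each
      (consecutive grouping is an assumption made for this sketch)
    """
    if row_count > min_rows_for_split:
        # Large source: one subtask per column.
        return [[column] for column in columns]

    # Smaller source: several columns share one mapping.
    step = max_columns_per_mapping
    return [columns[i:i + step] for i in range(0, len(columns), step)]


if __name__ == "__main__":
    cols = [f"col_{i}" for i in range(120)]
    groups = plan_subtasks(row_count=1_000_000, columns=cols)
    print([len(g) for g in groups])
```

Under these assumptions, a source with 120 columns and fewer rows than the minimum is grouped into mappings of 50, 50, and 20 columns with the default values.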
The default values for the advanced options have been derived to provide the best performance. However, you can configure the values based on your requirements. To optimize the data profiling task performance, see Tuning data profiling task performance.
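The buffer options in the table accept either Auto or a numeric value with an optional KB, MB, or GB suffix, with bytes as the default unit. The sketch below shows one way such a value could be normalized to bytes before you compare settings. The helper `parse_buffer_size` is hypothetical, and treating KB, MB, and GB as multiples of 1024 is an assumption, because the text above does not say whether the units are binary or decimal.

```python
import re

# Assumption: binary multiples. The source text does not specify this.
_UNIT_FACTORS = {"": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_buffer_size(value):
    """Convert a setting such as '512MB' to bytes; return None for 'Auto'."""
    text = value.strip()
    if text.lower() == "auto":
        return None  # automatic memory settings, no fixed size

    match = re.fullmatch(r"(\d+)\s*(KB|MB|GB)?", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"Unrecognized buffer size: {value!r}")

    number, unit = match.groups()
    # A missing unit means the value is already in bytes.
    return int(number) * _UNIT_FACTORS[(unit or "").upper()]


if __name__ == "__main__":
    for setting in ("Auto", "1024", "512MB"):
        print(setting, "->", parse_buffer_size(setting))
```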
You can configure the following advanced options for a profile with Avro or Parquet source objects:
  • Maximum Number of Value Frequency Pairs
  • Maximum Number of Patterns
  • Pattern Threshold Percentage
  • Detect Outliers
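To make the effect of Maximum Number of Patterns and Pattern Threshold Percentage concrete, the following sketch consolidates pattern counts into an Others bucket. It is a rough illustration rather than the product's algorithm: the function `summarize_patterns`, the sample counts, and the assumption that a pattern must satisfy both limits to appear individually are not taken from the documentation.

```python
from collections import Counter


def summarize_patterns(pattern_counts, max_patterns=10, threshold_pct=5.0):
    """Split patterns into those shown individually and an 'Others' total.

    Assumption for this sketch: a pattern is shown individually only if it
    is among the top max_patterns by count AND covers at least
    threshold_pct percent of all values.
    """
    total = sum(pattern_counts.values())
    if total == 0:
        return [], 0

    shown = []
    others = 0
    # most_common() orders patterns from highest to lowest count.
    for pattern, count in Counter(pattern_counts).most_common():
        pct = 100.0 * count / total
        if len(shown) < max_patterns and pct >= threshold_pct:
            shown.append((pattern, count))
        else:
            others += count
    return shown, others


if __name__ == "__main__":
    # Hypothetical pattern counts for one column.
    counts = {"9999": 60, "AAA": 30, "A9": 6, "9A": 4}
    print(summarize_patterns(counts, max_patterns=3, threshold_pct=5.0))
```

With the sample counts, the 4% pattern falls below the default threshold of 5 and is counted under Others, while the other three patterns are listed individually.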
