Tuning data profiling task performance


You can tune a data profiling task by configuring its advanced options in Data Profiling. You can also configure the number of concurrent tasks for a Secure Agent in Administrator.
To optimize the performance of data profiling tasks, Data Profiling creates subtasks for concurrent processing of profile jobs. The number of subtasks is based on the number of columns and rows in the data source and on the advanced options that you set for data profiling tasks.
By default, Data Profiling uses the following criteria to create subtasks:
Row-based
Creates one subtask for each column when the data source exceeds 100,000,000 rows. To modify the default value, configure the Minimum Number of Rows for Split Process per Column option. If the number of rows in the source object exceeds the value of this option, Data Profiling creates one subtask for each column in the source object. For example, if the source object has 50 columns and 101,000,000 rows, Data Profiling creates 50 subtasks.
Column-based
Creates one subtask for every 50 columns and rules when the data source contains 100,000,000 rows or fewer. To modify the default value, configure the Maximum Number of Columns per Mapping* option. If the number of columns in the source object exceeds the Maximum Number of Columns per Mapping* value, Data Profiling creates one subtask for every 50 columns and another subtask for the remaining columns. For example, if the source object has 80 columns and 10,000,000 rows, Data Profiling creates 2 subtasks: one for the first 50 columns and one for the remaining 30 columns.
Data Profiling prioritizes row-based criteria. To prioritize column-based criteria, set the Minimum Number of Rows for Split Processing per Column option to a value that is greater than the actual number of rows in the source.
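The row-based and column-based criteria described above can be expressed as a short calculation. The following Python sketch is only an illustration of the documented rules, not the product's implementation. The function names column_groups and estimate_subtasks are hypothetical, the sketch counts columns only and does not model rules, and it assumes that a row count exactly equal to the threshold falls under the column-based criteria.

```python
# Defaults taken from the criteria described above.
DEFAULT_MIN_ROWS_FOR_SPLIT = 100_000_000      # Minimum Number of Rows for Split Process per Column
DEFAULT_MAX_COLUMNS_PER_MAPPING = 50          # Maximum Number of Columns per Mapping


def column_groups(columns, rows,
                  min_rows_for_split=DEFAULT_MIN_ROWS_FOR_SPLIT,
                  max_columns_per_mapping=DEFAULT_MAX_COLUMNS_PER_MAPPING):
    """Return the number of columns assumed to land in each subtask (hypothetical helper)."""
    if max_columns_per_mapping <= 0:
        raise ValueError("max_columns_per_mapping must be positive")
    if columns <= 0:
        return []
    # Row-based criteria take priority: one subtask per column.
    if rows > min_rows_for_split:
        return [1] * columns
    # Column-based criteria: full groups, plus one group for the remainder.
    full, remainder = divmod(columns, max_columns_per_mapping)
    return [max_columns_per_mapping] * full + ([remainder] if remainder else [])


def estimate_subtasks(columns, rows, **options):
    """Return the estimated number of subtasks (hypothetical helper)."""
    return len(column_groups(columns, rows, **options))


if __name__ == "__main__":
    print(estimate_subtasks(50, 101_000_000))   # row-based example from the text
    print(estimate_subtasks(80, 10_000_000))    # column-based example from the text
    print(column_groups(149, 70_000, max_columns_per_mapping=30))
```

Running the sketch with the examples from this topic gives 50 subtasks for the 50-column, 101,000,000-row source and 2 subtasks for the 80-column, 10,000,000-row source.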
You can configure the advanced options on the Schedule tab for each data profiling task. The following table lists the advanced options and recommendations for optimum performance:
Option
Recommendations
Maximum Number of Value Frequency Pairs
Default is 500. Decrease or increase this value based on the business need.
Maximum Number of Patterns
Default is 10. Decrease or increase this value based on the business need.
Pattern Threshold Percentage
Default is 5. Decrease or increase this value based on the business need.
Infer Date and Time
By default, Data Profiling infers the date and time for a column of date or time data type. Clear this option if you do not want to infer the date and time for a column of date or time data type in the data source. Inferring date and time is resource-intensive and might reduce Data Profiling performance.
Detect Outliers
By default, outliers are detected in the profile results. Clear this option if you do not want to detect and view outliers in the data source.
Minimum Number of Rows for Split Process per Column
Default is 100,000,000. Increase or decrease this value based on the business need. The row-based criteria use this option to optimize performance.
For example, if you set the value to 100,000 and the source object has 100,500 rows and 30 columns, Data Profiling creates 30 subtasks, one for each column in the source object.
Maximum Number of Columns per Mapping*
Default is 50. Increase or decrease this value based on the business need. The column-based criteria use this option to optimize performance.
For example, you set the value to 30 and the Minimum Number of Rows for Split Processing per Column value to 100,000,000. If the source object contains 149 columns and 70,000 rows, Data Profiling creates a subtask for every 30 columns, which results in five subtasks. Four subtasks contain 30 columns each, and one subtask contains 29 columns.
Maximum Memory per Mapping
Default is 512 MB. Increase or decrease this value based on the business need.
Default buffer block size
Default is Auto. Enter a numeric value and append KB, MB, or GB to the value to increase or decrease the value based on the business need.
DTM buffer size
Default is Auto. Enter a numeric value and append KB, MB, or GB to the value to increase or decrease the value based on the business need.
By default, a minimum of 12 MB is allocated to the buffer at run time.
You might increase the DTM buffer size in the following circumstances:
  • When a task contains large amounts of character data, increase the DTM buffer size to 24 MB.
  • When a task contains n subtasks, increase the DTM buffer size to at least n times the value for the task with one subtask.
  • When a source contains a large binary object with a precision larger than the allocated DTM buffer size, increase the DTM buffer size so that the task does not fail.
Line Sequential Buffer Length
Default is 1024. Increase the value if the source flat file records are larger than 1024 bytes.
* The mapping is a type of subtask. Data Profiling creates and runs subtasks for a data profiling task to process the data concurrently.
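The DTM buffer size guidance in the table above (a numeric value with a KB, MB, or GB suffix, scaled to at least n times the single-subtask value for a task with n subtasks) can be worked through with a small calculation. The following Python sketch is an illustration only. The helper names parse_size, format_size, and suggested_dtm_buffer are hypothetical, and the sketch assumes binary units (1 KB = 1024 bytes) and whole-number values, neither of which this topic states.

```python
# Assumption: binary units. The topic does not say whether KB means 1000 or 1024 bytes.
UNITS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size(text):
    """Convert a value such as '12MB' to bytes (hypothetical helper)."""
    text = text.strip().upper()
    suffix = text[-2:]
    if suffix not in UNITS:
        raise ValueError(f"expected a number followed by KB, MB, or GB: {text!r}")
    return int(text[:-2]) * UNITS[suffix]


def format_size(num_bytes):
    """Render bytes using the largest unit that divides the value evenly."""
    for suffix in ("GB", "MB", "KB"):
        factor = UNITS[suffix]
        if num_bytes >= factor and num_bytes % factor == 0:
            return f"{num_bytes // factor}{suffix}"
    raise ValueError("size must be a positive whole number of KB")


def suggested_dtm_buffer(single_subtask_size, subtasks):
    """Return at least n times the single-subtask value for n subtasks."""
    if subtasks < 1:
        raise ValueError("subtasks must be at least 1")
    return format_size(parse_size(single_subtask_size) * subtasks)


if __name__ == "__main__":
    # A task that needs 12 MB with one subtask and runs five subtasks.
    print(suggested_dtm_buffer("12MB", 5))
```

The result is a lower bound derived from the n-times rule, not a tuned value. Other circumstances listed above, such as large character data or large binary objects, can call for a larger setting.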
