Data Profiling


Tuning data profiling task performance


You can tune a data profiling task by configuring the advanced options for the task in Data Profiling. You can also configure the number of concurrent tasks for a Secure Agent in Administrator.

To optimize the performance of data profiling tasks, Data Profiling creates subtasks for concurrent processing of profile jobs. The number of subtasks is based on the number of columns and rows in the data source and on the advanced options that you set for data profiling tasks.

By default, Data Profiling uses the following criteria to create subtasks:

Row-based: Creates one subtask for each column when the data source exceeds 100,000,000 rows. To modify the default value, configure the Minimum Number of Rows for Split Process per Column option. For example, if the source object has 50 columns and 101,000,000 rows, Data Profiling creates 50 subtasks. If the number of rows in the source object exceeds the default Minimum Number of Rows for Split Process per Column value, Data Profiling creates one subtask for each column in the source object.
Column-based: Creates one subtask for every 50 columns and rules when the data source contains 100,000,000 rows or fewer. To modify the default value, configure the Maximum Number of Columns per Mapping option. For example, if the source object has 80 columns and 10,000,000 rows, Data Profiling creates 2 subtasks. If the number of columns in the source object exceeds the default Maximum Number of Columns per Mapping* value, Data Profiling creates one subtask for every 50 columns and another subtask for the remaining columns.

Data Profiling prioritizes row-based criteria. To prioritize column-based criteria, set the Minimum Number of Rows for Split Process per Column option to a value that is greater than the actual number of rows in the source.
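The two criteria and their priority can be sketched as a small calculation. This is an illustrative estimate only, not how Data Profiling is implemented: the function name `estimate_subtasks` and its parameter names are hypothetical, and the sketch assumes that any rules are already counted in the `columns` argument for the column-based case.

```python
import math

# Defaults taken from the criteria described above.
DEFAULT_MIN_ROWS_FOR_SPLIT = 100_000_000
DEFAULT_MAX_COLUMNS_PER_MAPPING = 50


def estimate_subtasks(columns, rows,
                      min_rows_for_split=DEFAULT_MIN_ROWS_FOR_SPLIT,
                      max_columns_per_mapping=DEFAULT_MAX_COLUMNS_PER_MAPPING):
    """Estimate the subtask count (hypothetical helper, not a product API)."""
    if columns < 1 or rows < 0 or max_columns_per_mapping < 1:
        raise ValueError("columns and max_columns_per_mapping must be >= 1, rows >= 0")
    # Row-based criteria take priority: one subtask per column
    # when the row count exceeds the split threshold.
    if rows > min_rows_for_split:
        return columns
    # Column-based criteria: one subtask per full group of columns,
    # plus one more subtask for any remaining columns.
    return math.ceil(columns / max_columns_per_mapping)


print(estimate_subtasks(50, 101_000_000))  # 50 (row-based)
print(estimate_subtasks(80, 10_000_000))   # 2 (column-based)
```

Passing a `min_rows_for_split` value larger than the actual row count forces the column-based branch, which mirrors the prioritization note above.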

You can configure the advanced options on the Schedule tab for each data profiling task. The following table lists the advanced options and recommendations for optimum performance:

Option	Recommendations
Maximum Number of Value Frequency Pairs	Default is 500. Decrease or increase this value based on the business need.
Maximum Number of Patterns	Default is 10. Decrease or increase this value based on the business need.
Pattern Threshold Percentage	Default is 5. Decrease or increase this value based on the business need.
Infer Date and Time	By default, Data Profiling infers the date and time for a column of date or time data type. Clear this option if you do not want to infer the date and time for such columns in the data source. Inferring date and time is resource-intensive and might reduce Data Profiling performance.
Detect Outliers	By default, outliers are detected in the profile results. Clear this option if you do not want to detect and view outliers in the data source.
Minimum Number of Rows for Split Process per Column	Default is 100,000,000. Increase or decrease this value based on the business need. The row-based criteria use this option to optimize performance. For example, if you set the value to 100,000 and the source object has 100,500 rows and 30 columns, Data Profiling creates 30 subtasks, one for each column in the source object.
Maximum Number of Columns per Mapping*	Default is 50. Increase or decrease this value based on the business need. The column-based criteria use this option to optimize performance. For example, you set the value to 30 and the Minimum Number of Rows for Split Process per Column value to 100,000,000. If the source object contains 149 columns and 70,000 rows, Data Profiling creates one subtask for every 30 columns, which results in five subtasks. Four subtasks contain 30 columns each, and one subtask contains 29 columns.
Maximum Memory per Mapping	Default is 512 MB. Increase or decrease this value based on the business need.
Default buffer block size	Default is Auto. Enter a numeric value and append KB, MB, or GB to the value to increase or decrease the value based on the business need.
DTM buffer size	Default is Auto. Enter a numeric value and append KB, MB, or GB to the value to increase or decrease the value based on the business need. By default, a minimum of 12 MB is allocated to the buffer at run time. You might increase the DTM buffer size in the following circumstances: When a task contains large amounts of character data, increase the DTM buffer size to 24 MB. When a task contains n subtasks, increase the DTM buffer size to at least n times the value for the task with one subtask. When a source contains a large binary object with a precision larger than the allocated DTM buffer size, increase the DTM buffer size so that the task does not fail.
Line Sequential Buffer Length	Default is 1024. Increase the value if the source flat file records are larger than 1024 bytes.
* The mapping is a type of subtask. Data Profiling creates and runs subtasks for a data profiling task to process the data concurrently.
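The DTM buffer size guidance for tasks with several subtasks (at least n times the value used for a task with one subtask) is simple arithmetic, sketched below. The helper name `dtm_buffer_setting` is hypothetical, the 12 MB baseline is assumed here only because it is the documented run-time minimum, and the `60MB` output format is an assumption based on the instruction to append KB, MB, or GB to a numeric value.

```python
def dtm_buffer_setting(n_subtasks, single_subtask_mb=12):
    """Return a suggested minimum DTM buffer size string (hypothetical helper)."""
    if n_subtasks < 1 or single_subtask_mb <= 0:
        raise ValueError("n_subtasks must be >= 1 and single_subtask_mb must be > 0")
    # Scale the single-subtask value by the number of subtasks.
    total_mb = n_subtasks * single_subtask_mb
    # Assumed format: numeric value with the unit appended.
    return f"{total_mb}MB"


# Five subtasks at the assumed 12 MB baseline.
print(dtm_buffer_setting(5))      # 60MB
# Two subtasks with a 24 MB single-subtask value (character-heavy data).
print(dtm_buffer_setting(2, 24))  # 48MB
```

Treat the result as a lower bound to start from; the right value still depends on the business need and the resources available to the Secure Agent.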