Data Profiling

Frequently Asked Questions
Sometimes, I see low performance with the default Minimum Number of Rows for Split Process per Column and Maximum Number of Columns per Mapping values. How can I improve the data profiling task performance?
This issue might occur for large data sources where the number of rows is less than 100,000,000 and there are more than 50 columns. In this case, Data Profiling chooses column-based criteria and creates one subtask for every 50 columns, which consumes a lot of memory and processing power.
To resolve this issue, you can set the Minimum Number of Rows for Split Process per Column option to a higher value and the Maximum Number of Columns per Mapping option to a lower value.
For example, if a data source contains 10,000,000 rows and 100 columns, the data profiling task creates two subtasks with the default configuration, which consumes a lot of memory and results in a longer run time. In this case, you can retain the default value of Minimum Number of Rows for Split Process per Column and set the Maximum Number of Columns per Mapping option to 25. Data Profiling then creates four subtasks, which optimizes performance and resource utilization. In addition, you can increase the Secure Agent concurrency from the default of 2 to n, where n = Integer (0.8 * number of cores) on the machine where the Secure Agent runs.
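The following Python sketch is not part of Data Profiling; it only illustrates the split arithmetic described above, under the assumption that the number of s_profiling subtasks equals the column count divided by the Maximum Number of Columns per Mapping value, rounded up, whenever the row count is below the Minimum Number of Rows for Split Process per Column value:

  import math

  def estimate_mapping_subtasks(rows, columns,
                                min_rows_for_split=100_000_000,
                                max_columns_per_mapping=50):
      # Column-based split: one s_profiling subtask per group of columns.
      if rows < min_rows_for_split:
          return math.ceil(columns / max_columns_per_mapping)
      # Row-based splitting applies instead; not modeled in this sketch.
      return None

  # 10,000,000 rows and 100 columns:
  print(estimate_mapping_subtasks(10_000_000, 100))                              # 2 with the defaults
  print(estimate_mapping_subtasks(10_000_000, 100, max_columns_per_mapping=25))  # 4 after the change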
Can I increase the Secure Agent concurrency to optimize Data Profiling performance?
Yes. In addition to configuring the advanced options for data profiling tasks, you can configure the Secure Agent concurrency, which impacts Data Profiling performance.
For example, assume that a data source contains 10,000,000 rows and 100 columns and the machine on which the Secure Agent runs has 4 cores.
To optimize the performance, perform the following steps:
  1. In Administrator, configure maxDTMProcesses to a value n, where n = Integer (0.8 * number of cores) on the machine where the Secure Agent runs. In this case, set maxDTMProcesses to 3.
  2. In Data Profiling, create a profile for the data source.
  3. On the Schedule page, set Maximum Number of Columns per Mapping to 15.
  4. Save and run the profile.
In this case, Data Profiling generates 7 subtasks.
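As a quick check of the numbers in this example, and assuming as in the earlier sketch that the subtask count is the column count divided by the Maximum Number of Columns per Mapping value, rounded up:

  import math

  cores = 4
  max_dtm_processes = int(0.8 * cores)        # Integer (0.8 * 4) = 3
  mapping_subtasks = math.ceil(100 / 15)      # 100 columns, 15 per mapping = 7 subtasks
  print(max_dtm_processes, mapping_subtasks)  # 3 7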
When I configure the Maximum Number of Columns per Mapping option to 20 for a data source with 100 columns, I see 8 subtasks for the profile on the My Jobs tab. Why do I see more subtasks than expected?
When you create and run a profile, the following subtasks are generated and run:
  • Fetching the source row count-<number_of_chosen_rows>. This task is generated only once for a profile run.
  • Generating data profiling mappings. This subtask is generated only once for a profile run.
  • s_profiling. The number of subtasks generated is based on the Minimum Number of Rows for Split Process per Column and Maximum Number of Columns per Mapping values. In this case, five subtasks are generated.
  • Loading data from staging area to metric store. This task is generated only once for a profile run.
In this case, the profile run creates a total of eight subtasks.
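For illustration only, and with the same assumption about how s_profiling subtasks are counted, the arithmetic for this example is:

  import math

  s_profiling_subtasks = math.ceil(100 / 20)    # 100 columns, 20 per mapping = 5
  fixed_subtasks = 3                            # row count fetch, mapping generation, load to metric store
  print(s_profiling_subtasks + fixed_subtasks)  # 8 subtasks in total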
What are the connections that create multiple mapping subtasks for a profile job?
The following is a list of connections that create multiple mapping subtasks for a profile job:
  • Oracle
  • SQL Server
  • Flat File
  • Azure Synapse SQL (ODBC)
  • Amazon Redshift v2
  • Snowflake Data Cloud
How can I confirm the number of mapping subtasks each profile job creates?
You can view the count of s_profiling jobs listed on the My Jobs tab. To view the s_profiling jobs, click the subtasks link on the My Jobs tab. For example, the following image displays a sample My Jobs tab with s_profiling jobs.
The profile job does not fail even if the Stop on Errors field is set to a value that is less than or equal to the sum of rejected rows.
This issue occurs when you apply multiple rules to a profile job. Data Profiling considers the maximum number of rows rejected by an individual rule. When a profile job includes multiple rules, Data Profiling stops the profile job on errors when the number of rows rejected by any single rule crosses the configured value, not when the sum of rows rejected across all rules crosses it.
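A minimal sketch with hypothetical reject counts shows how comparing the Stop on Errors value against the largest per-rule count differs from comparing it against the sum:

  rejected_rows_per_rule = {"rule_a": 40, "rule_b": 35, "rule_c": 10}  # hypothetical counts
  stop_on_errors = 60

  # The threshold is compared with the largest per-rule reject count,
  # not with the sum across all rules (85 here).
  job_stops = max(rejected_rows_per_rule.values()) > stop_on_errors
  print(job_stops)  # False: no single rule rejected more than 60 rows, so the job keeps running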
I cannot view auto-assigned rules for the profiles after I change the source object or connection of my profile.
Data Profiling does not assign rules automatically when you edit and make changes to a profile.
The automatic rule association feature does not work for profiles that I created in R36.
Data Profiling associates rules automatically only if you create new profiles in the 2021.07.M release (R37).
A profile job completes with the following warning message:
Record length [] is longer than line sequential buffer length [] for <file location>. Record will be rejected.
How do I resolve this issue?
To resolve this issue, increase the value of the Line Sequential Buffer Length parameter until the warning no longer appears. The parameter is located under the Advanced Options section on the Schedule tab of the profile.
  • The line sequential buffer length is used for delimited and fixed-width flat files.
  • An easy way to calculate the line sequential buffer length is to increase the value in multiples of two of the record length value that appears in the warning message.
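For example, a hypothetical calculation based on the record length reported in the warning message:

  reported_record_length = 3000                               # hypothetical value from the warning
  line_sequential_buffer_length = 2 * reported_record_length  # 6000; double again if the warning persists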
Why is the length of the value frequency higher than the actual value frequency length on the Results page?
This issue might occur if the column data contains special characters such as an apostrophe ('). To resolve this issue, you can remove the special characters from the column data and rerun the profile.
Why does the number of successful and unsuccessful rows in the preview not match the number of rows in the scorecard dashboard when I run a profile with random sampling for a specific number of rows?
The profile run honors the sampling options, whereas the drill-down feature does not. As a result, the number of successful and unsuccessful rows shown in the preview does not match the total number of rows shown in the Rule Occurrences table of the scorecard dashboard.
Data Profiling creates automatic rule specifications when insight validations have failed for columns that are either deselected from the profile definition or deleted from the source. Can I delete these rules?
Yes, you can manually delete these rules from the Explorer tab.
Can you run native profiles using the serverless runtime environment?
Yes, you can run native profiles using the serverless runtime environment, but you cannot run Spark profiles.
Can you drill down on or query a profile that runs on a serverless runtime environment?
No, you cannot drill down on or query a profile that runs on a serverless runtime environment.
