Profiling and Discovery Sizing Guidelines

Corporate Profile Sizing Example

This example describes a large organization in which data stewards use profiles and scorecards as part of their daily work. Each data steward has a specific area of competency.
In addition, a few power users, such as data stewards and data architects, use all the profile types to ensure data quality on a larger set of data assets. The power users also perform detailed assessments of some of the critical data sources. The profile implementation consists of a single high-end node that is connected to the database servers and has ample disk space for flat files.
The following table describes the setup environment:
| Setup Component | Description |
|---|---|
| Users | The setup environment has 40 to 60 data stewards using the Analyst tool and one to three power users. |
| Hardware | 1 node, 32 cores, 128 GB, 8 x 2 TB disks, Linux, and 128 TB SAN. |
| Data | Flat files with 1 million to 1 billion rows. Relational tables with 1 million to 1 billion rows on multiple medium to high-end servers. |
| Profile type | Data stewards run column profiles and scorecards. Power users run enterprise discovery that includes column profile, data domain discovery, primary key discovery, and foreign key discovery. Power users also run unplanned overlap discovery and functional dependency discovery jobs. |
| Profiling warehouse | Set up on a different high-end server. |
| Model Repository Service | Set up on a different server. |
| Analyst Service | Set up on a different server. |
The following table describes the recommended configuration parameters for the setup environment:
| Parameter | Value |
|---|---|
| Maximum Execution Pool Size | 100 |
| Maximum Profile Execution Pool Size | 75 |
| Maximum Concurrent Profile Jobs | 15 |
| Maximum DB Connections | 5 |
| Maximum Concurrent Profile Threads | 5 |
| DIS: Temporary Directories* | 5 |

*Each directory on a different disk.

Analysis

Not all data stewards can run a profile or scorecard at the same time on a Data Integration Service machine with 32 cores. If all data stewards submit a profile job or scorecard job at once, the persisted queue of the Profiling Service Module might contain up to 35 queued jobs.
The assumption in this use case is that a data steward performs the following jobs daily:
  • Runs a profile or scorecard
  • Performs data analysis or other basic operations
  • Runs tasks that do not depend on the Profiling Service Module
The assumption for the maximum number of concurrent profile jobs is 25.
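The queue estimate in this analysis follows from simple arithmetic: the jobs that exceed the assumed ceiling of 25 concurrent profile jobs wait in the queue. The following is a minimal sketch of that calculation. The function name `queued_jobs` is hypothetical and is not part of any Informatica API.

```python
def queued_jobs(submitted: int, max_concurrent: int) -> int:
    """Return how many submitted jobs must wait in the queue."""
    return max(0, submitted - max_concurrent)

# Upper bound of 60 data stewards, 25 concurrent jobs as assumed in the analysis.
print(queued_jobs(60, 25))  # 35

# Lower bound of 40 data stewards with the same ceiling.
print(queued_jobs(40, 25))  # 15
```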
When a power user submits an enterprise discovery job, the Profiling Service Module adds the discovery job to the same queue as the column profile jobs that data stewards run. However, enterprise discovery jobs run with a reduced priority by default, which minimizes interference with the profile jobs that data stewards run. Power users can plan to run the larger jobs when the Profiling Service Module is not at job capacity with the profile jobs that data stewards run.
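The effect of a reduced priority on a shared queue can be illustrated with a toy model. The sketch below is not how the Profiling Service Module is implemented; the class `JobQueue`, the priority constants, and the job names are all hypothetical. It only shows that a lower-priority discovery job submitted first still runs after the pending profile and scorecard jobs.

```python
import heapq
from itertools import count

# Hypothetical priority levels: a lower number runs first.
PROFILE_PRIORITY = 0
DISCOVERY_PRIORITY = 1

class JobQueue:
    """Toy priority queue; jobs of equal priority run in submission order."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker that preserves FIFO order

    def submit(self, name, priority):
        heapq.heappush(self._heap, (priority, next(self._order), name))

    def next_job(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

queue = JobQueue()
queue.submit("enterprise_discovery_1", DISCOVERY_PRIORITY)
queue.submit("column_profile_1", PROFILE_PRIORITY)
queue.submit("scorecard_1", PROFILE_PRIORITY)

run_order = [queue.next_job() for _ in range(len(queue))]
print(run_order)
# ['column_profile_1', 'scorecard_1', 'enterprise_discovery_1']
```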
Set the Maximum Profile Execution Pool Size parameter to the Maximum Concurrent Profile Jobs parameter value multiplied by the Maximum DB Connections parameter value. The basis for this recommendation is that data stewards are likely to use different database servers, so profile jobs do not overload any single server. The maximum number of concurrent profile queries that you can run in this configuration is 125.
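The multiplication rule can be checked against the numbers in this example. With the values from the configuration table (15 concurrent profile jobs and 5 database connections), the product is 75, which matches the Maximum Profile Execution Pool Size in the table. With the 25 concurrent jobs assumed in the analysis, the product is the 125 quoted above. The helper name `profile_pool_size` is hypothetical and used only for illustration.

```python
def profile_pool_size(max_concurrent_jobs: int, max_db_connections: int) -> int:
    """Maximum Concurrent Profile Jobs multiplied by Maximum DB Connections."""
    return max_concurrent_jobs * max_db_connections

# Values from the configuration table.
print(profile_pool_size(15, 5))  # 75

# Concurrent job count assumed in the analysis text.
print(profile_pool_size(25, 5))  # 125
```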
If there are fewer database servers than in this use case, you can decrease the Maximum DB Connections parameter value to 3 or 4. You can also adjust the Maximum Execution Pool Size and Maximum Profile Execution Pool Size parameters. You can set the Maximum Execution Pool Size parameter to a higher value than the Maximum Profile Execution Pool Size parameter value. Use the recommended setting so that drill-down operations and previews remain responsive even if other profile jobs use all the remaining Data Integration Service threads.
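As a rough illustration of this adjustment, the sketch below recomputes the profile pool size for 3, 4, and 5 database connections, using the table values of 15 concurrent profile jobs and a Maximum Execution Pool Size of 100. It treats the difference between the two pool sizes as headroom for drill-down operations and previews, which is an interpretation of the paragraph above rather than a documented formula. The function name `pool_headroom` is hypothetical.

```python
def pool_headroom(max_execution_pool: int, max_profile_pool: int) -> int:
    """Threads left over when profile jobs use their whole pool."""
    if max_execution_pool <= max_profile_pool:
        raise ValueError(
            "Maximum Execution Pool Size should exceed "
            "Maximum Profile Execution Pool Size"
        )
    return max_execution_pool - max_profile_pool

MAX_EXECUTION_POOL = 100     # from the configuration table
MAX_CONCURRENT_JOBS = 15     # from the configuration table

for db_connections in (3, 4, 5):
    profile_pool = MAX_CONCURRENT_JOBS * db_connections
    spare = pool_headroom(MAX_EXECUTION_POOL, profile_pool)
    print(db_connections, profile_pool, spare)
# 3 45 55
# 4 60 40
# 5 75 25
```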
The data stewards might concurrently run profiles on multiple flat files. To ensure scalability, you can increase the number of temporary directories on separate disks so that the temporary, intermediate profiling results have the maximum I/O throughput.
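To see why more temporary directories on separate disks help, consider a simple round-robin assignment of concurrent flat-file jobs to directories. This is only an illustration: the directory paths and job names are hypothetical, and the source does not describe how the Data Integration Service actually distributes temporary files.

```python
from itertools import cycle

# Hypothetical temporary directories, one per physical disk.
temp_dirs = [f"/disk{i}/profile_tmp" for i in range(1, 6)]

def assign_temp_dirs(jobs, dirs):
    """Spread jobs across directories in round-robin order."""
    return dict(zip(jobs, cycle(dirs)))

jobs = [f"flat_file_{n}" for n in range(1, 8)]
assignment = assign_temp_dirs(jobs, temp_dirs)

for job, directory in assignment.items():
    print(job, "->", directory)
```

With five directories and seven jobs, the sixth and seventh jobs share a disk with the first and second. Adding directories on further disks reduces that sharing.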
