Profiling and Discovery Sizing Guidelines

Enterprise Profile Sizing Example

This example summarizes a large organization with a competency center where data stewards run profiles and scorecards as part of their daily jobs. Each data steward has a specific area of competency.
In addition, the organization has a large number of data architects who ensure consistency in quality, structure, and content across all data assets in the organization. The number of data assets can reach many thousands. The organization uses a grid configuration to process profile and enterprise discovery jobs at scale.
The following table describes the setup environment:
| Setup Component | Description |
| --- | --- |
| Users | The setup environment has 30 to 40 data analysts on the Analyst tool supported by 5 to 10 data architects on the Developer tool. |
| Hardware | Grid - 4 x 1 node, 12 cores, 64 GB, 6 x 2 TB disks, Linux, and 128 TB SAN. |
| Data | Flat files with 1 million to 1 billion rows. Relational tables with 1 million to 1 billion rows on multiple medium to high-end servers. |
| Profile type | Data stewards run column profiles, data domain discovery, and scorecards. Data architects run enterprise discovery that includes column profile, data domain discovery, primary key discovery, and foreign key discovery. Data architects also run unplanned overlap discovery and functional dependency discovery jobs. |
| Profiling warehouse | Set up on a different high-end server. |
| Model Repository Service | Set up on a different server. |
| Analyst Service | Set up on a different server. |
The following table describes the recommended configuration parameters for the setup environment:
| Parameter | Value |
| --- | --- |
| Maximum Execution Pool Size | 100 |
| Maximum Profile Execution Pool Size | 50 |
| Maximum Concurrent Profile Jobs | 24 |
| Maximum DB Connections | 5 |
| Maximum Concurrent Profile Threads | 5 |
| DIS: Temporary Directories | 3. You must configure this value separately for each process node in the grid. Each directory is on a different disk. |
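The temporary directory requirement (at least three directories, each on a different disk) can be checked with a short script. The following is a minimal sketch, not part of the product: the function name `check_temp_dirs` and the example paths are hypothetical, and it assumes that a distinct `st_dev` value from `os.stat` is a reasonable stand-in for "a different disk", which might not hold on every storage layout.

```python
import os

def check_temp_dirs(paths, minimum=3, device_of=lambda p: os.stat(p).st_dev):
    """Return a list of problems found; an empty list means the check passed."""
    problems = []
    if len(paths) < minimum:
        problems.append(f"expected at least {minimum} directories, got {len(paths)}")
    seen = {}  # device id -> first directory seen on that device
    for path in paths:
        dev = device_of(path)
        if dev in seen:
            problems.append(f"{path} shares a device with {seen[dev]}")
        else:
            seen[dev] = path
    return problems

# Illustration with a made-up device lookup instead of real mount points.
fake_devices = {"/disk1/tmp": 1, "/disk2/tmp": 2, "/disk3/tmp": 3}
print(check_temp_dirs(list(fake_devices), device_of=fake_devices.get))  # prints []
```

On a real node, you would call `check_temp_dirs` with the configured directories and the default `device_of`, once per process node in the grid.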

Analysis

This example requires a challenging configuration for optimal enterprise discovery on a single database server. You need to balance the capacity of the grid to run many profile jobs against the load on the relational database, because the column profile jobs push their computation down to the database.
When you run enterprise discovery, the number of concurrent profile jobs can reach the Maximum Concurrent Profile Jobs value. Each profile job can run as many concurrent queries as the Maximum DB Connections parameter value allows. This mechanism can easily overload the database if you do not set the configuration appropriately. To prevent database overload, you can decrease the Maximum DB Connections parameter value to 3.
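To see why the database can be overloaded, it helps to estimate the worst-case number of simultaneous queries. The sketch below assumes a simple multiplicative model in which every concurrent profile job uses all of its database connections at once. The function name `peak_db_queries` is illustrative, and actual load depends on the jobs and the database.

```python
def peak_db_queries(max_concurrent_profile_jobs: int, max_db_connections: int) -> int:
    """Upper bound on simultaneous queries, assuming every job uses all its connections."""
    return max_concurrent_profile_jobs * max_db_connections

# Values from the recommended configuration table.
recommended = peak_db_queries(24, 5)
# Same job limit with Maximum DB Connections decreased to 3.
reduced = peak_db_queries(24, 3)

print(recommended, reduced)  # prints: 120 72
```

Under this assumption, decreasing Maximum DB Connections from 5 to 3 lowers the upper bound from 120 to 72 concurrent queries.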
The data architects who submit the enterprise discovery jobs can monitor the jobs and then adjust the Maximum DB Connections parameter value based on the database server usage in the profile deployment environment.
The grid environment distributes the column profile mappings for flat files to the grid nodes in round-robin fashion. Therefore, you can increase the Maximum Concurrent Profile Threads parameter value from 1 to 2 to allow more column profile mappings to run, possibly on different nodes. In addition, you must have a minimum of three temporary directories. You can configure more temporary directories to increase performance.
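Round-robin distribution can be pictured with a small sketch. This is an illustration of the general technique only, not the grid's actual scheduler. The node names, mapping names, and the function `assign_round_robin` are hypothetical.

```python
from itertools import cycle

def assign_round_robin(mappings, nodes):
    """Assign each mapping to the next node in turn, wrapping around at the end."""
    assignment = {node: [] for node in nodes}
    for mapping, node in zip(mappings, cycle(nodes)):
        assignment[node].append(mapping)
    return assignment

# Four grid nodes, as in the hardware description; six made-up mappings.
nodes = ["node1", "node2", "node3", "node4"]
mappings = [f"column_profile_mapping_{i}" for i in range(1, 7)]

plan = assign_round_robin(mappings, nodes)
for node, assigned in plan.items():
    print(node, assigned)
```

With six mappings and four nodes, the first two nodes receive two mappings each and the remaining nodes receive one each. This is why producing more mappings per job can spread the work across more nodes.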
