Profiling and Discovery Sizing Guidelines

Scaling

Profiling scales linearly. Doubling the size of the input doubles the time it takes to process the input. Similarly, doubling the processing power halves the time to process the same data.
You can consider scaling in two ways. The first is how the processing time for a fixed set of data changes when the hardware changes, either by adding more resources to a single machine or by creating a grid of machines. The second is how a profile function performs when the number of columns or rows changes for a fixed hardware configuration.
The worksheets that estimate resources are based on general recommendations. Budget constraints might prevent you from deploying the recommendation on a single Informatica Data Integration Service node.
For example, a node with 128 cores might exceed your budget. However, the Profiling Service Module can scale across an Informatica grid, so you can replace the 128-core machine with a grid of eight nodes with 16 cores each. The Profiling Service Module divides the workload evenly among these nodes. To realize the performance gains from the additional hardware, adjust the profiling configuration parameters appropriately.
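As a rough illustration of the linear scaling model, the following sketch estimates run time from a measured baseline. The baseline figures and the assumption that cores in a grid contribute the same as cores in a single node are illustrative assumptions, not Informatica-published numbers.

```python
# Hypothetical illustration of the linear scaling model described above:
# run time grows with data volume and shrinks with total processing power.

def estimate_runtime_hours(baseline_hours, baseline_rows, baseline_cores,
                           target_rows, target_cores):
    """Scale a measured baseline linearly in rows and inversely in cores."""
    return baseline_hours * (target_rows / baseline_rows) * (baseline_cores / target_cores)

# Assumed baseline: a profile that took 4 hours on 1 billion rows with 16 cores.
baseline = dict(baseline_hours=4.0, baseline_rows=1_000_000_000, baseline_cores=16)

# Same data on a single 128-core node versus a grid of 8 nodes x 16 cores.
single_node = estimate_runtime_hours(**baseline, target_rows=1_000_000_000, target_cores=128)
grid = estimate_runtime_hours(**baseline, target_rows=1_000_000_000, target_cores=8 * 16)

print(f"128-core node: {single_node:.2f} h, 8 x 16-core grid: {grid:.2f} h")
```

Under this model, the 128-core node and the 8 x 16-core grid produce the same estimate, which is why the grid can substitute for the larger machine.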
The scaling strategy depends on the profile job type:
Column Profile
Column profile uses both small source and large source strategies. When the data source contains fewer than 100,000 rows, the profile runs on all the columns for all the rows at the same time. This small source strategy applies to both relational and flat file data sources. When the data source contains more than 100,000 rows, the profile makes successive passes over the data to compute the column aggregate statistics, such as value frequencies, patterns, and data types. Each pass processes one or more columns based on the data source.
For relational sources, where the column profiling query can be pushed down to the database, the profile processes one column at a time. If the RDBMS can handle the additional queries, doubling the number of connections reduces the processing time by half. Flat file sources batch process five columns at a time by default. To reduce the time required to run a column profile by half, you can double the number of cores. When you double the cores, adjust the other hardware components as well. For example, add memory for additional buffering while sorting, and increase the number of temporary disks to increase the disk throughput.
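The following sketch shows one way to reason about flat file column profiling time from the default batch of five columns per pass. The per-pass timing, core counts, and the assumption that added cores translate directly into shorter passes are hypothetical.

```python
import math

# Hypothetical estimate of flat file column profile work: columns are processed
# in batches (five per pass by default), and doubling the cores roughly halves
# the time, provided memory and temporary disk throughput are increased too.

def flat_file_profile_minutes(num_columns, minutes_per_pass, cores,
                              baseline_cores=16, columns_per_pass=5):
    passes = math.ceil(num_columns / columns_per_pass)
    # Linear scaling assumption: time shrinks in proportion to added cores.
    return passes * minutes_per_pass * (baseline_cores / cores)

print(flat_file_profile_minutes(num_columns=200, minutes_per_pass=3, cores=16))  # 120.0 minutes
print(flat_file_profile_minutes(num_columns=200, minutes_per_pass=3, cores=32))  # 60.0 minutes
```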
Data Domain Discovery
The scaling strategy for data domain discovery is similar to that of column profiles. For small sources, the profile runs the data domain discovery algorithm on all rows at the same time. The large source approach differs based on the data source type.
When you run data domain discovery on relational sources, the profile pushes the query processing for each column down to the database. Therefore, the profile scales with the number of connections to the RDBMS, assuming the RDBMS can handle the queries. Data domain discovery runs batch jobs that evaluate data domains against columns, with default batch sizes of 20 data domains and 50 columns. Additional cores help the profile scale if the data source has a large number of columns and data domains. Compared with column profiles, data domain discovery does not need additional memory or temporary disk spindles unless the data domain rule logic requires the additional resources.
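A rough way to estimate the amount of batch work is sketched below. The interpretation of the defaults as batches of 20 data domains against 50 columns, and the example counts, are illustrative assumptions rather than a documented formula.

```python
import math

# Hypothetical batch-count estimate for data domain discovery, using the
# default batch sizes of 20 data domains and 50 columns.

def domain_discovery_batches(num_domains, num_columns,
                             domains_per_batch=20, columns_per_batch=50):
    return math.ceil(num_domains / domains_per_batch) * math.ceil(num_columns / columns_per_batch)

# Example: 60 data domains evaluated against a 400-column table.
print(domain_discovery_batches(num_domains=60, num_columns=400))
# 3 domain batches x 8 column batches = 24 batch jobs
```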
Key and Functional Dependency Discovery
Both the key discovery and functional dependency discovery algorithms usually operate on a small sample of the data source, typically fewer than 100,000 rows. The algorithms are not multithreaded and do not benefit from additional CPU cores.
Each algorithm reads the sample and then makes successive passes over the data, and the Profiling Service Module writes large intermediate results to temporary files. Adding more memory and up to two faster disk spindles makes the algorithms perform faster. Adding CPU cores does not improve performance.
Overlap and Foreign Key Discovery
Both the overlap discovery and foreign key discovery algorithms operate in two steps. The first step computes the signatures of the data sources, and the second step computes either the overlaps or the foreign keys.
Signature computation is CPU intensive and is the most time-consuming step. It scales with the number of cores: when you double the number of cores, signature computation takes half the time. The overlap computation step does not scale because it is single-threaded by design. Foreign key discovery partitions the job and scales with additional CPU cores.
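The two-step behavior suggests a simple timing model for overlap discovery, sketched below with assumed figures: the signature step scales with the number of cores, while the overlap computation step stays fixed because it is single-threaded.

```python
# Hypothetical timing model for overlap discovery: the signature step scales
# with the number of cores, the single-threaded overlap step does not.

def overlap_discovery_hours(signature_hours, overlap_hours, cores, baseline_cores=16):
    scaled_signature = signature_hours * (baseline_cores / cores)
    return scaled_signature + overlap_hours  # overlap step does not speed up

# Example: 6 hours of signature work and 1 hour of overlap computation at 16 cores.
print(overlap_discovery_hours(signature_hours=6, overlap_hours=1, cores=16))  # 7.0
print(overlap_discovery_hours(signature_hours=6, overlap_hours=1, cores=32))  # 4.0
```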
Enterprise Discovery
Enterprise discovery scales to the extent that column profiling, data domain discovery, key discovery, and foreign key discovery scale. Because multiple profile functions can run in parallel, adding CPU cores helps enterprise discovery scale. However, when you add more CPU cores, you must also increase the number of disk spindles because the Profiling Service Module writes intermediate results to the temporary disk.
