Data Discovery Guide


Column Profile Settings

The sampling options determine whether the Analyst tool runs a column profile on all rows of the data sources or on a limited number of rows.

The following table describes the column profile settings that you can configure for an enterprise discovery profile:

Option	Description
Enable column profiling	Runs a column profile as part of enterprise discovery.
Exclude approved data types and data domains from the data type and data domain inference in the subsequent profile runs	Excludes approved data types and data domains from data type and data domain inference, starting with the next profile run.

The following table describes the run-time environment option that you can configure for an enterprise discovery profile:

Option	Description
Native	The Analyst tool submits the profile jobs to the Profiling Service Module. The Profiling Service Module then breaks down the profile jobs into a set of mappings. The Data Integration Service runs these mappings and writes the profile results to the profiling warehouse.
Blaze	The Data Integration Service pushes the profile logic to the Blaze engine on the Hadoop cluster to run profiles.
Spark	The Data Integration Service pushes the profile logic to the Spark engine on the Hadoop cluster to run profiles.

The following table describes the sampling options that you can configure for an enterprise discovery profile:

Option	Description
All Rows	Runs a column profile on all rows in the data source. Supported on the Native, Blaze, and Spark run-time environments.
First <number> Rows	Runs a profile on the specified number of rows, starting from the first row in the data object. You can choose a maximum of 2,147,483,647 rows. Supported on the Native and Blaze run-time environments.
Limit n <number> Rows	Runs a profile on the number of rows that you specify from the data object. When you choose to run a profile in the Hadoop validation environment, the Spark engine collects samples from multiple partitions of the data object and pushes the samples to a single node to compute the sample size. The Limit n sampling option supports Oracle, SQL Server, and DB2 databases. You cannot apply the Advanced filter with the Limit n sampling option. You can select a maximum of 2,147,483,647 rows. Supported on the Spark run-time environment.
Random percentage	Runs a profile on a percentage of rows in the data object. Supported on the Spark run-time environment.
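
The support rules in the sampling table can be sketched as a small lookup. The following Python sketch is illustrative only: check_sampling, SUPPORTED_ENVIRONMENTS, ROW_COUNT_OPTIONS, and MAX_SAMPLE_ROWS are hypothetical names, not part of the Analyst tool or any Informatica API. It encodes only the run-time environment support and the row maximum listed in the table.

```python
# Hypothetical helper for illustration; not an Informatica API.

# Row maximum stated for the First <number> Rows and Limit n options.
MAX_SAMPLE_ROWS = 2_147_483_647

# Sampling option -> run-time environments listed as supported in the table.
# The short option keys are an assumption made for this sketch.
SUPPORTED_ENVIRONMENTS = {
    "All Rows": {"Native", "Blaze", "Spark"},
    "First Rows": {"Native", "Blaze"},
    "Limit n": {"Spark"},
    "Random percentage": {"Spark"},
}

# Options that take a row count.
ROW_COUNT_OPTIONS = {"First Rows", "Limit n"}


def check_sampling(option, environment, rows=None):
    """Return a list of problems; an empty list means the choice matches the table."""
    if option not in SUPPORTED_ENVIRONMENTS:
        return [f"unknown sampling option: {option}"]

    problems = []
    if environment not in SUPPORTED_ENVIRONMENTS[option]:
        problems.append(f"{option} is not listed as supported on {environment}")
    if option in ROW_COUNT_OPTIONS and rows is not None and rows > MAX_SAMPLE_ROWS:
        problems.append(f"{rows} rows exceeds the maximum of {MAX_SAMPLE_ROWS}")
    return problems


if __name__ == "__main__":
    print(check_sampling("All Rows", "Spark"))
    print(check_sampling("First Rows", "Spark", rows=1000))
    print(check_sampling("Limit n", "Spark", rows=MAX_SAMPLE_ROWS + 1))
```

A check like this could run before a profile is submitted, so that an unsupported combination such as First <number> Rows on Spark is caught early rather than at run time.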