Table of Contents

Search

  1. Preface
  2. Data Profiling
  3. Profiles
  4. Profile results
  5. Tuning data profiling task performance
  6. Troubleshooting

Data Profiling

Data Profiling

Insights

Insights

The Insights area displays the CLAIRE generated recommendations for your data that you can approve or reject.
The following table lists the properties that you can view on the
Insights
tab:
Property
Description
Insight Statement
Description or statement that explains the inferred insight.
Score
Shows the following scores for the inferred insights:
  • High. Data anomaly is high.
  • Medium. Data anomaly is medium.
  • Low. Data anomaly is low.
You can classify and review the inferred insights from higher sores to lower scores.
Insight Type
Shows the following types of insights:
  • Completeness Check. Data appears incomplete. The column includes one or more null, blank, or empty values or values that contain only zeros.
  • Uniqueness Check. The majority of the data values in the column are unique.
  • Column Length Deviation. The length of the data values in the column has a high standard deviation.
  • Number Value Distribution. Numeric values found outside the 95% standard deviation range.
  • Date Validity Check. One or more dates do not comply with a valid date pattern.
  • Date-Locale Check. One or more date values do not match the locale format.
  • Day-Date Distribution. Unusual distribution of day values in a date column.
  • Month-Date Distribution. Unusual distribution of month values in a date column.
  • Year-Date Distribution. Unusual distribution of year values in a date column.
  • Completeness Variation.
    • For integer or decimal data types: Unusual variation in the number of null values and values that contain only zeros in the column between the current profile run and the previous one to five profile runs.
    • For string, date, or timestamp data types: Unusual variation in the number of blank values, null values, and empty values in the column between the current profile run and the previous one to five profile runs.
  • Distinct Variation. Greater than 70% increase in the number of distinct values in the column between the current profile run and the previous one to five profile runs.
  • MinMax Variance. Greater than 70% increase in the difference between the minimum and maximum values in the column across the current and previous profile runs.
  • Top Pattern Stability. The topmost pattern of the column decreased by a large amount when compared to the previous profile run.
  • Spelling Analysis. The data values that are phonetically similar and contain inconsistent spelling.
  • Distribution Shift. The distribution of the data that might have shifted more than expected based on the mean and standard deviation of the profile that has run over multiple times.
  • Column Token Deviation. The number of tokens in a column that has a high standard deviation.
  • Special Characters. Columns with special characters that are not included in the top 80% of the patterns.
  • Null Date Analysis. String data type columns that might contain values from a default date pattern, such as 00/00/0000 and 99/99/9999.
Columns
Column name for which the insight is relevant.
Status
Status of the insight. When insights are generated for the first time, the status appears blank.
Each insight type follows an algorithm to look into columns, independently of other columns on the data set. These algorithms are based on the metrics calculated on every profile run. The following table lists the algorithm and the logic used to arrive at the scores for the inferred insights:
Insight Type
Algorithm
Score Interpretation
Completeness Check
Computes the percentage of total rows with null values, blank values, empty values, or values that contain only zeros in a column.
This insight type is applicable for columns with any of the following data types:
  • String
  • Date
  • Integer

    0 - OK

    92 to 100 - OK

    0 to 3 - Low

    3 to 5 - Medium

    5 to 8 - High

Uniqueness Check
Computes the percentage of non-unique rows based on the following formula:
Percentage of non-unique rows = (Total Rows - Unique Rows) / Total Rows * 100
Insights are generated if the computed percentage of non-unique rows is less than 3%.
If a column contains one or more null values, then the insight is not generated.
This insight type is applicable for columns with any of the following data types:
  • String
  • Date
  • Integer

    0 - OK

    0 to 2 - High

    2 to 3 - Low

Column Length Deviation
Computes the length of alphanumeric values or numeric values on value frequency that falls more than two times the standard deviation from the mean value.
This insight type is applicable for columns with any of the following data types:
  • String
  • Integer
  • Decimal

    0 - OK

    0 to 1 - Low

    1 to 5 - Medium

    5 - High

Number Value Distribution
Computes the percentage of value frequency values in relation to the total number of rows profiled that falls more than two times the standard deviation or falls out of 95% of the mean value.
This insight type is applicable for columns with any of the following data types:
  • String with all numeric patterns
  • Date
  • Integer

    0 - OK

    15 to 100 - OK

    0 to 1 - Low

    1 to 5 - Medium

    5 to15 - High

The score can never be 100%.
Date Validity Check
Checks only for columns that have Date as the inferred data type and computes the percentage of values with dates that do not comply with a valid date pattern.
Null values are ignored in the computation.
This insight type is applicable for columns that have String with date content as the data type.

    0 - OK

    0 to 5 - Low

    5 to 10 - Medium

    10 - High

Date-Locale Check
Checks for columns that have one or more date values that do not match the locale format. Computes the number of values that follow different date locale formats.
This insight type is applicable for columns with String data type.

    1 - OK

    2 - Medium

    3 - High

Day-Date Distribution
Extracts the day for the dates on the value frequency and calculates the mean and standard deviation. Computes the dates where the days fall over two times the standard deviation or falls out of 95% of the mean value.
This insight type is applicable for columns with any of the following data types:
  • String with date patterns
  • Date
  • Timestamp

    0 - OK

    15 to 100 - OK

    0 to 1 - Low

    1 to 5 - Medium

    5 to15 - High

The score can never be 100%.
Month-Date Distribution
Extracts the month for the dates on the value frequency and calculates the mean and standard deviation. Computes the dates where the days fall over two times the standard deviation or falls out of 95% of the mean value.
This insight type is applicable for columns with any of the following data types:
  • String with date patterns
  • Date
  • Timestamp

    0 - OK

    15 to 100 - OK

    0 to 1 - Low

    1 to 5 - Medium

    5 to15 - High

The score can never be 100%.
Year-Date Distribution
Extracts the year for the dates on the value frequency and calculates the mean and standard deviation. Computes the dates where the days fall over two times the standard deviation or falls out of 95% of the mean value.
This insight type is applicable for columns with any of the following data types:
  • String with date patterns
  • Date
  • Timestamp

    0 - OK

    15 to 100 - OK

    0 to 1 - Low

    1 to 5 - Medium

    5 to15 - High

The score can never be 100%.
Completeness Variation
Computes the variation on the number of null values and the values that contain only zeros in the column between the current profile run and the truncated mean of the last five profile runs, discarding the lowest and highest values. Uses the actual mean if there are less than four previous profile runs. Insights are not generated if there are no previous profile runs.
Percentage of completeness variation = (Current Mean - Previous Mean) / Previous Mean * 100
  • If the previous mean value is zero, then the completeness variation percentage increases to 100%.
  • If the completeness variation percentage is negative, then insights are not generated.
The following values are considered as null values for data types:
  • Integer - 0 and null values
  • Decimal - 0.0 and null values
  • String - Blank or empty string and null values
  • Date - Null values
  • Timestamp - Null values

    [0 to 80] - OK

    (80 to 90] - Medium

    (90 to ∞) - High

Distinct Variation
Checks if there is more than 70% increase on the number of distinct values in the column between the current profile run and the truncated mean of the last five profile runs, discarding the lowest and highest values. Uses the actual mean if there are less than four previous profile runs. Insights are not generated if there are no previous profile runs.
Percentage of distinct variation = (Current Mean - Previous Mean) / Previous Mean * 100
If the previous mean value is zero, then the distinct variation percentage increases to +∞. If the distinct variation percentage is negative, then insights are not generated.

    (-∞ to 70] - OK

    (70 to 90] - Low

    (90 to 200] - Medium

    (200 to +∞) - High

MinMax Variance
Checks if there is more than 70% increase on the difference between the minimum values and the maximum values in the column when compared to the previous profile run.
CLAIRE does not consider columns for insight recommendations in the following scenarios:
  • Difference between the minimum values and the maximum values in the column when compared to the previous profile run decreases.
  • Sources that have less than 1000 rows.
  • Columns that transition from 100% null to values.
Percentage of min max variation = (Delta Current - Delta Previous) / Delta Previous * 100
Where:
  • Delta Previous = Maximum value in the first run - Minimum value in the first run
  • Delta Current = Maximum value in the second run - Minimum value in the second run
For example, the following are the minimum and maximum values in the po_create_date column for two profile runs:
  • Previous run: Minimum = 01/01/1998, Maximum = 03/03/2013
  • Current run: Minimum = 02/01/2003, Maximum = 12/07/2025
Delta Previous = 5540 days
Delta Current = 8345 days
Percentage of min max variation = (8345 - 5540) / 5540 = 50.6%
The 50.6% score interprets the data anomaly for the column as OK.

    [0 to 70] - OK

    (70 to 100] - Medium

    (100 to ∞) - High

Top Pattern Stability
Checks if the top pattern with ≥30% compliance decreases by a large amount in comparison to the previous profile run. A large decrease may indicate that shape of data changed more than expected. The decrease is measured as a negative number computed using the following formula:
CurrentPercent - PreviousPercent / PreviousPercent * 100
The insight considers columns that contain a major pattern in the previous run. The same filter must be used for the both runs

    (-99, -70] - High

    (-70, -60] - Medium

    (-60, -30] Low

    (-30, 0] - OK

Spelling Analysis
Creates a fingerprint for each string value and compares the number of non-null unique fingerprints to the number of non-null values. CLAIRE runs the insight if the difference as a percentage is too high, which indicates several misspellings.
To qualify, the top 80% patterns must contain only letters (X) and up to 3 spaces and hyphens. This is to accommodate names.
The insights get generated if 95% of the value frequencies in the values have five or more characters.

    [0,0.5] - OK

    (0.5,1] - Low

    (1, 2] - Medium

    (2,100] - High

Distribution Shift
Tracks the mean and standard distribution of values over four or more profiles. The expectation is either the mean and standard deviation remains constant or shifts consistently up or down. For example, a table containing population size information that might shift consistently up or down at the same rate.
This insight type is applicable for columns with any of the following data types:
  • Integer
  • Decimal

    [0,2] - OK

    (2,3] - Medium

    (3,∞] - High

Column Token Deviation
The number of tokens in the value frequency string values that fall more than two standard deviations from the mean. A token is any sequence of alpha-numeric characters separated by white space and the following special characters: . , / -.

    0 - OK

    (0,1] - Low

    (1, 5] - Medium

    (5,100] - High

Special Characters
Checks data for special characters that are not included in the top 80% of the patterns. CLAIRE considers this data anomalous. Additionally, CLAIRE does not consider the string data types when Data Profiling infers the numeric data type such as decimal, integer, or float as 100%.

    0 - OK

    (0,1] - Low

    (1, 3] - Medium

    (3,100] - High

Null Date Analysis
Checks string data type columns that might include one of all the zeros or nine values from a default date pattern. The insight type is applicable for columns of string data type.
If a string data type column contains all of the zeros and nines from the default date pattern, the insight considers the values as invalid. For example,
  • 0000-00-00 or 9999-99-99 (year-month-day or year-day-month)
  • 00/00/0000 or 99/99/9999 (month/day/year or day/month/year)
  • 00000000 or 9999999 (YYYYMMDD)
If a string data type column contains a valid date, month, or year part from the default date pattern, the insight considers the values as valid. For example,
  • 21/99/9999
  • 99/02/9999
  • 99/99/1994
The insight also considers a NULL value as a valid date pattern.

    0 - OK

    (0,1] - Low

    (1,2] - Medium

    (2,100] - High


0 COMMENTS

We’d like to hear from you!