Data Profiling
- Data Profiling
- All Products
Property
| Description
|
---|---|
Insight Statement
| Description or statement that explains the inferred insight.
|
Score
| Shows the following scores for the inferred insights:
|
Insight Type
| Shows the following types of insights:
|
Columns
| Column name for which the insight is relevant.
|
Status
| Status of the insight. When insights are generated for the first time, the status appears blank.
|
Insight Type
| Algorithm
| Score Interpretation
|
---|---|---|
Completeness Check
| Computes the percentage of total rows with null values, blank values, empty values, or values that contain only zeros in a column.
This insight type is applicable for columns with any of the following data types:
| 0 - OK
92 to 100 - OK
0 to 3 - Low
3 to 5 - Medium
5 to 8 - High
|
Uniqueness Check
| Computes the percentage of non-unique rows based on the following formula:
Percentage of non-unique rows = (Total Rows - Unique Rows) / Total Rows * 100
Insights are generated if the computed percentage of non-unique rows is less than 3%.
If a column contains one or more null values, then the insight is not generated.
This insight type is applicable for columns with any of the following data types:
| 0 - OK
0 to 2 - High
2 to 3 - Low
|
Column Length Deviation
| Computes the length of alphanumeric values or numeric values on value frequency that falls more than two times the standard deviation from the mean value.
This insight type is applicable for columns with any of the following data types:
| 0 - OK
0 to 1 - Low
1 to 5 - Medium
5 - High
|
Number Value Distribution
| Computes the percentage of value frequency values in relation to the total number of rows profiled that falls more than two times the standard deviation or falls out of 95% of the mean value.
This insight type is applicable for columns with any of the following data types:
| 0 - OK
15 to 100 - OK
0 to 1 - Low
1 to 5 - Medium
5 to15 - High
The score can never be 100%.
|
Date Validity Check
| Checks only for columns that have Date as the inferred data type and computes the percentage of values with dates that do not comply with a valid date pattern.
Null values are ignored in the computation.
This insight type is applicable for columns that have String with date content as the data type.
| 0 - OK
0 to 5 - Low
5 to 10 - Medium
10 - High
|
Date-Locale Check
| Checks for columns that have one or more date values that do not match the locale format. Computes the number of values that follow different date locale formats.
This insight type is applicable for columns with String data type.
| 1 - OK
2 - Medium
3 - High
|
Day-Date Distribution
| Extracts the day for the dates on the value frequency and calculates the mean and standard deviation. Computes the dates where the days fall over two times the standard deviation or falls out of 95% of the mean value.
This insight type is applicable for columns with any of the following data types:
| 0 - OK
15 to 100 - OK
0 to 1 - Low
1 to 5 - Medium
5 to15 - High
The score can never be 100%.
|
Month-Date Distribution
| Extracts the month for the dates on the value frequency and calculates the mean and standard deviation. Computes the dates where the days fall over two times the standard deviation or falls out of 95% of the mean value.
This insight type is applicable for columns with any of the following data types:
| 0 - OK
15 to 100 - OK
0 to 1 - Low
1 to 5 - Medium
5 to15 - High
The score can never be 100%.
|
Year-Date Distribution
| Extracts the year for the dates on the value frequency and calculates the mean and standard deviation. Computes the dates where the days fall over two times the standard deviation or falls out of 95% of the mean value.
This insight type is applicable for columns with any of the following data types:
| 0 - OK
15 to 100 - OK
0 to 1 - Low
1 to 5 - Medium
5 to15 - High
The score can never be 100%.
|
Completeness Variation
| Computes the variation on the number of null values and the values that contain only zeros in the column between the current profile run and the truncated mean of the last five profile runs, discarding the lowest and highest values. Uses the actual mean if there are less than four previous profile runs. Insights are not generated if there are no previous profile runs.
Percentage of completeness variation = (Current Mean - Previous Mean) / Previous Mean * 100
The following values are considered as null values for data types:
| [0 to 80] - OK
(80 to 90] - Medium
(90 to ∞) - High
|
Distinct Variation
| Checks if there is more than 70% increase on the number of distinct values in the column between the current profile run and the truncated mean of the last five profile runs, discarding the lowest and highest values. Uses the actual mean if there are less than four previous profile runs. Insights are not generated if there are no previous profile runs.
Percentage of distinct variation = (Current Mean - Previous Mean) / Previous Mean * 100
If the previous mean value is zero, then the distinct variation percentage increases to +∞. If the distinct variation percentage is negative, then insights are not generated.
| (-∞ to 70] - OK
(70 to 90] - Low
(90 to 200] - Medium
(200 to +∞) - High
|
MinMax Variance
| Checks if there is more than 70% increase on the difference between the minimum values and the maximum values in the column when compared to the previous profile run.
CLAIRE does not consider columns for insight recommendations in the following scenarios:
Percentage of min max variation = (Delta Current - Delta Previous) / Delta Previous * 100
Where:
For example, the following are the minimum and maximum values in the po_create_date column for two profile runs:
Delta Previous = 5540 days
Delta Current = 8345 days
Percentage of min max variation = (8345 - 5540) / 5540 = 50.6%
The 50.6% score interprets the data anomaly for the column as OK.
| [0 to 70] - OK
(70 to 100] - Medium
(100 to ∞) - High
|
Top Pattern Stability
| Checks if the top pattern with ≥30% compliance decreases by a large amount in comparison to the previous profile run. A large decrease may indicate that shape of data changed more than expected. The decrease is measured as a negative number computed using the following formula:
CurrentPercent - PreviousPercent / PreviousPercent * 100
The insight considers columns that contain a major pattern in the previous run. The same filter must be used for the both runs
| (-99, -70] - High
(-70, -60] - Medium
(-60, -30] Low
(-30, 0] - OK
|
Spelling Analysis
| Creates a fingerprint for each string value and compares the number of non-null unique fingerprints to the number of non-null values. CLAIRE runs the insight if the difference as a percentage is too high, which indicates several misspellings.
To qualify, the top 80% patterns must contain only letters (X) and up to 3 spaces and hyphens. This is to accommodate names.
The insights get generated if 95% of the value frequencies in the values have five or more characters.
| [0,0.5] - OK
(0.5,1] - Low
(1, 2] - Medium
(2,100] - High
|
Distribution Shift
| Tracks the mean and standard distribution of values over four or more profiles. The expectation is either the mean and standard deviation remains constant or shifts consistently up or down. For example, a table containing population size information that might shift consistently up or down at the same rate.
This insight type is applicable for columns with any of the following data types:
| [0,2] - OK
(2,3] - Medium
(3,∞] - High
|
Column Token Deviation
| The number of tokens in the value frequency string values that fall more than two standard deviations from the mean. A token is any sequence of alpha-numeric characters separated by white space and the following special characters: . , / -.
| 0 - OK
(0,1] - Low
(1, 5] - Medium
(5,100] - High
|
Special Characters
| Checks data for special characters that are not included in the top 80% of the patterns. CLAIRE considers this data anomalous. Additionally, CLAIRE does not consider the string data types when Data Profiling infers the numeric data type such as decimal, integer, or float as 100%.
| 0 - OK
(0,1] - Low
(1, 3] - Medium
(3,100] - High
|
Null Date Analysis
| Checks string data type columns that might include one of all the zeros or nine values from a default date pattern. The insight type is applicable for columns of string data type.
If a string data type column contains all of the zeros and nines from the default date pattern, the insight considers the values as invalid. For example,
If a string data type column contains a valid date, month, or year part from the default date pattern, the insight considers the values as valid. For example,
The insight also considers a NULL value as a valid date pattern.
| 0 - OK
(0,1] - Low
(1,2] - Medium
(2,100] - High
|