Application and Database Design Guide

Cleaning, Scrubbing and Rejecting Invalid Data

There is a sound business reason for making sure that the data captured and stored in computer systems has the maximum value to the business.
There is also great value in having data in an easy-to-process shape.
However, there is little value in easy-to-process data that has been rendered untrue by the shape it was transformed into. The thrust behind Cleaning, Scrubbing and Transforming data once it is in a computer system, and the thrust that says let's store only "valid" data, suffer from a number of pitfalls:
  • Much of the data about transactions is legally necessary and cannot be changed without the customer's approval;
  • Statistical techniques for enhancing data are "good for statistics" but introduce error that is destructive for matching;
  • Many cleaning techniques are not reversible: for example, changing Bobby to Robert, or changing St to Street when it could equally be Saint;
  • Users believe that the transformed data is true, and base decisions on it;
  • Merging two records can lead to loss of data if you later find the merge was in error and the records should be split;
  • Rejecting invalid data simply means it cannot be used for anything, and all business value of that data is lost;
  • Cleaning projects conducted by people suffer from the same inconsistencies that arise in all other human-entered data.
The problem is that the definition of "cleaning", as applied to identity data, is subjective and prone to error:
  • if a name is input to a system as "MIKE SMITH", should that name be cleaned to "MICHAEL SMITH" prior to storage, or do the person’s identity documents say "MIKE SMITH"?
  • if the name was input as "SMITH MICHAEL", should cleaning reverse the order, or was the order, however rare, actually correct?
  • if an address is input as "40 MARINA VIEW ST HUBERTS" should "ST" be cleaned to "STREET", or could it be "SAINT"?
Cleaning data prior to its entry into a Data Warehouse, in the interests of cross-system standardization, is only safe when no assumptions need to be made about the data values.
Once the original value is lost, there may be no history from which to reconstruct it if the decision was wrong. For maximum truth in search, matching and identification, work with, and keep, the original data in its real-world format as originally entered.
When moving an obviously erroneous date field like 11/33/1967 into a data warehouse, or off an application form into a database, unless it is possible to go back to the real-world source and obtain the true value, the best you can do is store the data exactly as it arrived. From a search, matching or identification point of view, a later transaction for a similarly named record with a date of 1/3/1967, 11/3/1967, 11/13/1967 or 11/23/1967 can easily be matched to the above date. If such an invalid date is left out of the data warehouse, or converted to a "null" or "unknown" value, the value of the data is entirely lost.
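The matching argument above can be sketched in code. This is a hypothetical illustration, not SSA-NAME3's actual algorithm: a raw date string stored as-is, even one with an "invalid" day such as 33, can still be compared component by component against later transactions instead of being discarded.

```python
# Hypothetical sketch: component-wise comparison of raw M/D/Y date strings,
# so an "invalid" stored date such as 11/33/1967 can still score against
# later transactions instead of being lost to the warehouse.

def date_components(raw):
    """Split a raw M/D/Y date string into its numeric parts, or None."""
    parts = raw.split("/")
    return [int(p) for p in parts] if len(parts) == 3 else None

def date_similarity(a, b):
    """Count how many of month, day and year agree between two raw dates.
    Out-of-range values (e.g. day 33) still participate in the comparison."""
    ca, cb = date_components(a), date_components(b)
    if ca is None or cb is None:
        return 0
    return sum(1 for x, y in zip(ca, cb) if x == y)

# The invalid date, stored exactly as it arrived, still matches closely:
print(date_similarity("11/33/1967", "11/3/1967"))  # month and year agree -> 2
print(date_similarity("11/33/1967", "1/3/1967"))   # only the year agrees -> 1
```

Had 11/33/1967 been rejected or nulled on entry, both comparisons would score zero and the record's matching value would be gone.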
Aggressive validation of such data simply leads to users inventing "valid" data. If you do not believe this, run a histogram on the date fields in a database: you are likely to find that the 1st of the month is abnormally common, and in really old data you may even find that 01/01/99 or even 99/99/99 is common. As error and variation are quite normal in real-world data, systems must be designed to work with this error and variation. What is abnormal is the attempt to define away or clean this data.
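The histogram check suggested above is a one-liner in practice. The sample dates below are invented for illustration; on real data, a spike on day 01 or on sentinel values like 01/01/99 is the tell-tale sign of users working around aggressive validation.

```python
# Illustrative sketch (invented sample data): a day-of-month histogram
# over raw M/D/Y date strings can expose invented "valid" dates.
from collections import Counter

dates = ["01/01/1999", "07/01/1985", "12/01/1999", "03/15/1972",
         "09/01/1990", "11/33/1967", "01/01/1999", "06/22/1988"]

day_counts = Counter(d.split("/")[1] for d in dates)
for day, count in sorted(day_counts.items()):
    print(f"day {day}: {count}")  # day 01 dominates this sample
```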
Making any substantial change to the identity data stored in a system, without also keeping its original form, is counter-productive to future needs for that data. Storing identity data in essentially the same form as supplied by the identity owner is a safe decision. For business functions that require it, storing an additional "cleaned" copy of that data, or cleaning the data specifically as part of the business function, is good design.
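The "keep the original, add a cleaned copy" design can be sketched as follows. The field and class names here are illustrative assumptions, not an SSA-NAME3 schema: the point is simply that the as-entered value is never overwritten, and any cleaned form lives alongside it for the business functions that need it.

```python
# Minimal sketch of keeping the original identity value intact while
# optionally carrying a cleaned copy for specific business functions.
# Names are hypothetical, not taken from any product schema.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class IdentityField:
    original: str                  # exactly as supplied by the identity owner
    cleaned: Optional[str] = None  # optional transformed copy, never a replacement

    def for_display(self) -> str:
        """A reporting function may prefer the cleaned copy when one exists."""
        return self.cleaned if self.cleaned is not None else self.original

name = IdentityField(original="MIKE SMITH", cleaned="MICHAEL SMITH")
print(name.original)       # search and matching work on the real data
print(name.for_display())  # display or reporting may use the cleaned copy
```

Making the record immutable (`frozen=True`) reflects the design choice in the text: cleaning is additive, and the original is never lost.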
SSA-NAME3 is designed to accept the error and variation inherent in real world data and provide the ability for your search application to use real, uncleaned and unscrubbed data.