There are many reasons why two reports covering the same individual person, organization, or identity end up with conflicting data stored in the system. Understanding the source of these variations will lead to an appreciation of the "search" problem.
Where names are spoken, especially over a radio or telephone, a whole class of spelling variations can occur. This is usually referred to as a phonetic problem, and the recognition of its existence led to the use of historic algorithms such as SOUNDEX or those based simply on PHONETICS.
This phonetic variation is itself compounded by the fact that even when a name is spelled out by saying the letters, a degree of phonetic confusion can still occur.
The false presumption of many algorithms is that the phonetic problem is in its own right the major variation. There is in fact evidence to suggest that phonetics accounts for less than 25% of the variations.
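For reference, the classic Soundex scheme mentioned above can be sketched as follows. This is a minimal version of the commonly described American Soundex rules, not anything specific to SSA-NAME3; the names `SOUNDEX_CODES` and `soundex` are illustrative only.

```python
# Letter-to-digit table of the commonly described American Soundex scheme.
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str) -> str:
    """Return a 4-character Soundex code, or "" if the name has no letters."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = SOUNDEX_CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (first + "".join(digits) + "000")[:4]
```

With this sketch, SMITH, SMYTH and SMITHE all receive the code S530, so the phonetic variation is absorbed. KATHRYN and CATHERINE, however, receive K365 and C365, because the first letter is kept verbatim. A purely phonetic key therefore handles only part of the variation.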
A common reason for variation has to do with automatic or subconscious "correction" of names that are "familiar" to the user or that are very similar to common names. For example, SMITHE becomes SMITH, and WILLIAM at the end of a name becomes WILLIAMS.
A significant amount of error can occur when transcribing names from paper to paper or to a computer terminal. This error is often a mental one, as in transposition (for example, Beth becomes Beht) or truncation, or it can be keyboard dependent (for example, R instead of E on the QWERTY keyboard).
Another form of this error is the substitution of a graphically similar letter when reading handwriting. For example, G for Q, S for Z, or M for N.
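The transcription errors described above can be told apart mechanically. The following is a rough sketch that classifies the difference between two equal-length spellings; the neighbour table covers only keys that sit side by side on the same QWERTY row, the look-alike pairs are just the three from the text, and all names are hypothetical.

```python
QWERTY_ROWS = ("QWERTYUIOP", "ASDFGHJKL", "ZXCVBNM")

# Pairs of keys that sit side by side on the same row (a simplification).
KEY_NEIGHBOURS = {
    frozenset(pair) for row in QWERTY_ROWS for pair in zip(row, row[1:])
}

# Graphically similar letters taken from the handwriting examples above.
LOOKALIKES = {frozenset("GQ"), frozenset("SZ"), frozenset("MN")}


def classify_variation(a: str, b: str) -> str:
    """Classify how spelling b differs from spelling a."""
    a, b = a.upper(), b.upper()
    if a == b:
        return "identical"
    if len(a) != len(b):
        return "other"
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diffs) == 2:
        i, j = diffs
        if j == i + 1 and a[i] == b[j] and a[j] == b[i]:
            return "transposition"
    if len(diffs) == 1:
        pair = frozenset((a[diffs[0]], b[diffs[0]]))
        if pair in LOOKALIKES:  # checked first: M/N are also key neighbours
            return "lookalike"
        if pair in KEY_NEIGHBOURS:
            return "keyboard"
    return "other"
```

For example, `classify_variation("Beth", "Beht")` reports a transposition and `classify_variation("Peter", "Prter")` reports a keyboard slip. Truncation and other length-changing errors fall into the "other" class in this sketch.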
Real World Variation
Many people have and use a nickname form of their first given name. In some cases, it is completely arbitrary as to when and how it is used. In other cases, the nickname will be used where less formality is required (for example, entering a contest) and the full first name where formality is expected (for example, applying for a driver’s license).
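One simple way to reconcile nickname and formal forms is a lookup table. The sketch below assumes a tiny, hypothetical table; a real table would be far larger, and many nicknames map to more than one formal name, which this sketch does not handle.

```python
# Hypothetical nickname table, for illustration only.
NICKNAMES = {
    "BILL": "WILLIAM",
    "BOB": "ROBERT",
    "LIZ": "ELIZABETH",
}


def formal_given_name(given_name: str) -> str:
    """Map a nickname to its formal form; unknown names pass through."""
    key = given_name.strip().upper()
    return NICKNAMES.get(key, key)


def same_given_name(a: str, b: str) -> bool:
    """True if two given names agree once nicknames are resolved."""
    return formal_given_name(a) == formal_given_name(b)
```

Here `same_given_name("Bill", "William")` is true, while an unlisted name is compared as written.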
A difficult problem to address is associated with the common use of multiple family names, or the addition of words over time. The familiar case arises with marriage and divorce. Another common one is associated with children of parents who retain separate family names.
Another problem arises from Anglicization (more generally localization into local language, style or dialect). In populations where people change countries it is common to adjust the pronunciation and then the spelling of a foreign name to fit into local conventions.
Truncation of naming words occurs for a variety of reasons. Examples of truncation include formal abbreviation (Ltd, Inc, St, Rd), informal abbreviation (Mgmt, Intl, Svcs), and use of initials or acronyms. Another common reason for truncation can be traced to the design of a form, screen or database field that was not given enough space to hold all of the data.
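Formal and informal abbreviations can be handled in a similar table-driven way. The following sketch expands a few of the abbreviations listed above before comparison; the table is hypothetical and deliberately omits ambiguous entries such as St, which may stand for more than one word.

```python
# Hypothetical abbreviation table built from the examples in the text.
ABBREVIATIONS = {
    "LTD": "LIMITED",
    "INC": "INCORPORATED",
    "MGMT": "MANAGEMENT",
    "INTL": "INTERNATIONAL",
    "SVCS": "SERVICES",
}


def expand_abbreviations(name: str) -> str:
    """Upper-case a name, drop periods and commas, expand known abbreviations."""
    cleaned = name.upper().replace(".", " ").replace(",", " ")
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in cleaned.split())
```

With this table, "Acme Intl Svcs Ltd." and "Acme International Services Limited" expand to the same string. Truncation caused by a short field cannot be undone this way, because the missing characters are not recoverable from a table.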
Concatenation and Splitting of Words
The concatenation of words can be related to subconscious correction (for example, La Grande becomes Lagrande) or genuine transcription error where the space was not evident. Splitting of words can occur for similar reasons (for example, MacDonald becomes Mac Donald).
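A minimal sketch of a comparison that tolerates concatenation and splitting is shown below: it removes all whitespace and case before comparing. The function names are illustrative.

```python
def squash(name: str) -> str:
    """Upper-case a name and remove all whitespace."""
    return "".join(name.upper().split())


def same_ignoring_spaces(a: str, b: str) -> bool:
    """True if two names differ only in spacing or letter case."""
    return squash(a) == squash(b)
```

This treats La Grande and Lagrande, or MacDonald and Mac Donald, as the same name. It does not address other spelling differences, so MacDonald and McDonald remain distinct.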
A large class of variation arises from the fact that the protocol for saying, requesting or entering a name is not stable. Forms and screens are designed inconsistently. When providing information, we use a mixture of orders for family and given names, and often choose only to give some of the multiple words used to make up either the family name or given names.
In some cases words are left out, especially middle names. In other cases they are re-sequenced. In certain cases, the set of words used is a choice from one of two or three subsets of a group of name words.
Another common case is where a person assumes they recognize a family name and therefore re-sequences the words. This situation arises frequently with names that can be either family or given names (for example, William Andrews or Joseph James), or because a small percentage of the population is given quite unexpected first names like Adler or Brown or last names like Mister or Sister.
There are also several populations where the practice is to create multi-word family names out of both parents’ family names. In certain European and South American communities this problem is exacerbated by the fact that different sequences are formally used by members of the same family when referring to the same individual, and further aggravated in some populations where it is common to also abbreviate last words in family names to an initial.
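The sequence and omission problems described above suggest comparing names as sets of words rather than as ordered strings. The sketch below is one deliberately permissive way to do that, under the assumption that a name is acceptable if its words are a subset of the other name's words; it does not handle words abbreviated to initials, and the function names are hypothetical.

```python
def name_words(name: str) -> frozenset:
    """Return the set of upper-cased words in a name, ignoring order."""
    return frozenset(name.upper().split())


def compatible(a: str, b: str) -> bool:
    """True if one name's words are all contained in the other's."""
    words_a, words_b = name_words(a), name_words(b)
    if not words_a or not words_b:
        return False
    return words_a <= words_b or words_b <= words_a
```

Under this rule William Andrews matches Andrews William, and John Paul Smith matches Smith John, while John Smith and Jane Smith do not match. Because repeated words collapse in a set, this sketch would need refinement for names that repeat a word.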
The Name Distribution Problem
It is common knowledge that there are a few family names that encompass large groups of each population. This is also true for common given names and for the many common words in addresses or product names.
It is less obvious how extreme this distribution really is.
For example, in some English-speaking populations, names such as SMITH or WILLIAMS may each account for in excess of 1% of the population. Thus, on a 5,000,000-record file, the group with that family name may exceed 50,000 entries.
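The arithmetic behind that figure, and a way to measure the skew on an actual file, can be sketched as follows. The function names are illustrative, and the sample data in the usage note is invented purely to show the call.

```python
from collections import Counter


def expected_group_size(file_size: int, share: float) -> int:
    """Records expected for a name holding the given share of the file."""
    return round(file_size * share)


def largest_groups(surnames, top=3):
    """Return (name, count, share of file) for the most common surnames."""
    counts = Counter(s.upper() for s in surnames)
    total = sum(counts.values())
    return [(name, n, n / total) for name, n in counts.most_common(top)]
```

`expected_group_size(5_000_000, 0.01)` gives 50,000, matching the example above. Running `largest_groups` over the surname column of a file shows directly how much of the file the few most common names occupy.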
What is not often realized is that, not only is the file distributed in such a skewed fashion, it is also usually true that the queries or searches will be similarly distributed. That is, in excess of 1% of the searches may include one of the common surnames. One can imagine how tempting it is to bias the design of a key to service these "common names".
Conversely, the distribution has an enormous "tail" of very uncommon names, each shared by only a few members of the population. Many of these names will be unusual, and because of this, it is generally true that they will attract a greater degree of error and variation. If the key design is biased towards achieving reliability for such uncommon name searches, it may lead to performance problems for common name searches.
SSA-NAME3 specifically addresses common words, codes and tokens in a different manner than uncommon data in order to ensure the best possible characteristics at each end of the distribution.
The Online Response Time Problem
An important benchmark of a name search is the response time before the candidates or matches are presented to the user.
The "online response" performance of a name search algorithm is an important concept. An algorithm that analyses thousands of records and, after a long period, supplies a small group of candidates to a user is usually less acceptable than an algorithm that can rapidly supply the user with a few highly probable candidates, but takes quite a long time to display the low probability candidates.
An example of the response time dilemma can be seen in those algorithms that provide an exact match as a fast path to the file entries. While there is certainly a fast response to the exact records, many records that are just as significant from a business point of view are missed. For example, records with minor variations are not available unless the user chooses to widen the search. This exact match with its quick response often leads to the other relevant entries not being discovered, and consequently this will introduce duplicates.
The important aspect to consider is that the algorithm used must not force users to either extreme. The algorithm should allow rapid access to records of a particular significance. Each search dialogue that is designed should be able to take any level or depth of search that is relevant to the problem and should not be constrained by the algorithm.
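The idea of letting the search dialogue choose its own depth can be illustrated with a small sketch. It is not the SSA-NAME3 mechanism; it simply assumes three hypothetical levels (exact, case-insensitive, and case- and space-insensitive) and returns the most probable candidates first. For brevity it rescans the whole list at each level, whereas a real system would use an index per level.

```python
def search_by_level(records, query, max_level):
    """Yield (level, record) pairs, narrowest level first, up to max_level."""
    def squash(s):
        return "".join(s.upper().split())

    tests = [
        lambda r: r == query,                    # level 1: exact
        lambda r: r.upper() == query.upper(),    # level 2: ignore case
        lambda r: squash(r) == squash(query),    # level 3: ignore spacing too
    ]
    seen = set()
    for level, test in enumerate(tests[:max_level], start=1):
        for i, record in enumerate(records):
            if i not in seen and test(record):
                seen.add(i)  # report each record at its narrowest level only
                yield level, record
```

A caller that needs only the fast path asks for level 1; a caller checking for duplicates before inserting a record asks for level 3 and still receives the exact matches first.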
Most implementations of name search algorithms do not provide multiple levels or depths of search from a physical access point of view. Such algorithms provide one "group code", "name code" or "phonetic key" that is used to select or access a "bucket" of file entries. This whole group or bucket is then analyzed to choose what to show to the user. With large volumes these buckets or groups are themselves very large.
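The single-key "bucket" approach described above can be sketched as follows. The key function here is a crude, hypothetical stand-in for a name code or phonetic key (first letter plus the remaining consonants, truncated to four characters), chosen only to keep the example short.

```python
from collections import defaultdict


def crude_key(name: str) -> str:
    """Hypothetical name code: first letter plus following consonants."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    tail = [c for c in letters[1:] if c not in "AEIOUYHW"]
    return (letters[0] + "".join(tail))[:4]


def build_buckets(names):
    """Group every file entry under its single key."""
    buckets = defaultdict(list)
    for name in names:
        buckets[crude_key(name)].append(name)
    return buckets


def search(buckets, query):
    """Return the whole bucket for the query's key; there is no finer level."""
    return buckets.get(crude_key(query), [])
```

Every search for SMITH retrieves the entire bucket containing Smith, Smyth and Smithe, and any further selection has to be made after the whole bucket has been read. On a large file the bucket for a common name can hold tens of thousands of entries, which is the cost the paragraph above describes.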
SSA-NAME3 allows the sets of records found to be subdivided based upon the concept of search sets, such that only the records of appropriate level are physically accessed. It also provides the choice of several predefined "search levels" for optimizing different common needs.