The nature of applications requiring name searching and matching varies considerably, as does the relative importance of ’name search’ between different applications and users.
In the past, this ’name search’ problem has been studied and researched for some very restricted application areas where successful retrieval was critical (e.g. law enforcement searches). The more general approach to solving the name search problem has been for systems designers and analysts to design their own solutions typically using methodologies such as exact match alpha key, soundex, match-code, wild-card and text retrieval, and to apply the same solution to all system areas requiring name search.
Each of these methodologies we will call a name search ’Algorithm’.
The growing duplication of records in databases, the increasing frustration of customer service operators at slow or unreliable name searches, and worsening fraud problems based on name or address variations, all point to the fact that most of these Algorithms are, alone, not adequate for today’s volumes of data and nature of society.
The reasons are many. As soon as a system requires a search by name, the designers and eventually the users start encountering some or all of the following problems:
Errors made in spelling the spoken name.
Transcription errors for written names.
Missing first names or initials.
Mixed usage of first names and initials.
Nick-names, abbreviations, synonyms, and unintentional concatenation or splitting of names.
Extra words and word sequence variations.
Growing multiculturalism bringing more and more names and name structures which are not easily recognizable by the ’locals’.
Failures to find all parts of compound or account names.
Anglicization (localization) of names, causing variation between the formal name, as on a driver’s license, and informal names on other documentation.
The problems created by the frequent use of certain common last and first names.
If the application has any significant volume of data at all the following problems will arise:
Length of response time of the system before an answer is available.
The system eliminating relevant names or, on the other hand, showing too many names for the user to make a choice.
If the problem is of special concern or is researched fully, the following points are often encountered:
The design of dialogues so that neither the operator nor the system comes too quickly to the conclusion that there is no relevant match (e.g. sheer volume can cause data to be missed).
That increasing the width of the search to allow for more error significantly aggravates the response time and performance problem.
That progressive refinement of the system by addressing special cases introduces undiscovered problems elsewhere and progressively degrades the system.
That the system’s name rules cannot be changed unless all files are fully reprocessed according to the new rules.
That integration of data from different systems into an integrated search leads to new frustration for users because of variations in the name handling.
That a change in the Name search algorithm may improve overall performance and quality while achieving less success for certain previously satisfied special cases.
The On-Line Response Time Problem
An important characteristic of a name search is the time it takes before the search Algorithm presents a good candidate to the user.
The ’on-line response’ performance of a name search Algorithm is an important concept. An Algorithm that analyses thousands of records and, after a long period, supplies a small group of candidates to a user is usually less acceptable than an Algorithm that can rapidly supply the user with a few highly probable candidates, but takes quite a long time to display the low probability candidates.
An example of the difficulty with this aspect can be seen in those algorithms that provide an exact match as a fast path to the file entries. While there is certainly a fast response for some matching records, many other records that are just as significant from a business point of view (e.g. those with minor variations only) are not available unless the rest of the algorithm is used. This exact match, with its quick response, often leads to the other entries being ignored.
The important aspect to consider is that the algorithm used must not force users to either extreme. The algorithm should allow rapid access to records of a particular significance. Each search dialogue that is designed should be able to take any level or depth of search that is relevant to the problem and should not be constrained by the algorithm.
Most implementations of name search algorithms do not provide multiple levels or depths of search from a physical access point of view. Such algorithms provide one ’group code’, ’coded name’ or ’phonetic key’ that is used to select or access a ’bucket’ of file entries. This whole group or bucket is then analyzed to choose what to show to the user. Of course, this group can be ’scored’ to achieve a particular probability to decide what to show the user. However, if the first level or depth of search is inadequate the whole group or bucket is reprocessed to provide the next level. With large volumes these buckets or groups are themselves very large.
Good algorithms will actually allow buckets or groups to be subdivided based upon the concept of level, depth or probability such that only the records of appropriate level, depth or probability are physically accessed.
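As a rough illustration of the subdivided bucket idea, the sketch below stores names in a three-level tree so that a search reads the exact-spelling sub-bucket first and only touches wider sub-buckets if the caller keeps asking for more. The names `skeleton`, `make_keys` and `TieredIndex` are hypothetical and invented for this example, and the keys are deliberately crude; they are not the keys of any particular product.

```python
from collections import defaultdict

def skeleton(name):
    """Hypothetical mid-level key: first letter plus the remaining consonants."""
    return name[0] + "".join(c for c in name[1:] if c not in "AEIOUY")

def make_keys(name):
    """Return (wide, mid, exact) keys. Assumes the name has at least one letter."""
    name = "".join(c for c in name.upper() if c.isalpha())
    return name[0], skeleton(name), name

class TieredIndex:
    def __init__(self):
        # wide key -> mid key -> exact key -> list of record ids
        self.tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

    def add(self, record_id, name):
        wide, mid, exact = make_keys(name)
        self.tree[wide][mid][exact].append(record_id)

    def search(self, name):
        """Yield (level, record_id) pairs, narrowest level first.

        Because this is a generator, a sub-bucket is only read when the
        caller has consumed everything from the narrower levels.
        """
        wide, mid, exact = make_keys(name)
        mids = self.tree.get(wide, {})
        exacts = mids.get(mid, {})
        # Level 0: exact spelling.
        for rid in exacts.get(exact, []):
            yield 0, rid
        # Level 1: same consonant skeleton, different spelling.
        for key, ids in exacts.items():
            if key != exact:
                for rid in ids:
                    yield 1, rid
        # Level 2: same first letter, different skeleton.
        for mid_key, sub in mids.items():
            if mid_key != mid:
                for ids in sub.values():
                    for rid in ids:
                        yield 2, rid
```

A dialogue that only needs the most probable candidates can stop consuming the generator after level 0, while a deeper investigation can keep reading; in neither case is a whole bucket reprocessed.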
The Name Distribution Problem
The most confusing and aggravating characteristic of files of names is the unusual distribution of the actual names. It is common knowledge that there are a few family names that encompass large groups of each population. It is also common knowledge that this is so for given names.
What is less obvious is how extreme this distribution really is.
It is not unusual to find several common surnames (e.g. SMITH or WILLIAMS) in a population of file entries where each accounts for in excess of 1% of the population (thus on a 5,000,000 record file the group with that surname may exceed 50,000 entries).
What is not usually realized is how devastatingly important this fact is: not only is the file distributed in such a skewed fashion, but the queries or searches will usually be distributed in the same way. That is, in excess of 1% of the searches can be on one common surname.
Extending this observation, in the above example it is usual for 10% of all searches to access a surname group of at least 25,000 entries. One can imagine how easy it is to bias the design of an algorithm towards the ’common names’ area.
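The arithmetic behind these figures can be sketched as follows. If searches are distributed like the file, a surname holding share p of the records is also searched with probability p, so the expected size of the group a search lands in is the file size multiplied by the sum of the squared shares. The distribution used here (10 surnames at 1% each, the rest spread evenly over 90,000 rare surnames) is a hypothetical one chosen only to make the numbers easy to follow; the function names are likewise illustrative.

```python
def group_size(file_size, share):
    """Number of records held by a surname with the given share of the file."""
    return round(file_size * share)

def expected_group_hit(file_size, shares):
    """Expected size of the surname group a search lands in,
    assuming searches are distributed exactly like the file itself."""
    return file_size * sum(p * p for p in shares)

# Hypothetical distribution: 10 common surnames at 1% each, and the
# remaining 90% spread evenly over 90,000 rare surnames.
FILE_SIZE = 5_000_000
SHARES = [0.01] * 10 + [0.9 / 90_000] * 90_000

common_group = group_size(FILE_SIZE, 0.01)           # 50,000 records
mean_group = FILE_SIZE / len(SHARES)                 # about 56 records
typical_hit = expected_group_hit(FILE_SIZE, SHARES)  # about 5,000 records
```

Under these assumptions the average surname group holds only a few dozen records, yet the group an average search lands in holds thousands, because the common names are searched far more often than the rare ones.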
Conversely, the distribution has an enormous ’tail’ of very uncommon names, each held by very few members of the population. If the algorithm design is biased towards performance for this ’tail’, it usually also aggravates the problem for common names.
In fact, algorithms that are badly formulated often confuse a large percentage of the uncommon names with the common names they were derived from.
This distribution problem is not as stable as most designers would imagine. Within a particular country the name distribution characteristics may be stable, but imagine a system specializing in Vietnamese migrants, where 30% of the population holds one surname and another 15% holds another (i.e. 45% of the population is covered by two surnames).
The most successful name handling algorithms have to be aware of or designed for a specific population of names.
The Variation Problem
There are many reasons why two reports covering the same individual person (or organization or address) end up with differing variations of the person’s name stored in the system. Understanding these variations will lead to an appreciation of the ’search’ problem.
Phonetic Variation
Where names are spoken, especially over a radio or telephone, a whole class of variations in the spelling can occur. This is usually referred to as a phonetic problem and the recognition of its existence leads to such original algorithms as PHONIC and SOUNDEX.
This phonetic variation is itself compounded by the fact that even when a name is spelled out by saying the letters, a degree of phonetic confusion can still occur.
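For readers unfamiliar with phonetic keys, the following is a minimal sketch of the widely published American Soundex coding, which is assumed here to be the kind of algorithm meant by SOUNDEX; actual implementations differ in detail, and the function name `soundex` is simply the one chosen for this example.

```python
# Letter groups of the commonly published American Soundex table.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return a 4-character key: the first letter plus up to three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue              # H and W do not separate equal codes
        code = CODES.get(ch, "")  # vowels (and Y) have no code
        if code and code != prev:
            digits.append(code)
        prev = code               # a vowel resets prev, so a repeat after it counts
    return (first + "".join(digits) + "000")[:4]
```

With this sketch SMITH, SMYTH and SMITHY all receive the key S530. The first letter is kept as it is, however, so CATHERINE and KATHERINE receive different keys, which illustrates why a single phonetic key does not cover every variation.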
The false presumption of many algorithms is that the phonetic problem is in its own right the major variation.
There is in fact some evidence to suggest that phonetics accounts for less than 25% of the variations.
Subconscious Correction
Probably one of the most common reasons for variation is the automatic or subconscious "correction" of names whose sounds or letter combinations are very similar to those of common names. For example, SMITHY becomes SMITH; WILLIAM as a family name becomes WILLIAMS.
Such variations are often well handled by ’phonetic’ algorithms.
Orthographic Variation
A significant amount of error can occur when transcribing names from paper to paper or to a computer terminal. This type of error is often mechanical and can be keyboard dependent (e.g. R instead of E on a QWERTY keyboard). It is also often a mental one, as in transposition or truncation (e.g. beth becomes beht).
However, the major form of this error is the substitution of a graphically similar letter in handwriting (e.g. G for Q, S for Z, or M for N).
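The kinds of orthographic error described above can be told apart mechanically, at least in simple cases. The sketch below classifies the difference between an intended and a recorded spelling; the lookup tables are tiny, hypothetical ones built only from the examples in this section, adjacency is checked within a keyboard row only, and the function names are illustrative.

```python
# Hypothetical, deliberately tiny lookup tables for illustration only.
QWERTY_ROWS = ("QWERTYUIOP", "ASDFGHJKL", "ZXCVBNM")
LOOKALIKE = {frozenset("GQ"), frozenset("SZ"), frozenset("MN")}

def keyboard_adjacent(a, b):
    """True if two letters sit side by side on the same QWERTY row."""
    for row in QWERTY_ROWS:
        if a in row and b in row:
            return abs(row.index(a) - row.index(b)) == 1
    return False

def classify(intended, recorded):
    """Name the simplest orthographic error that turns one spelling into the other."""
    a, b = intended.upper(), recorded.upper()
    if a == b:
        return "same"
    if len(a) == len(b):
        diff = [i for i in range(len(a)) if a[i] != b[i]]
        if len(diff) == 1:
            x, y = a[diff[0]], b[diff[0]]
            # Handwriting look-alikes are checked first; M and N are also
            # keyboard neighbours, so the two causes cannot always be separated.
            if frozenset((x, y)) in LOOKALIKE:
                return "graphic substitution"
            if keyboard_adjacent(x, y):
                return "keyboard slip"
            return "substitution"
        if (len(diff) == 2 and diff[1] == diff[0] + 1
                and a[diff[0]] == b[diff[1]] and a[diff[1]] == b[diff[0]]):
            return "transposition"
    if a.startswith(b):
        return "truncation"
    return "other"
```

For example, `classify("beth", "beht")` reports a transposition and `classify("PETER", "PRTER")` reports a keyboard slip. A search algorithm does not need to name the error, but it does need to tolerate each of these classes.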
Real Variation
A possible but usually low volume problem is associated with name changes, the familiar ones being those associated with marriage and divorce.
The most common and, fortunately, addressable problem is Anglicization (more generally, localization into the local language, style or dialect). In populations where foreign migrants are frequently introduced it is normal to adjust the pronunciation and then the spelling of a foreign name to fit local conventions.
Sequence Variation
A large class of variation arises from the fact that several words can be used to make up either a surname or a person’s given names.
In some cases words are left out, especially middle names. In other cases they are re-sequenced. In certain cases, the set of words used is a choice from one of two or three subsets of a group of name words. This is typical of names given in cultures where it is normal to adopt new legal given names at puberty or on coming of age (e.g. Papua New Guinea) or in Western style countries where eastern faiths are common (e.g. Fiji).
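One simple way to tolerate missing and re-sequenced words is to compare names as collections of words rather than as strings. The sketch below does this and also treats a single letter as an initial; the function names and the acceptance rule in `same_up_to_sequence` (at least two matching words, and one name fully contained in the other) are assumptions made for this example, not a recommended threshold.

```python
def words(name):
    """Split a name into upper-case words, dropping punctuation."""
    cleaned = "".join(c if c.isalpha() else " " for c in name.upper())
    return cleaned.split()

def word_matches(x, y):
    """Two words match if they are equal, or one is an initial of the other."""
    if x == y:
        return True
    short, long_ = sorted((x, y), key=len)
    return len(short) == 1 and long_.startswith(short)

def compare(a, b):
    """Return (matched, unmatched_in_a, unmatched_in_b), ignoring word order.

    Matching is greedy: each word of a takes the first free match in b.
    """
    left, unmatched = words(a), words(b)
    matched = 0
    for w in left:
        for cand in unmatched:
            if word_matches(w, cand):
                unmatched.remove(cand)  # safe: we stop iterating straight away
                matched += 1
                break
    return matched, len(left) - matched, len(unmatched)

def same_up_to_sequence(a, b):
    """Hypothetical rule: two or more words agree and one name contains the other."""
    matched, extra_a, extra_b = compare(a, b)
    return matched >= 2 and (extra_a == 0 or extra_b == 0)
```

Under this rule "William James Andrews" and "Andrews, William" are accepted as the same name, as are "Joseph James" and "James Joseph", while "Joseph James" and "Joseph Andrews" are not. The last two pairs show both the benefit and the risk: ignoring sequence also ignores which word was meant as the family name.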
One of the most complex cases encountered is that where identification of the family name is difficult.
This can arise for many reasons, not the least of which is frequently used names that can be either family or given names (e.g. William Andrews or Joseph James).
There are also several populations where the practice is to create compound family names out of both parents’ family names. In certain Spanish, Portuguese and Far Eastern countries this problem is aggravated by the fact that different members of the same family use different sequences when referring to the same individual.