A word used in a Word Rule is a character string without embedded spaces, special characters or other delimiters. It can be a word, initial, number or code. The following Rule Types are supported for words.
Delete
Some words are considered to have no value when appearing in a name, and some actually impede proper identification. These words vary from place to place and sometimes even between applications. In person names, titles and suffixes like Jnr or 1st are good examples; in company names, words like Ltd and Corporation are typical. For this reason it is preferable to remove those words from the name altogether and use only the remaining parts.
Words of type Delete will be deleted from the name and play no part in the building of a key, key range or in matching.
Replace
It is common for some words to appear in different forms that the Word Stabilization process cannot possibly match; typical examples are 1st and First. Because a Replace rule replaces all forms with one preferred word for key-building purposes, the desired records can be retrieved correctly.
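The effect of the Delete and Replace rule types can be sketched as follows. This is a minimal illustration only, not SSA-NAME3 code: the word lists and the function name apply_word_rules are assumptions made for the example, and real rules live in the Edit-List of the population in use.

```python
# Hypothetical Edit-List entries, chosen only for illustration.
DELETE_WORDS = {"JNR", "LTD", "CORPORATION"}
REPLACE_WORDS = {"1ST": "FIRST"}


def apply_word_rules(name):
    """Drop Delete words and swap Replace words for their preferred form."""
    cleaned = []
    for word in name.upper().split():
        if word in DELETE_WORDS:
            continue  # a Delete word plays no part in keys, ranges or matching
        cleaned.append(REPLACE_WORDS.get(word, word))
    return cleaned


print(apply_word_rules("First National Bank Ltd"))
print(apply_word_rules("1st National Bank"))
```

Both calls yield the same word list, which is why the two spellings would end up with the same key material.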
Nickname
This is similar to the Replace Rule Type, but is specific to certain types of person nicknames. This rule will replace the word, and any variations of the word that have one of the following diminutive endings (EE, EY, IE, EI, IA, AI, E, I, Y, A, O, IEE or any double letter) with a new word.
When processing names for key building, range arrays or matching, if one of the above endings is found on a word, it is temporarily stripped from the word, the preceding consonant de-duplicated, and the result is used to look for a rule in the Edit-List.
For example, consider the nickname BILLY. There are many possible spellings for this word: BILLIE, BILLI, BILL, BILLEE and so on. The Nickname rule type creates a "stem" form by removing the diminutive ending and de-duplicating the consonant on the end. In this case, the result is BIL. Thus, a Nickname rule in the Edit-List for the word BIL with a replacement word of WILLIAM ensures that all the variations are replaced with WILLIAM.
Nickname rules can apply to all words in a name, so when defining such rules bear in mind the possible ambiguities and their prevalence in your data. For example, if the following Nickname rule is defined:
NK ROB>ROBERT<
This will work well for ROBBIE and ROBBY, but if ROBE is a surname in your population then it will also be changed to ROBERT.
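The stemming step described above can be approximated with the sketch below. It is a simplified reading of the text, not the product's implementation: the function names, the longest-ending-first order and the handling of "any double letter" as a doubled final letter are all assumptions.

```python
# Diminutive endings listed in the text; longest are tried first (an assumption).
ENDINGS = sorted(
    ["EE", "EY", "IE", "EI", "IA", "AI", "E", "I", "Y", "A", "O", "IEE"],
    key=len,
    reverse=True,
)

# Hypothetical Nickname rules, keyed by stem.
NICKNAME_RULES = {"BIL": "WILLIAM", "ROB": "ROBERT"}


def nickname_stem(word):
    """Strip one diminutive ending, then de-duplicate a doubled final letter."""
    stem = word.upper()
    for ending in ENDINGS:
        if stem.endswith(ending) and len(stem) > len(ending):
            stem = stem[: -len(ending)]
            break
    if len(stem) >= 2 and stem[-1] == stem[-2]:
        stem = stem[:-1]
    return stem


def apply_nickname(word):
    """Return the replacement word if the stem has a rule, else the word itself."""
    return NICKNAME_RULES.get(nickname_stem(word), word.upper())


for spelling in ["BILLY", "BILLIE", "BILLI", "BILL", "BILLEE", "ROBBIE", "ROBE"]:
    print(spelling, "->", apply_nickname(spelling))
```

Note that ROBE also stems to ROB in this sketch, which reproduces the ambiguity warned about above.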
Skip
There are two reasons why one may want to define a word as a Skip word. Both require assistance from Informatica Corporation and the building of a Custom Population.
Preventing a Word from being used in the Major Position of a Key
For key building, it is possible to specify that no keys or key ranges are built which would have a Skip word in the Major position. This is an internal Algorithm option that must be set by an Informatica Corporation technician in a Custom Population for the customer, as it is not turned on by default.
Setting this option can result in significantly fewer keys, depending on the number of Skip words defined in the Edit-List and their frequency in the data. Use of this option may reduce disk space and improve performance, but it may also result in some loss of reliability. It is normally only a consideration for the high-volume user.
Reducing the Weight of Words in the Matching process
The weight of Skip words can be lowered in the Matching process. This only applies when two Skip words match. For example, when matching:
ABC SYSTEMS
DEF SYSTEMS
In the default Name Matching Purposes, all words would be weighted the same. However, with an internal Algorithm option set on, and SYSTEMS defined as a Skip word, it is possible to assign a lower weight to that word pair and thus reduce the score.
This may be of interest in some special circumstances when overmatching is occurring due to too many skip type words matching.
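The idea of down-weighting a matching Skip word pair can be illustrated with a toy score. This is not the SSA-NAME3 matching algorithm: the position-by-position comparison, the function name match_score and the weight of 0.25 are assumptions chosen only to show the direction of the effect.

```python
def match_score(name_a, name_b, skip_words=frozenset(), skip_weight=1.0):
    """Toy score: matched weight divided by total weight, compared word by word."""
    matched = total = 0.0
    for word_a, word_b in zip(name_a.upper().split(), name_b.upper().split()):
        same = word_a == word_b
        # The reduced weight applies only when two Skip words match each other.
        both_skip = word_a in skip_words and word_b in skip_words
        weight = skip_weight if (same and both_skip) else 1.0
        total += weight
        if same:
            matched += weight
    return matched / total if total else 0.0


default = match_score("ABC SYSTEMS", "DEF SYSTEMS")
reduced = match_score("ABC SYSTEMS", "DEF SYSTEMS",
                      skip_words={"SYSTEMS"}, skip_weight=0.25)
print(default, reduced)
```

With equal weights the shared word SYSTEMS accounts for half of the score; with the lower weight it contributes much less, so the pair scores lower overall.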
Mark
Occasionally, an application may have a need to identify if a word is present in a name or address; for example, for the purpose of name classification (is this a company or a person name?).
It is not a standard feature of SSA-NAME3 to provide such classification "out-of-the-box", as it is not required functionality for effective key-building, range-building or matching.
However, if a user has such a requirement, then by using the Mark Edit-List rule type in conjunction with the ssan3_info API call, such results can be achieved.
This requires defining all of the words that would be used to classify the names to the Edit-List in a Category that uses the Mark rule type. Then, after a call is made to ssan3_get_keys, an ssan3_info call can be made using the ITEM=results.categories parameter to return the word categories found in the name. The category returned will be a two-character, internal form of the name.
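Once the categories found in a name are available, the classification itself is simple. The sketch below does not call ssan3_get_keys or ssan3_info; it stands in for them with a plain lookup table so that only the classification logic is shown. The marked words, the two-character codes "CO" and "PE", and the function names are hypothetical.

```python
# Hypothetical Mark words mapped to hypothetical two-character category codes.
MARK_WORDS = {
    "LTD": "CO",
    "CORPORATION": "CO",
    "SYSTEMS": "CO",
    "MR": "PE",
    "MRS": "PE",
}


def marked_categories(name):
    """Stand-in for the categories an application would get back for a name."""
    return sorted({MARK_WORDS[w] for w in name.upper().split() if w in MARK_WORDS})


def classify(name):
    """Classify a name from its marked categories; company markers win ties."""
    categories = marked_categories(name)
    if "CO" in categories:
        return "company"
    if "PE" in categories:
        return "person"
    return "unknown"


print(classify("ABC Systems Ltd"))
print(classify("Mr John Smith"))
print(classify("John Smith"))
```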
Major Left Delete
A word categorized as Major Left Delete marks the word to its left as a Major word, and is then deleted. If the word is the first word in the name, the default rule for choosing the Major word is used (that is, an internal setting that specifies whether the Major word is most likely at the Left or Right end of the name).
The major word itself is used when building certain search strategies, usually the Narrow search, and sometimes the Typical search, such that more emphasis is placed on the major word in the search.
The Major word is also used in certain Matching Purposes, for example in the Household Purpose to put more emphasis on the family name.
Major Left Keep
Same as the Major Left Delete rule type, except that the word is not deleted but rather marked as a Skip word.
Major Right Delete
A word categorized as Major Right Delete marks the word to its right as a Major word, and is then deleted. If the word is the last word in the name, the default rule for choosing the Major word is used (that is, an internal setting that specifies whether the Major word is most likely at the Left or Right end of the name).
Major Right Keep
Same as the Major Right Delete rule type, except that the word is not deleted but rather marked as a Skip word.
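The four Major rule types differ only in direction (Left or Right) and in whether the marker word is deleted or kept as a Skip word. A minimal sketch of that behaviour follows. The marker words FAMILY and OF, the function name and the default_side parameter are hypothetical, and cases such as several markers in one name are not handled.

```python
def apply_major_rules(words, rules, default_side="right"):
    """Return (remaining words, Major word, Skip words) for one name.

    rules maps a marker word to (side, action), where side is "left" or
    "right" and action is "delete" or "keep".
    """
    major = None
    kept, skip_words = [], set()
    for i, word in enumerate(words):
        rule = rules.get(word)
        if rule is None:
            kept.append(word)
            continue
        side, action = rule
        neighbour = i - 1 if side == "left" else i + 1
        if major is None and 0 <= neighbour < len(words):
            major = words[neighbour]
        if action == "keep":
            kept.append(word)  # kept in the name, but treated as a Skip word
            skip_words.add(word)
    if major is None:
        # No usable neighbour: fall back to the default end of the name.
        candidates = [w for w in kept if w not in skip_words]
        if candidates:
            major = candidates[0] if default_side == "left" else candidates[-1]
    return kept, major, skip_words


print(apply_major_rules(["JOHN", "SMITH", "FAMILY"], {"FAMILY": ("left", "delete")}))
print(apply_major_rules(["ESTATE", "OF", "JONES"], {"OF": ("right", "keep")}))
```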
Secondary Name
The Secondary Name rule type is a special type of Replace rule. Words defined as Secondary Names receive special treatment in the generation of search ranges and also in matching. However, Secondary Name definitions do not cause extra keys to be stored in the database.
Secondary Names are used for a number of purposes:
Improving selectivity for searches containing "ambiguous" words. For example, defining the following Secondary Name rules:
BERT >HERBERT
BERT >GILBERT
BERT >NORBERT
BERT >BERTRAM
BERT >ALBERT
HERBERT >BERT
GILBERT >BERT
NORBERT >BERT
BERTRAM >BERT
ALBERT >BERT
Defining these rules does not cause the replacements to happen in the keys, only in the search ranges. In this example, it means that a search for BERT will also generate search ranges that look for HERBERT, GILBERT, NORBERT, BERTRAM and ALBERT, and a search for any of the latter will also search for BERT. However, a search for ALBERT will not, for example, generate a search range containing NORBERT. Thus, selectivity is improved.
It is important that such Secondary Name rules are defined symmetrically, or the results will not be reliable. So, if a rule is added for TINA => CHRISTINA, the reverse rule CHRISTINA => TINA should also be added.
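The search-side expansion and the symmetry requirement can be sketched as below. This models only which words a search would look for, not how SSA-NAME3 builds its ranges; the data structure and the function names are assumptions.

```python
from collections import defaultdict

# The Secondary Name rules from the example above, as (word, secondary) pairs.
RULES = [
    ("BERT", "HERBERT"), ("BERT", "GILBERT"), ("BERT", "NORBERT"),
    ("BERT", "BERTRAM"), ("BERT", "ALBERT"),
    ("HERBERT", "BERT"), ("GILBERT", "BERT"), ("NORBERT", "BERT"),
    ("BERTRAM", "BERT"), ("ALBERT", "BERT"),
]


def build_index(rules):
    """Map each word to the set of secondary names defined for it."""
    index = defaultdict(set)
    for word, secondary in rules:
        index[word].add(secondary)
    return index


def search_words(word, index):
    """Words a search would cover: the word itself plus its secondary names."""
    return {word} | index.get(word, set())


def asymmetric_rules(rules):
    """Rules whose reverse is missing and which should be reviewed."""
    defined = set(rules)
    return [(a, b) for a, b in rules if (b, a) not in defined]


index = build_index(RULES)
print(sorted(search_words("BERT", index)))
print(sorted(search_words("ALBERT", index)))
print(asymmetric_rules(RULES + [("TINA", "CHRISTINA")]))
```

The expansion is one level deep only, so ALBERT reaches BERT but not NORBERT, and the last line flags the TINA rule because its reverse has not been defined.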
Better handling of synonyms.
It is appropriate to define certain types of synonyms as Secondary Name rules instead of Replace rules. For example, take the synonyms VEHICLE and AUTOMOBILE. If a direct Replace rule is used to replace one with the other, for example VEHICLE => AUTOMOBILE, then the word VEHICLE is lost forever and AUTOMOBILE is used in both keys and search ranges. Thus, a misspelling of VEHICLE will not find names that contain the correct spelling, or other spelling variations, of VEHICLE (because the misspelling will not fire the VEHICLE => AUTOMOBILE Replace rule).
By defining the VEHICLE => AUTOMOBILE replacement as a Secondary Name rule, a search that contains the word VEHICLE will find names that contain that word and its phonetic variations, as well as names that contain AUTOMOBILE and its phonetic variations. Note that it is still important to add common abbreviations as direct Replace rules (for example, AUTO => AUTOMOBILE). Remember to define these types of Secondary Name rules symmetrically as well.
Adding temporary replacement rules that do not require the SSA-NAME3 Key index to be re-built. Because Secondary Name rules only affect key ranges for searching, and not the keys that are stored in the database, it is possible to define new Secondary Name rules on the fly, and get immediate results without rebuilding keys. Therefore, even if a Direct Replacement rule would be best, if immediate benefit is required, then defining it as a Secondary Name rule will suffice. If and when some regular housekeeping is done to rebuild the SSA-NAME3 Keys index, these "temporary" Secondary Name replacement rules can be changed to Direct replacement rules.
Secondary Name rules are also used in Matching.
In the above example, BERT will score well against HERBERT, GILBERT, NORBERT, BERTRAM and ALBERT; however, ALBERT and NORBERT will not score as highly against each other.
Another use for Secondary Names in Matching is for defining geographical proximity in Addresses. For example, say the zip codes 02077 and 02120 are geographically next to each other and likely to have a certain amount of overlap in real addresses (for example, in the use of "vanity" addresses). Defining the following Secondary Name rule: 02077>02120, means that addresses with either zip code will score well against each other.
Do Not Stabilize
Words defined in a Category that have the Do Not Stabilize rule type will not undergo word stabilization. Word stabilization is a part of key and range building that addresses phonetic and orthographic error.
In most situations, it is advisable for words to be stabilized because this overcomes spelling and typing errors in the word. In some cases, however, a number of very common words can stabilize to the same form and create a selectivity problem in large data volumes for searches involving those words. For example, the words JOHN, JAMES, JOAN, JAN and JANE all stabilize to the same word. Thus, a search for JOHN SMITH will also return JOAN SMITH, JANE SMITH, JAN SMITH and JAMES SMITH.
Provided the risk of missing a match has been properly examined, it may be decided to separate these in the keys by defining them as Do Not Stabilize.
In cases where the risk of missing a match is high, use of this rule is not recommended.