Table of Contents

Search

  1. Preface
  2. Introduction
  3. The Design Issues
  4. Standard Population Choices
  5. Parsing, Standardization and Cleaning
  6. Customer Identification Systems
  7. Fraud and Intelligence Systems
  8. Marketing Systems
  9. Simple Search
  10. Summary

Application and Database Design Guide

Application and Database Design Guide

The Key Index and Physical Data Organization

The Key Index and Physical Data Organization

Before searching can occur, the SSA-NAME3 keys must be stored in a database table and indexed.

The SSA-NAME3 "Key" Table

Because SSA-NAME3 generates multiple keys per name or address, most databases will require that a separate file or table be set up to contain, at the very least, the
SSA-NAME3
key and a unique ID Number to refer back to the source record.
However, because most search types require other information, other than the search name or address, in order for the searcher or batch process to make a decision, this other data must be accessible. In an online search, it is inefficient (for the searcher) to display just a list of names and ID numbers back to the screen if the searcher then has to individually display each record to make the decision. It is better to make that data available to the search process itself.
By using the other identifying data in the search process itself, the records can be matched and ranked prior to them being returned to the searcher. In a batch process this is a requirement if some level of auto-matching is to be implemented. The searcher will still need to see the data on the screen in order to make the final choice; however, this work will be made easier if the candidates are ranked.
So, the other identifying data that is used for display or matching purposes should be made available to the search.
For performance optimization, it is recommended that the
SSA-NAME3
key table also redundantly carries this other data: the names and other identity data used for matching, display or filtering. Unless the database tables are small and can fit in memory, this will reduce physical I/O by eliminating the table joins required to get the other data. That is, this will allow the search, matching and display to be achieved by accessing a single table only.
Further optimization can be achieved in many database types by declaring an index on the "concatenation" of the
SSA-NAME3
key and other data. This will allow the search, matching and display to be achieved by accessing the index only.

The SSA-NAME3 Key Table Layout

The actual layout or design of the SSA-NAME3 key table is up to the DBA as, apart from the
SSANAME3
key itself, all of the other data is sourced from the user’s own database tables.
The only requirement is that the
SSA-NAME3
keys are indexed. It will also be required to index the user’s ID field such that maintenance programs can access this table and keep the keys in-sync when changes are made to the source data.
We will use a simple example of an application that needs to search on name and match on address. For example, a search for the same "resident" or "customer".
Such an application will require a table containing the
SSA-NAME3
key, the customer name from which this key was created, the customer’s current address and the ID for this customer record. For example, the customer number.
SSA-NAME3 KEY
CUSTOMER NAME
CURRENT ADDRESS
CUSTOMER NO.
@&#$$$%ˆ
John Smith
23 Wood Lane
A12345
(2&%Z1$#
John Smith
23 Wood Lane
A12345
(H*#$YY%
Geoff Brown
25 Hodges Road
B23671
((2&ˆ7%$
Geoff Brown
25 Hodges Road
B23671
As you can see, there are multiple
SSA-NAME3
key records for each name. The number of keys generated for a name depends on the number of non-noise words in the name, and the Key Level selected. Using the Limited Key Level will generate fewer keys than Standard or Extended but will overcome less word order variation ("Limited", "Standard" and "Extended" are API Control parameters).
It is important to store all data that will be required either for matching, filtering or display purposes. An example of "matching" data is the customer’s name and address. An example of "filtering" data may be a security level that controls what users can see what data. An example of "display" data is usually whatever is used for matching and any additional data that may help the searcher make a decision (example, a customer "type" flag).
It is possible to simply store the
SSA-NAME3
key and the Customer Number, however, the "redundant data" method will provide the best performance when doing searches and is well worth the extra disk space required.

Optimizing the SSA-NAME3 Key Index

The
SSA-NAME3
keys are designed so that high volume files, especially those that are low in update activity, can benefit by loading the file such that the logical and physical sequence is the
SSA-NAME3
key. This requires that when doing the initial key generation run, the keys should be written to a flat file, sorted by the
SSA-NAME3
key, and then loaded to the database.
If the DBA is using a concatenated key index (example, an "index-only" table), then this physical index optimization is not necessary as the database will do the sequencing in the index itself.
These observations make it potentially damaging to apply a hashing algorithm, or even a bit truncation algorithm, to the key. The key is designed to optimize a very badly skewed search problem; care should be exercised in any further physical optimization.

Optimizing the SSA-NAME3 Key Load Process

The process of populating a full database table with
SSA-NAME3
keys will, in most database environments, be more efficient if the database’s loader utility is used, rather than using record level inserts to the database. This is more evident the greater the volume of records to be keyed.
Bulk key-load applications can be designed to write flat files of keys and data in a format for loading to the database using its loader utility.
As described in the previous section, after creating the file of keys and data, and before running the database’s loader utility, the file should be sorted on the
SSA-NAME3
key to improve access at search time. In some databases, this may also improve load performance.
If a large number of records are to be sorted, choose an efficient sort, making good use of memory and distributed sort work files.
Some database systems also allow indexing of key fields after the file data has been loaded, and this may be more efficient than building the index dynamically as the file is being loaded.
Bulk-loader programs will also normally work more efficiently if their input is a flat file, rather than a database table. When reading and writing flat files, further optimization can be gained by increasing the block or cache size of the input and output data files.
Client-server systems should avoid performing bulk-loads across the network. Rather, a server-based program will usually be more efficient.
When extreme volumes of data are to be keyed, try to create multiple concurrent instances of the keybuilding process which process non-overlapping partitions of the input data. These can then be put back together at sort time. Care should be taken, however, that the CPU or I/O sub-systems are not already overloaded.
If the opportunity exists to off-load the key-building work to a more efficient or less busy processor, such as a powerful computer, the overall efficiency of the process may be easier to manage and predict.

0 COMMENTS

We’d like to hear from you!