Table of Contents

Search

  1. Preface
  2. Introduction
  3. The Design Issues
  4. Standard Population Choices
  5. Parsing, Standardization and Cleaning
  6. Customer Identification Systems
  7. Fraud and Intelligence Systems
  8. Marketing Systems
  9. Simple Search
  10. Summary

Application and Database Design Guide

Application and Database Design Guide

File Design for Optimal Name Search Performance

File Design for Optimal Name Search Performance

A search for all records that are relevant candidates for one set of search data, requires that one must display a list of good candidates on a screen or present this list to a batch matching/selection program.
To achieve this, the search data will be computed and used to Find, Read or Select a range of candidates from the database. This may be one or more logical requests to the database, for example, several "select" or "find" statements may be necessary.
The database size affects the average number of candidates in a given range. The bigger the file the more candidates are on file.
DBSize
Ellen Dodds
John Smiths
100,000
4
50
1,000,000
40
500
10,000,000
400
5,000
Searches are usually distributed the same way - if John Smith is .05% of the file, it’s also .05% of the transactions.
The online name search transactions logically require:
  • computation to build search ranges based on the data used in the keys;
  • physical access to the database to get index entries;
  • physical I/O to retrieve the display and matching data for all candidates;
  • computation to eliminate, rank, or sort before display.
The time consuming work is the physical I/O:
  • One or more physical I/O per index entry per logical database command;
  • One or more physical I/O per block of candidate data records;
  • If "joins" are necessary to get complete data for Matching and Display, more than one physical I/O will occur per data record.
The only way more than one candidate can be in a physical block is if the database file or table is ordered in the name key sequence. Even if this is true, little advantage is gained if access or "joins" to other tables are necessary to complete the display of a candidate line. Unless the tables are small enough to totally fit into memory, to achieve acceptable response time, all display or matching data for a candidate must be in the same record and candidates must be in physically adjacent records.
Achieving acceptable response time for even a single screen of candidates can not be done if each line requires multiple physical I/Os. You can reduce the number of candidates or screens by automating the choice, selection or matching process, but the data still has to be read from the database and presented to a "matching" program, so the need for physical optimization is still very necessary.
Of course the average number of candidate records read should be kept to a minimum, but this minimum will relate to the size of file, how common the name is, and to what degree it is important not to miss possible matches. This decision should be tied to individual transaction and business risk/benefit.
To get good response time in name search, de-normalizing and maintaining a copy of the relevant name search and matching data in optimum physical sequence is essential. It is the only way to avoid "joins" and extra physical I/O.

0 COMMENTS

We’d like to hear from you!