The Clustering Process reads records from the database or input file. For each record, it generates a search range using the KEY-FIELD. The Name Index is searched to create a list of candidate records with similar keys. The set of candidates is then read from the database and scored against the search record.
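The flow described above can be illustrated with a minimal sketch. It is not the product's implementation: the key-building rule, the prefix-based search range, the in-memory sorted list standing in for the Name Index, the dictionary standing in for the database, and the difflib-based score are all assumptions chosen only to make the steps concrete.

```python
from bisect import bisect_left, bisect_right
from difflib import SequenceMatcher


def make_key(name):
    # Hypothetical key: the letters of the name, upper-cased.
    return "".join(ch for ch in name.upper() if ch.isalpha())


def search_range(key, prefix_len=2):
    # Hypothetical search range: every key sharing the first prefix_len letters.
    low = key[:prefix_len]
    high = low + "\uffff"  # sorts after any key that starts with `low`
    return low, high


def build_name_index(records):
    # Stand-in for the Name Index: a sorted list of (key, record id) pairs.
    return sorted((make_key(r["name"]), r["id"]) for r in records)


def find_candidates(index, low, high):
    # Range search on the sorted index; returns ids of records with similar keys.
    start = bisect_left(index, (low,))
    end = bisect_right(index, (high,))
    return [rec_id for _, rec_id in index[start:end]]


def score(a, b):
    # Stand-in scoring function: string similarity in the range 0.0 to 1.0.
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def cluster(records, threshold=0.8):
    database = {r["id"]: r for r in records}  # stand-in for the database
    index = build_name_index(records)
    cluster_of = {}
    for rec in records:
        low, high = search_range(make_key(rec["name"]))
        best_id, best_score = None, threshold
        for cand_id in find_candidates(index, low, high):
            # Only compare against records that already belong to a cluster.
            if cand_id == rec["id"] or cand_id not in cluster_of:
                continue
            s = score(rec, database[cand_id])  # one read plus one scoring call
            if s >= best_score:
                best_id, best_score = cand_id, s
        # Join the best-scoring candidate's cluster, or start a new one.
        cluster_of[rec["id"]] = cluster_of[best_id] if best_id is not None else rec["id"]
    return cluster_of
```

In this sketch the inner loop runs once per candidate, so a narrower search range means fewer database reads and fewer scoring calls, while a cheaper score function lowers the cost of each iteration.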
The process can be optimized by:
reducing the size of the candidate set, thereby reducing the amount of scoring required;
reducing the cost of scoring two records;
utilizing multiple CPUs; and/or
reducing database I/O.
The following sections discuss ways to achieve these goals.