User Guide

Clustering Suite

The clustering suite consists of the following programs:

PRE: This program performs the pre-clustering formatting step. It reads the raw input file using a view and creates records that conform to the database view of the data. It can optionally add a unique Source-Id field to each record if the input data does not already contain one.
SORTIT, LOADIT and CLUSTER are also capable of reading the input files, so under normal circumstances the PRE step is omitted. However, PRE should be run when you wish to process multiple input files that each have a different format, and/or when you wish to allocate a different Source-Id to each input file.
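
The idea of allocating a different Source-Id to each input file can be sketched in Python. This is an illustration only and not how PRE is implemented or invoked: the dictionary record layout, the field name `source_id`, and the functions `tag_records` and `merge_inputs` are all hypothetical, and the sketch assumes the Source-Id is used to identify the file a record came from.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, str]


def tag_records(records: Iterable[Record], source_id: str) -> Iterator[Record]:
    """Yield copies of the records, adding source_id only where none is present."""
    for record in records:
        tagged = dict(record)  # copy so the caller's data is not modified
        tagged.setdefault("source_id", source_id)  # keep an existing Source-Id
        yield tagged


def merge_inputs(inputs: Dict[str, List[Record]]) -> List[Record]:
    """Combine several inputs into one list, one (hypothetical) Source-Id per input."""
    merged: List[Record] = []
    for source_id, records in inputs.items():
        merged.extend(tag_records(records, source_id))
    return merged


# Two hypothetical inputs; the second already carries its own Source-Id.
combined = merge_inputs({
    "FILE_A": [{"name": "John Smith"}],
    "FILE_B": [{"name": "Jon Smyth", "source_id": "LEGACY"}],
})
```

Records that already carry a Source-Id are left unchanged, mirroring the statement that the field is added only if the input data does not already contain one.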

SORTIT: This program sorts the input file. This step is optional and is used to improve the performance of the clustering phase. The input file can be either the output from the PRE program or the raw input file. In the latter case, SORTIT performs the PRE process internally before sorting the records.
The records are sorted using the key field specified in the Clustering Definition.
Sorting reorders the input file by preferred key so that similar records are placed close together on disk, which improves performance through locality of reference. However, if you use negative keys or a negative search strategy, which is the default strategy in SSA-NAME3 standard populations, searches will access the disk randomly and negate this benefit. The SORTIT step should be omitted in this situation.
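
The effect of sorting on locality can be illustrated with a small Python sketch. It is a toy model, not a measurement of the suite: the keys are plain surnames rather than real SSA-NAME3 keys, and `key_transitions` is a hypothetical helper that counts how often the key changes between neighbouring records, used here as a rough stand-in for how scattered similar records are.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (key, payload); both values are made up


def key_transitions(records: List[Record]) -> int:
    """Count adjacent record pairs whose keys differ."""
    return sum(1 for a, b in zip(records, records[1:]) if a[0] != b[0])


records: List[Record] = [
    ("SMITH", "r1"),
    ("JONES", "r2"),
    ("SMITH", "r3"),
    ("JONES", "r4"),
    ("SMITH", "r5"),
]

# Input order: the key changes between every pair of neighbours.
unsorted_transitions = key_transitions(records)  # 4

# Sorted by key: records with the same key become contiguous.
sorted_records = sorted(records, key=lambda record: record[0])
sorted_transitions = key_transitions(sorted_records)  # 1
```

Fewer transitions means records that are likely to be compared sit next to each other. When access is effectively random, as with a negative search strategy, this ordering gives no such advantage.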
LOADIT: This program pre-loads the input data into the database and generates keys, which are stored in a database index. It is an optional process. It can read its input from any of the following:
a raw input file
the output from PRE
the output from SORTIT
If LOADIT reads its input from a raw input file, it internally performs the PRE process before loading the records into the database.

CLUSTER: This program clusters the data records using the rules specified in the Clustering Definition. Data records can be read from:
a raw input file
the output from PRE
the output from SORTIT
the database (output from LOADIT)
The result of the clustering process is held in the database in the form of the cluster relationship index.
EXTRACT: This program creates a database index required by the POST program. Normally this index is created during the CLUSTER phase, unless you specify the DELAY option. In that case, you must schedule an EXTRACT job before running POST.
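
The rules for which programs to schedule can be summarized as a small Python sketch. It only encodes the conditions described in this section; the function `plan_steps` and its parameter names are hypothetical and do not correspond to any option or command of the suite, and the step names are plain labels rather than command lines.

```python
from typing import List


def plan_steps(
    *,
    multiple_input_formats: bool = False,
    per_file_source_id: bool = False,
    want_sort: bool = True,
    negative_search: bool = False,
    preload: bool = False,
    delay_index: bool = False,
    run_post: bool = True,
) -> List[str]:
    """Return the ordered list of programs to run under the stated assumptions."""
    steps: List[str] = []
    # PRE is normally omitted; it is needed for differing input formats
    # or when each input file should receive its own Source-Id.
    if multiple_input_formats or per_file_source_id:
        steps.append("PRE")
    # SORTIT is optional and brings no benefit with negative keys or a
    # negative search strategy.
    if want_sort and not negative_search:
        steps.append("SORTIT")
    # LOADIT is an optional pre-load into the database.
    if preload:
        steps.append("LOADIT")
    steps.append("CLUSTER")
    if run_post:
        # With the DELAY option the index that POST needs is not built
        # during CLUSTER, so EXTRACT must run before POST.
        if delay_index:
            steps.append("EXTRACT")
        steps.append("POST")
    return steps


default_plan = plan_steps()
delayed_plan = plan_steps(negative_search=True, delay_index=True)
```

Here `default_plan` is `["SORTIT", "CLUSTER", "POST"]`, while `delayed_plan` drops SORTIT and schedules EXTRACT before POST.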
POST: This is a general-purpose post-clustering extraction and reporting step. The layout of the report/file is controlled using a sophisticated view processor