The Job Definition

This section begins with the Job-Definition keyword. The fields are as follows:
Field
Description
NAME=
A character string identifying the job. This is a mandatory parameter.
COMMENT=
This is a text field that is used to describe the Job’s purpose.
IDX-LIST=
This is a comma-separated list of IDX names used in conjunction with the Load-All-Indexes option to limit the number of IDXs to be loaded. Normally Load-All-Indexes means that all IDXs that have been defined are to be loaded.
FILE=
Defines the name of the Logical-File entity that describes an input or output file to be used by this job.
TYPE={PRE|SORTIT|LOADIT|CLUSTER|EXTRACT|POST}
A character string that describes the type of job. Refer to the Clustering Suite section in the Introduction chapter for more details. This is a mandatory parameter.
CLUSTERING-METHOD=method
Specifies how to assign records to clusters for a job of type CLUSTER. This setting is ignored for other job types. The parameter method is one of the following:
BEST
A new record is added to the cluster that contains the best matching record, or is allocated to a new cluster if its score does not reach the specified minimum for a successful (i.e. accepted) match. BEST is the default Clustering-Method option.
MERGE
All clusters that the record is successfully matched with (i.e. are accepted) are merged into a single cluster and the new record is added to that cluster.
SEED
Each record is allocated to a separate cluster. For example, if the only requirement is to discover the records in one file that match records in another file, then the larger of the two files could be SEEDed first, then the second file APPENDed to that file using the NONEWCLUSTERS option.
PRE-CLUSTERED[(FieldName[,Grouped,Ordered])]
The input file is pre-clustered. Each input record contains a field with a cluster number in it. This FieldName defaults to cluster, although you may nominate any FieldName from your view definition.
Unless the additional optional keywords Grouped and Ordered are used, the field must be defined with a format of I,4 on the Database file, although it may have any compatible format on the input view. Also, the original cluster numbers will be translated to new cluster numbers when loaded into the database.
Consequently, if two or more input files contain records with the same cluster numbers, these clusters will not be maintained when loaded into the database. Each file will be allocated to a different range of cluster numbers.
Cluster number 0 is not allowed.
An example of using PRE-CLUSTERED is when the input file already has duplicate id-numbers, which define groupings of related records. This file may then be re-clustered using a different key field, and the NO-ADD option.
If the additional optional keywords Grouped and Ordered are used, the pre-clustered field can be any size and any format. The input file must have all records with the same ID value consecutive (Grouped) and in ascending order (Ordered). Currently the Grouped option cannot be used without the Ordered option. We will later introduce internal sorting to make this ordering optional.
MANY
A record may become a member of multiple clusters. This option is not compatible with the MERGE option. For example, if the requirement is to populate a prospect file with do-not-mail or fraud records, the prospect file could be SEEDed, then the do-not-mail or fraud records could be clustered to this file using the MANY option.
FIRST
The first candidate that achieves an Accepted score is accepted into a cluster. This option introduces an order dependency. It is designed to be used only when the resulting clusters are to be discarded and the unmatched file contains the records of interest.
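The differences between the methods can be illustrated with a toy model. The sketch below is not the Data Clustering Engine's implementation: the function `assign`, the `(cluster_id, score)` candidate list, the `accept` threshold and the way new cluster numbers are chosen are all hypothetical, and options such as NO-NEW-CLUSTERS and the voting attributes are deliberately left out. It only mirrors the rules described above for SEED, BEST, FIRST, MERGE and MANY.

```python
from typing import Dict, List, Set, Tuple


def assign(method: str,
           candidates: List[Tuple[int, float]],
           accept: float,
           clusters: Dict[int, Set[str]],
           record: str) -> List[int]:
    """Place `record` into `clusters` and return the cluster ids it joined.

    `candidates` holds (cluster_id, score) pairs in the order they were found.
    All names are hypothetical; this is a toy model of the documented rules.
    """
    accepted = [(cid, score) for cid, score in candidates if score >= accept]

    if method == "SEED" or not accepted:
        # SEED: every record founds its own cluster.
        # Other methods: nothing was accepted, so a new cluster is created.
        new_id = max(clusters, default=0) + 1
        clusters[new_id] = {record}
        return [new_id]

    if method == "BEST":
        # max() keeps the first maximal element, so ties go to the
        # candidate that was found earliest.
        target = max(accepted, key=lambda c: c[1])[0]
        clusters[target].add(record)
        return [target]

    if method == "FIRST":
        # Order dependent: the first accepted candidate wins.
        target = accepted[0][0]
        clusters[target].add(record)
        return [target]

    if method == "MERGE":
        # All accepted clusters collapse into one (here: the lowest id).
        ids = sorted({cid for cid, _ in accepted})
        target = ids[0]
        for cid in ids[1:]:
            clusters[target] |= clusters.pop(cid)
        clusters[target].add(record)
        return [target]

    if method == "MANY":
        # The record joins every accepted cluster.
        ids = sorted({cid for cid, _ in accepted})
        for cid in ids:
            clusters[cid].add(record)
        return ids

    raise ValueError(f"unknown method: {method}")
```

For instance, with two accepted candidates in different clusters, `BEST` adds the record to one of them, `MANY` adds it to both, and `MERGE` first combines the two clusters and then adds the record to the result.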
CHECKPOINT-TIME=n[s|m|h|d]
This parameter informs the Data Clustering Engine to enter a Wait state after clustering records for n seconds/minutes/hours/days. n is assumed to have units of seconds if it is not qualified by the optional s, m, h or d unit parameter. Refer to the Stopping and Restarting Clustering section for more information on how to use this parameter.
STATUS-TIME=n[s|m|h|d]
This parameter informs the Data Clustering Engine to write a status report after clustering records for n seconds/minutes/hours/days. n is assumed to have units of seconds if it is not qualified by the optional s, m, h or d unit parameter.
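The `n[s|m|h|d]` notation shared by CHECKPOINT-TIME and STATUS-TIME can be read as follows. This is a minimal sketch for checking values by hand, not code from the product; the function name `interval_seconds` is hypothetical, and it assumes the unit letters are written in lower case exactly as shown above.

```python
import re

# Seconds per documented unit letter.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def interval_seconds(value: str) -> int:
    """Convert an n[s|m|h|d] value to seconds; a bare number means seconds."""
    match = re.fullmatch(r"(\d+)([smhd]?)", value.strip())
    if match is None:
        raise ValueError(f"not of the form n[s|m|h|d]: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_SECONDS[unit or "s"]
```

Under this reading, `CHECKPOINT-TIME=15m` and `CHECKPOINT-TIME=900` describe the same interval.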
INPUT-SELECT=n,
INPUT-SELECT=[Count(n),] [Skip(n),] [Sample(n)]
This parameter is used to define input file processing options. When specified in the first form above, the number n is treated as the number of records to be read from the input file. An equivalent method of specifying this is Count(n). The value n must be a positive non-zero number. You may skip some records before processing begins by specifying Skip(n). You may also process every nth record by specifying Sample(n). Note that the INPUT-SELECT statement is ignored by the Cluster step if the data has been preloaded. In this case you can use the INPUT-SELECT statement in the LOADIT step.
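One way to picture how Count, Skip and Sample combine is the sketch below. It rests on assumptions the text does not settle: that Skip is applied first, that Sample then keeps the first remaining record and every nth one after it, and that Count caps the number of records finally selected. The names `parse_input_select` and `select_records` are hypothetical and are not part of the product.

```python
import re
from itertools import islice
from typing import Dict, Iterable, Iterator, TypeVar

T = TypeVar("T")


def parse_input_select(value: str) -> Dict[str, int]:
    """Parse 'n' or 'Count(n),Skip(n),Sample(n)' into a dict of integers."""
    value = value.strip()
    if value.isdigit():
        # First form: INPUT-SELECT=n is equivalent to Count(n).
        options = {"count": int(value)}
    else:
        pairs = re.findall(r"(Count|Skip|Sample)\((\d+)\)", value, flags=re.IGNORECASE)
        options = {name.lower(): int(number) for name, number in pairs}
        if not options:
            raise ValueError(f"unrecognised INPUT-SELECT value: {value!r}")
    if options.get("count", 1) <= 0 or options.get("sample", 1) <= 0:
        raise ValueError("Count and Sample must be positive, non-zero numbers")
    return options


def select_records(records: Iterable[T], options: Dict[str, int]) -> Iterator[T]:
    """Apply Skip, then Sample, then Count (an assumed order) to a record stream."""
    skip = options.get("skip", 0)
    step = options.get("sample", 1)
    count = options.get("count")  # None means no limit
    sampled = islice(records, skip, None, step)
    return islice(sampled, count)
```

With 30 input records numbered from 1, `Count(3),Skip(2),Sample(4)` would select records 3, 7 and 11 under these assumptions.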
INPUT-HEADER=
Describes the number of bytes to ignore at the start of the input file. This is useful for some types of files that contain a fixed length header before the actual data records.
OPTIONS=
A comma separated list of option keywords for the job:
INPUT-APPEND
This causes the PRE job to append its output to an existing PRE output file. This allows running multiple PRE steps before the SORTIT step.
NO-NEW-CLUSTERS
If a record does not successfully match any existing record, do not create a new cluster for it, i.e. do not create any clustering relationship. In this case, the unmatched records can be written to the file specified by the UNMATCHED-FILE Clustering parameter.
MATCH-ALL-MEMBERS
Match the record against all members of the cluster, not just the voting members.
LOAD-ALL-INDEXES
Instructs LOADIT to load all indexes declared in the Project definition file. This is useful when only one clustering job loads the file (seeding) and more than one search/clustering has been defined using different indexes.
RE-INDEX
Instructs LOADIT to read records from an existing database file and generate a new key index instead of reading from an input file. This is useful when there is a desire to re-cluster the file using a new key field to improve the reliability of the clusters. Refer to the How To Re-Cluster Data section for more details.
If the data used for the new index is available during the initial load (first clustering job), the option LOAD-ALL-INDEXES can be used to load all the indexes during the initial load.
NO-ADD
Used to re-cluster records which were loaded as pre-clustered, or previously clustered. NO-ADD prevents data records from being added to the database. Refer to the How To Re-Cluster Data section for more details.
USE-ATTRIBUTES
Honors the voting attribute. This option is off by default, except when Clustering-Method MANY is used without the NO-ADD option, in which case it is turned on. Refer to the Voting Attribute section for more information.
SET-ALL-VOTE
All members that are added can vote. This is the default, except when Clustering-Method MANY is used without the NO-ADD option, in which case SET-NONE-VOTE is the default. Refer to the Voting Attribute section for more information.
SET-NONE-VOTE
Members that are added cannot vote. This requires existing cluster members that can vote (from a previous clustering); otherwise all records become single-member clusters. Any new clusters created will stay as single-member clusters. Refer to the Voting Attribute section for more information.
SET-HEADERS-VOTE
Only the founding member of the cluster can vote (one per cluster). Refer to the Voting Attribute section for more information.
STATUS-APPEND
Append messages to the Clustering Status File instead of overwriting it.
REVERSE-SORTIN
Reverse the sort order of keys in the SORTIT job. PARTITION data is not reversed. KEY-DATA is not used by the sort process.
This option is only available for Beta testing. It may or may not be present in future releases.
NO-PARTITIONS-STATS
Disable logging of Partitions statistics. This is recommended when there is a very large number of partitions, since logging them slows the process significantly.
THREADS(#)
Refer to the Utilizing multiple CPUs section for more details.
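The default behaviour of the voting options depends on the Clustering-Method and on NO-ADD, which is easy to misread. The sketch below restates those defaults as described above. The function `effective_vote_settings` is hypothetical, and treating the three SET-*-VOTE options as mutually exclusive is an assumption, not something the text states.

```python
from typing import Iterable, Tuple

VOTE_OPTIONS = ("SET-ALL-VOTE", "SET-NONE-VOTE", "SET-HEADERS-VOTE")


def effective_vote_settings(method: str, options: Iterable[str]) -> Tuple[str, bool]:
    """Return (vote option in effect, whether USE-ATTRIBUTES is in effect)."""
    opts = {option.upper() for option in options}
    many_without_no_add = method.upper() == "MANY" and "NO-ADD" not in opts

    explicit = [option for option in VOTE_OPTIONS if option in opts]
    if len(explicit) > 1:
        # Assumption: only one SET-*-VOTE option may be given per job.
        raise ValueError(f"conflicting vote options: {explicit}")

    if explicit:
        vote = explicit[0]
    elif many_without_no_add:
        vote = "SET-NONE-VOTE"   # documented default for MANY without NO-ADD
    else:
        vote = "SET-ALL-VOTE"    # documented default otherwise

    # USE-ATTRIBUTES is off unless requested, or MANY is used without NO-ADD.
    use_attributes = "USE-ATTRIBUTES" in opts or many_without_no_add
    return vote, use_attributes
```

For example, a MANY job with no options resolves to SET-NONE-VOTE with USE-ATTRIBUTES on, while adding NO-ADD brings it back to SET-ALL-VOTE with USE-ATTRIBUTES off.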
CANDIDATE-SET-SIZE-LIMIT=n
Informs the CLUSTER step to process searches by building a list of candidate records, eliminating duplicates, and then scoring the remainder. n specifies the maximum number of unique entries in the list. The default limit is 10000 records. A value of 0 disables this processing. Any candidates that do not fit in the list generate an Audit Trail record of type Overflow.
This process makes scoring more efficient when candidates are found more than once. However, it can affect the clustering results if the Clustering-Method is sensitive to the order in which records are scored. For example, the BEST method will select the record with the best score, but if two or more records achieve the best score, then the first is selected. As deduping can reorder the records, a different record might be selected and the clustering result may differ over two otherwise identical runs.
The TRUNCATE-SET option will terminate the search for candidates once the list becomes full. It is used to prevent very wide searches. However, if a search is terminated prematurely there is no guarantee that any of the candidates will be accepted or that the best candidates have been found.
CANDIDATE-SET-WARNING-LEVEL=n
This specifies a threshold value. If the set of candidate records is greater than or equal to this limit n, an Audit Trail record (type SetWarning) is written. The default value is one quarter of the CANDIDATE-SET-SIZE-LIMIT.
CANDIDATE-SET-REPORT-LIMIT=n
The cluster step will tabulate the number of records in each candidate set. This parameter n determines the size of the biggest set for which discrete counts will be maintained. At the end of the CLUSTER job, a histogram will be displayed (entitled Histogram: ranges - candidates count). The default value is equal to the CANDIDATE-SET-SIZE-LIMIT.
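The interaction between the size limit, the warning level and TRUNCATE-SET can be sketched as below. This is an illustrative model only: `build_candidate_set` is a hypothetical name, candidates are represented as plain integers, the list keeps first-seen order (the engine may order them differently), and counting one overflow per unique candidate that does not fit is an assumption.

```python
from typing import Iterable, List, Optional, Tuple


def build_candidate_set(found: Iterable[int],
                        size_limit: int = 10000,
                        warning_level: Optional[int] = None,
                        truncate_set: bool = False) -> Tuple[List[int], int, bool]:
    """Return (unique candidates, overflow count, SetWarning flag) for one search."""
    if size_limit == 0:
        # A limit of 0 disables the processing altogether.
        return list(found), 0, False
    if warning_level is None:
        warning_level = size_limit // 4  # documented default: one quarter of the limit

    seen = set()
    unique: List[int] = []
    overflow = 0
    for record_id in found:
        if record_id in seen:
            continue                     # duplicate candidate, scored only once
        seen.add(record_id)
        if len(unique) >= size_limit:
            overflow += 1                # would produce an Overflow audit record
            continue
        unique.append(record_id)
        if truncate_set and len(unique) == size_limit:
            break                        # TRUNCATE-SET: stop searching once full

    set_warning = len(unique) >= warning_level
    return unique, overflow, set_warning
```

Because `found` is consumed lazily, passing a generator shows the effect of TRUNCATE-SET: the search stops being read as soon as the list is full.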
OUTPUT-OPTIONS=
A comma-separated list of options for the POST job.
Only-Singles
Only report single member clusters.
Only-Plurals
Only report clusters with more than one member.
Indent
Blank out the Cluster ID except for the first member in each cluster. The default is to fill in DCE on all cluster members.
Trim
Remove trailing blanks from output records.
CR
Add carriage return to the end of output records.
Report
This enables the following combination of options: Indent, Trim, CR.
Layout
Write the view description used to generate the report into the report header.
Voting
Only report Voting records. To report only Non-Voting, use --Voting. Default is to report all members.
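The effect of the formatting options can be shown on a small example. The sketch below is hypothetical throughout: the function `format_post_output`, the 8-character cluster ID column and the `(cluster_id, data)` row shape are invented for illustration and do not reflect the actual POST output layout. Rows are assumed to arrive already grouped by cluster.

```python
from itertools import groupby
from typing import Iterable, List, Tuple


def format_post_output(rows: Iterable[Tuple[int, str]], options: Iterable[str]) -> List[str]:
    """Format (cluster_id, data) rows according to a set of OUTPUT-OPTIONS."""
    opts = {option.lower() for option in options}
    if "report" in opts:
        opts |= {"indent", "trim", "cr"}  # Report enables Indent, Trim and CR

    lines: List[str] = []
    for cluster_id, group in groupby(rows, key=lambda row: row[0]):
        members = list(group)
        if "only-singles" in opts and len(members) != 1:
            continue
        if "only-plurals" in opts and len(members) < 2:
            continue
        for position, (_, data) in enumerate(members):
            # Indent: blank the cluster ID on all but the first member.
            show_id = position == 0 or "indent" not in opts
            id_field = f"{cluster_id:>8}" if show_id else " " * 8
            line = f"{id_field} {data}"
            if "trim" in opts:
                line = line.rstrip(" ")   # Trim: remove trailing blanks
            if "cr" in opts:
                line += "\r"              # CR: add a carriage return
            lines.append(line)
    return lines
```

Calling it with `{"Report"}` on a two-member cluster followed by a singleton yields three lines, with the cluster ID shown only on the first member of each cluster.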
The following table shows the options that are applicable to each job type.
Job types (table columns): pre, sortit, loadit, cluster, extract, post
Options (table rows):
NAME
COMMENT
TYPE
FILE
CLUSTERING-METHOD
CHECKPOINT-TIME
STATUS-TIME
INPUT-SELECT
INPUT-HEADER
OPT=INPUT-APPEND
OPT=NO-NEW-CLUSTERS
OPT=RE-INDEX
OPT=NO-ADD
OPT=USE-ATTRIBUTES
OPT=SET-ALL-VOTE
OPT=SET-NONE-VOTE
OPT=SET-HEADERS-VOTE
OPT=STATUS-APPEND
OPT=REVERSE-SORTIN
CANDIDATE-SET-*
OUTPUT-OPTIONS
