User Guide

The Job Definition

This section begins with the Job-Definition keyword. The fields are as follows:

Field	Description
NAME=	A character string identifying the job. This is a mandatory parameter.
COMMENT=	This is a text field that is used to describe the Job’s purpose.
IDX-LIST=	This is a comma-separated list of IDX names used in conjunction with the Load-All-Indexes option to limit the number of IDXs to be loaded. Normally Load-All-Indexes means that all IDXs that have been defined are to be loaded.
FILE=	Defines the name of the Logical-File entity that describes either an input or an output file to be used by this job.
TYPE={PRE|SORTIT|LOADIT|CLUSTER|EXTRACT|POST}	A character string that describes the type of job. Refer to the Clustering Suite section in the Introduction chapter for more details. This is a mandatory parameter.
CLUSTERING-METHOD=method	Specifies how to assign records to clusters for a job of type CLUSTER. This setting is ignored for other job types. The parameter method is one of the following:
	BEST: A new record is added to the cluster that contains the best matching record, or is allocated to a new cluster if its score does not reach the specified minimum for a successful (i.e. accepted) match. BEST is the default clustering method.
	MERGE: All clusters that the record successfully matches (i.e. whose matches are accepted) are merged into a single cluster, and the new record is added to that cluster.
	SEED: Each record is allocated to a separate cluster. For example, if the only requirement is to discover the records in one file that match records in another file, the larger of the two files could be SEEDed first, and the second file then APPENDed to it using the NONEWCLUSTERS option.
	PRE-CLUSTERED[(FieldName[,Grouped,Ordered])]: The input file is pre-clustered, that is, each input record contains a field holding a cluster number. FieldName defaults to cluster, although you may nominate any field name from your view definition. Unless the optional keywords Grouped and Ordered are used, the field must be defined with a format of I,4 on the Database file, although it may have any compatible format on the input view. The original cluster numbers are translated to new cluster numbers when loaded into the database, and each file is allocated a different range of cluster numbers. Consequently, if two or more input files contain records with the same cluster numbers, those clusters are not maintained across files when loaded into the database. Cluster number 0 is not allowed. A typical use of PRE-CLUSTERED is an input file that already has duplicate id-numbers defining groupings of related records. This file may then be re-clustered using a different key field and the NO-ADD option. If the optional keywords Grouped and Ordered are used, the pre-clustered field can be any size and any format. The input file must then have all records with the same ID value consecutive (Grouped) and in ascending order (Ordered). Currently the Grouped option cannot be used without the Ordered option. We will later introduce internal sorting to make this ordering optional.
	MANY: A record may become a member of multiple clusters. This option is not compatible with the MERGE option. For example, if the requirement is to populate a prospect file with "do not mail" or fraud records, the prospect file could be SEEDed, and the "do not mail" or fraud records then clustered to this file using the MANY option.
	FIRST: The first candidate that achieves an Accepted score is accepted into a cluster. This option introduces an order dependency. It is designed to be used only when the resulting clusters are to be discarded and the unmatched file contains the records of interest.
CHECKPOINT-TIME=n[s|m|h|d]	Instructs the Data Clustering Engine to enter a Wait state after clustering records for n seconds, minutes, hours, or days. If n is not qualified by the optional s, m, h, or d unit suffix, it is taken as seconds. Refer to the Stopping and Restarting Clustering section for more information on how to use this parameter.
STATUS-TIME=n[s|m|h|d]	Instructs the Data Clustering Engine to write a status report after clustering records for n seconds, minutes, hours, or days. If n is not qualified by the optional s, m, h, or d unit suffix, it is taken as seconds.
INPUT-SELECT=n, INPUT-SELECT=[Count(n),] [Skip(n),] [Sample(n)]	Defines input file processing options. In the first form above, n is the number of records to be read from the input file; Count(n) is equivalent. The value n must be a positive, non-zero number. You may skip some records before processing begins by specifying Skip(n), and process every nth record by specifying Sample(n). Note that the INPUT-SELECT statement is ignored by the CLUSTER step if the data has been preloaded; in that case, use INPUT-SELECT in the LOADIT step.
INPUT-HEADER=	Describes the number of bytes to ignore at the start of the input file. This is useful for some types of files that contain a fixed length header before the actual data records.
OPTIONS=	A comma-separated list of option keywords for the job:
	INPUT-APPEND: Causes the PRE job to append its output to an existing PRE output file. This allows multiple PRE steps to run before the SORTIT step.
	NO-NEW-CLUSTERS: If a record does not successfully match any existing records, do not create a new cluster for it, i.e. do not create any clustering relationship. The unmatched records can then be written to the file specified by the UNMATCHED-FILE Clustering parameter.
	MATCH-ALL-MEMBERS: Match the record against all members of the cluster, not just the voting members.
	LOAD-ALL-INDEXES: Instructs LOADIT to load all indexes declared in the Project definition file. This is useful when only one clustering job loads the file (seeding) and more than one search or clustering has been defined using different indexes.
	RE-INDEX: Instructs LOADIT to read records from an existing database file and generate a new key index instead of reading from an input file. This is useful when you want to re-cluster the file using a new key field to improve the reliability of the clusters. Refer to the How To Re-Cluster Data section for more details. If the data used for the new index is available during the initial load (first clustering job), LOAD-ALL-INDEXES can be used instead to load all the indexes during the initial load.
	NO-ADD: Used to re-cluster records that were loaded as pre-clustered or were previously clustered. NO-ADD prevents data records from being added to the database. Refer to the How To Re-Cluster Data section for more details.
	USE-ATTRIBUTES: Honors the voting attribute. By default this option is turned off, except when the MANY clustering method is used without the NO-ADD option, in which case it is turned on. Refer to the Voting Attribute section for more information.
	SET-ALL-VOTE: All members that are added can vote. This is the default, except when the MANY clustering method is used without the NO-ADD option, in which case SET-NONE-VOTE is the default. Refer to the Voting Attribute section for more information.
	SET-NONE-VOTE: No members that are added can vote. This requires existing cluster members that can vote (from a previous clustering). Otherwise, all records become single-member clusters. Any new clusters created stay as single-member clusters. Refer to the Voting Attribute section for more information.
	SET-HEADERS-VOTE: Only the founding member of the cluster can vote (one per cluster). Refer to the Voting Attribute section for more information.
	STATUS-APPEND: Append messages to the Clustering Status File instead of overwriting it.
	REVERSE-SORTIN: Reverse the sort order of keys in the SORTIT job. PARTITION data is not reversed. KEY-DATA is not used by the sort process. This option is only available for Beta testing. It may or may not be present in future releases.
	NO-PARTITIONS-STATS: Disable logging of Partitions statistics. This option is recommended if you have a very large number of partitions, because the logging slows the process significantly.
	THREADS(#): Refer to the Utilizing multiple CPUs section for more details.
CANDIDATE-SET-SIZE-LIMIT=n	Instructs the CLUSTER step to process searches by building a list of candidate records, eliminating duplicates, and then scoring the remainder. n specifies the maximum number of unique entries in the list. The default limit is 10000 records. A value of 0 disables this processing. Any candidates that do not fit in the list generate an Audit Trail record of type Overflow. This process makes scoring more efficient when candidates are found more than once. However, it can affect the clustering results if the Clustering-Method is sensitive to the order in which records are scored. For example, the BEST method selects the record with the best score, but if two or more records achieve the best score, the first is selected. Because deduping can reorder the records, a different record might be selected, and the clustering result may differ between two otherwise identical runs. The TRUNCATE-SET option terminates the search for candidates once the list becomes full. It is used to prevent very wide searches. However, if a search is terminated prematurely, there is no guarantee that any of the candidates will be accepted or that the best candidates have been found.
CANDIDATE-SET-WARNING-LEVEL=n	Specifies a threshold value. If the number of records in the candidate set is greater than or equal to this limit n, an Audit Trail record (type SetWarning) is written. The default value is one quarter of the CANDIDATE-SET-SIZE-LIMIT.
CANDIDATE-SET-REPORT-LIMIT=n	The CLUSTER step tabulates the number of records in each candidate set. The parameter n determines the size of the biggest set for which discrete counts are maintained. At the end of the CLUSTER job, a histogram is displayed (entitled Histogram: ranges - candidates count). The default value is equal to the CANDIDATE-SET-SIZE-LIMIT.
OUTPUT-OPTIONS=	A comma-separated list of options for the POST job:
	Only-Singles: Report only single-member clusters.
	Only-Plurals: Report only clusters with more than one member.
	Indent: Blank out the Cluster ID except for the first member in each cluster. The default is to fill in DCE on all cluster members.
	Trim: Remove trailing blanks from output records.
	CR: Add a carriage return to the end of output records.
	Report: Enables the following combination of options: Indent, Trim, CR.
	Layout: Write the view description used to generate the report into the report header.
	Voting: Report only Voting records. To report only Non-Voting records, use --Voting. The default is to report all members.
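
The difference between the BEST, FIRST, and MERGE clustering methods described above can be illustrated with a small Python sketch. This is not the Data Clustering Engine's implementation: the `Clusters` class, the `assign` function, the score values, and the choice to keep the lowest cluster number after a merge are all assumptions made for illustration. Candidates are given as (cluster id, score) pairs in the order they were scored.

```python
from dataclasses import dataclass, field


@dataclass
class Clusters:
    """Toy cluster store (hypothetical): maps cluster id -> list of record ids."""
    members: dict = field(default_factory=dict)
    next_id: int = 1

    def new_cluster(self, record):
        """Allocate a new cluster containing only `record` and return its id."""
        cid = self.next_id
        self.next_id += 1
        self.members[cid] = [record]
        return cid


def assign(clusters, record, candidates, accept, method="BEST"):
    """Place `record` according to `method` and return the cluster id used.

    candidates: list of (cluster_id, score) in the order they were scored.
    accept:     minimum score for a successful (accepted) match.
    """
    accepted = [(cid, score) for cid, score in candidates if score >= accept]
    if not accepted:
        # No accepted match: the record starts a cluster of its own.
        return clusters.new_cluster(record)

    if method == "FIRST":
        # Order dependent: the first accepted candidate wins.
        target = accepted[0][0]
    elif method == "BEST":
        # max() returns the first maximal item, so ties go to the
        # candidate that was scored first.
        target = max(accepted, key=lambda c: c[1])[0]
    elif method == "MERGE":
        # Merge every accepted cluster into one. Keeping the lowest
        # cluster id is an assumption of this sketch.
        ids = sorted({cid for cid, _ in accepted})
        target = ids[0]
        for cid in ids[1:]:
            clusters.members[target].extend(clusters.members.pop(cid))
    else:
        raise ValueError(f"method not covered by this sketch: {method}")

    clusters.members[target].append(record)
    return target
```

With two existing clusters scoring 0.8 and 0.9 against an acceptance threshold of 0.7, FIRST picks the cluster scored first, BEST picks the one scoring 0.9, and MERGE collapses both into one cluster. SEED, MANY, and PRE-CLUSTERED are left out to keep the sketch short.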

The following table shows the options that are applicable to each job type.

	pre	sortit	loadit	cluster	extract	post
NAME
COMMENT
TYPE
FILE
CLUSTERING-METHOD
CHECKPOINT-TIME
STATUS-TIME
INPUT-SELECT
INPUT-HEADER
OPT=INPUT-APPEND
OPT=NO-NEW-CLUSTERS
OPT=RE-INDEX
OPT=NO-ADD
OPT=USE-ATTRIBUTES
OPT=SET-ALL-VOTE
OPT=SET-NONE-VOTE
OPT=SET-HEADERS-VOTE
OPT=STATUS-APPEND
OPT=REVERSE-SORTIN
CANDIDATE-SET-*
OUTPUT-OPTIONS
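
The n[s|m|h|d] value accepted by CHECKPOINT-TIME and STATUS-TIME resolves to a number of seconds, with a bare n taken as seconds. A minimal Python sketch of that rule follows. The function name `parse_interval` is hypothetical and not part of the product, and treating the unit suffix as strictly lower case with no surrounding whitespace is an assumption of this sketch.

```python
import re

# Seconds per unit suffix, as listed for CHECKPOINT-TIME and STATUS-TIME.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_interval(value: str) -> int:
    """Convert an n[s|m|h|d] string to seconds; a bare n means seconds."""
    match = re.fullmatch(r"(\d+)([smhd]?)", value)
    if match is None:
        raise ValueError(f"not a valid interval: {value!r}")
    n, unit = match.groups()
    # An empty suffix falls back to seconds.
    return int(n) * _UNIT_SECONDS[unit or "s"]
```

For example, `parse_interval("15m")` gives 900 and `parse_interval("90")` gives 90.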
