Clustering Definition

This section begins with the Clustering-Definition keyword. The fields are as follows:
Field
Description
NAME=
A character string that identifies the Clustering-Definition. The name must not match any Search-Definition or Multi-Search-Definition name in the same Project. This is a mandatory parameter.
CLUSTERING-ID=
A unique two-character ID prefixed to all cluster numbers generated by this Clustering. This is a mandatory parameter.
If the first Clustering definition is used for seeding and any subsequent Clusterings add to this seeded Clustering, then all of these Clusterings should use the same CLUSTERING-ID. See the CLUSTERING-METHOD=SEED section under User-Job-Definition for more information about seeding.
IDX=
The name of the IDX used by the clustering step. If this parameter is not given, the default IDX name "kx" followed by the given CLUSTERING-ID is assumed. The IDX is defined in the IDX-definition section (see the IDX Definition / NAME= section).
INDEXES-PATH=
The path for Clustering index files. DCE does not support spaces in file or path names.
FORMATTED-FILE-PATH=
Optional path for the formatted data file fmt.tmp. Refer to the Reformat Input Data section.
COMMENT=
An optional character string describing this clustering step.
SEARCH-LOGIC= (alias to KEY-LOGIC=)
This parameter describes the logic to be used to generate search ranges to find candidate records from the IDT. It may differ from the KEY-LOGIC= used to generate keys for the IDT (as defined in the IDX-Definition). Refer to the Search Logic section for details. This is a mandatory parameter.
SCORE-LOGIC=
This parameter describes the normal matching logic used to refine the set of candidate records found by the Search-Logic. This is a mandatory parameter unless at least one of the other SCORE-LOGIC parameters is specified. Refer to the Score Logic section for details.
PRE-SCORE-LOGIC=
This optional parameter describes the lightweight matching logic used to refine the set of candidate records found by the Search-Logic. Refer to the Score Logic section for details.
KEY-SCORE-LOGIC=
This optional parameter describes the normal matching logic used to refine the set of candidate records found by the Key-Logic. Refer to the Search Logic section for details.
KEY-PRE-SCORE-LOGIC=
This optional parameter describes the lightweight matching logic used to refine the set of candidate records found by the Key-Logic. Refer to the Search Logic section for details.
SORTED-FILE-PATH=
Optional path for the temporary sort data file srt.tmp.
SORT-WORK1-PATH=, SORT-WORK2-PATH=
DCE may create sort work files when sorting a large result set. These parameters control the placement of these files and override any values given in the Project-Definition.
KEY-FIELD=
The name of the field in the database file that is used for key generation. This must be a field defined in the File-Definition. It is recommended that any use of this keyword be reviewed and converted to the newer Field(List of keyfields) Search-Logic/Key-Logic option. For more details, refer to the Search Logic section.
CANDIDATE-SET-SIZE-LIMIT=n
Instructs the DCE Search Server to process searches by first building a list of candidate records, eliminating duplicates, and then scoring the remainder. This process usually makes scoring more efficient. n specifies the maximum number of unique entries in the list. The default limit is 10000 records. A value of 0 disables this processing.
SCHEDULE=<list of jobs>
Comma-separated list of jobs scheduled for this clustering. The jobs listed must be defined in the job-definition sections.
UNMATCHED-FILE=
The name of the Logical-File entity that describes the Unmatched File. This file is created when running a clustering job with the No-New-Clusters option. When this parameter is defined, the records that did not match any existing clusters are written to the Unmatched File. An output view may be used to format the output file.
OPTIONS=
A comma-separated list of keywords used to control various search options:
  • ADD-NULL-KEY
    SORTIT processing is used to sort the input file into preferred key order. If an input record generates a null key and the IDX-Definition option No-Null-Key has been specified, the record is not written to the output file and therefore will not be loaded into the IDT. If you wish this record to be loaded but do not want null keys added to the key index, specify the Add-Null-Key option. This is useful if the record will be reindexed later using a different field.
  • APPEND
    the input file is appended to an existing file in the database. This option is used to merge two or more input files into one clustering database. Normally, the previous clustering information is erased when a file is loaded to the database.
  • AUTO-ID
    generate a unique record source Id in the Id-field. This option is needed if source identifiers (Source Id) will be used to identify records.
  • IGNORE-NOTCH-OVERRIDE
    Ignore any adjustments made to the match levels by a search client (Relate or DupFinder) that requests a particular Match-Tolerance. The tolerance is honored but the adjustments are ignored.
  • DELAY
    delay the building of cluster index 2 (used by POST); if you use this option and you wish to run a POST job, you will need to schedule an EXTRACT job to create the index file.
  • FORMAT
    run the PRE job to pre-format the raw input data. This option is needed if a job of type PRE is used.
  • PRE-LOAD
    run the LOADIT job to preload the input data to the database. This option is needed if a job of type LOADIT is used.
  • SEARCH-NULL-PARTITION
    any search for a record containing a Null-Partition value will search all other partitions. Any search for a record with a non-null partition value will search the null-partition as well. Note that the entire partition value must be null for this to work.
  • SORT-IN
    run the SORTIT job to sort the input data. This option is needed if a job of type SORTIT is used.
  • TRUNCATE-SET
    modifies the behavior of CANDIDATE-SET-SIZE-LIMIT. Searches normally continue until all candidates have been considered. Truncate-Set will terminate the search once the candidate set is full, thereby limiting the number of candidates that will be considered.
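
The interaction between CANDIDATE-SET-SIZE-LIMIT, TRUNCATE-SET, and SEARCH-NULL-PARTITION can be easier to follow as a small model. The Python sketch below is illustrative only and is not DCE code: the function names, the use of `None` for a null partition, and the list-based candidate stream are all hypothetical. It also rests on assumptions that this section does not state: that without TRUNCATE-SET a full candidate set is scored and a new set is started, that a limit of 0 passes every candidate straight to scoring without removing duplicates, and that without SEARCH-NULL-PARTITION a search stays in the record's own partition.

```python
from typing import Iterable, List, Optional


def build_candidate_sets(candidates: Iterable[str],
                         limit: int = 10000,
                         truncate_set: bool = False) -> List[List[str]]:
    """Return the candidate sets that would be handed to scoring, in order."""
    if limit == 0:
        # Assumption: 0 disables list building, so nothing is deduplicated.
        return [list(candidates)]

    sets: List[List[str]] = []
    current: List[str] = []
    seen = set()
    for record_id in candidates:
        if record_id in seen:
            continue  # duplicates are eliminated before scoring
        seen.add(record_id)
        current.append(record_id)
        if len(current) == limit:
            sets.append(current)
            if truncate_set:
                # TRUNCATE-SET: stop as soon as the candidate set is full.
                return sets
            # Assumption: otherwise start a new set and keep searching.
            current = []
    if current:
        sets.append(current)
    return sets


def partitions_to_search(record_partition: Optional[str],
                         known_partitions: Iterable[Optional[str]],
                         search_null_partition: bool) -> List[Optional[str]]:
    """Return the partitions a search would visit; None models the null partition."""
    if not search_null_partition:
        # Assumption: without the option, only the record's own partition is searched.
        return [record_partition]
    if record_partition is None:
        # A null-partition record searches all other partitions as well.
        others = sorted(p for p in set(known_partitions) if p is not None)
        return [None] + others
    # A non-null record also searches the null partition.
    return [record_partition, None]
```

For example, with the candidate stream `["r1", "r2", "r1", "r3", "r4", "r2", "r5"]` and a limit of 3, the model produces two sets when TRUNCATE-SET is off and only the first set, `["r1", "r2", "r3"]`, when it is on, so `r4` and `r5` are never considered.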

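Several rules in the field descriptions above lend themselves to a quick pre-flight check: the mandatory parameters, the two-character CLUSTERING-ID, the default IDX name, the name clash with Search-Definition and Multi-Search-Definition, and the restriction on spaces in paths. The Python sketch below is one hedged way to express them. It does not parse DCE definition files and does not use any DCE API; the dictionary representation, the function names, and the example values are hypothetical.

```python
from typing import Dict, Iterable, List

# At least one of these must be present (SCORE-LOGIC is mandatory otherwise).
SCORE_LOGIC_KEYS = ("SCORE-LOGIC", "PRE-SCORE-LOGIC",
                    "KEY-SCORE-LOGIC", "KEY-PRE-SCORE-LOGIC")
# Path parameters described in this section; spaces are not supported.
PATH_KEYS = ("INDEXES-PATH", "FORMATTED-FILE-PATH", "SORTED-FILE-PATH",
             "SORT-WORK1-PATH", "SORT-WORK2-PATH")


def effective_idx_name(params: Dict[str, str]) -> str:
    """IDX= if given, otherwise "kx" followed by the CLUSTERING-ID."""
    return params.get("IDX") or "kx" + params["CLUSTERING-ID"]


def check_clustering_definition(params: Dict[str, str],
                                other_definition_names: Iterable[str] = ()
                                ) -> List[str]:
    """Return a list of rule violations; an empty list means none were found."""
    problems: List[str] = []
    for key in ("NAME", "CLUSTERING-ID"):
        if not params.get(key):
            problems.append(f"{key}= is mandatory")
    if not (params.get("SEARCH-LOGIC") or params.get("KEY-LOGIC")):
        problems.append("SEARCH-LOGIC= (alias KEY-LOGIC=) is mandatory")
    if not any(params.get(key) for key in SCORE_LOGIC_KEYS):
        problems.append("at least one SCORE-LOGIC parameter is required")
    clustering_id = params.get("CLUSTERING-ID", "")
    if clustering_id and len(clustering_id) != 2:
        problems.append("CLUSTERING-ID= must be exactly two characters")
    if params.get("NAME") and params["NAME"] in set(other_definition_names):
        problems.append("NAME= clashes with a Search-Definition or "
                        "Multi-Search-Definition name")
    for key in PATH_KEYS:
        if " " in params.get(key, ""):
            problems.append(f"{key}= contains a space")
    return problems
```

A definition with `NAME`, a two-character `CLUSTERING-ID` such as `aa`, a search logic, and a score logic passes the check, and `effective_idx_name` returns `kxaa` for it when `IDX` is omitted.
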