How To

This section provides a "quick start" to the parameters required to solve some common types of business problems using the Data Clustering Engine. It also contains hints and "tricks of the trade".

How to Re-Cluster Data

The Data Clustering Engine may be used to recluster records that have already been clustered. This can be achieved with either of the following techniques:

Technique #1

  • Records are loaded into the database as preclustered.
  • The records are reclustered using the same input file by specifying the Merge Clustering-method and the NO-ADD Job Option.

The NO-ADD option prevents the input records from being re-added and re-indexed on the database. It also prevents new cluster records from being added.
The real work is performed by the Merge option. The (re)clustering process uses records from the input file to match and score against records on the database. Records that reach the scoring threshold have their clusters merged. The result is a reclustered file.
A set of sample definitions demonstrating this technique can be found in test02.sdf.
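The effect of the Merge step can be sketched as a union of clusters linked by above-threshold scores. This is a minimal illustration only, not the DCE implementation; the record IDs, cluster labels, scores, and the threshold of 75 are all invented for the example.

```python
# Sketch: clusters whose records score at or above the threshold
# against each other are merged (union-find over cluster labels).

def merge_clusters(assignments, scored_pairs, threshold):
    """assignments: {record_id: cluster_id}; scored_pairs: [(id1, id2, score)].
    Returns a new {record_id: cluster_id} with merged clusters."""
    parent = {c: c for c in set(assignments.values())}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path compression
            c = parent[c]
        return c

    for a, b, score in scored_pairs:
        if score >= threshold:
            ra, rb = find(assignments[a]), find(assignments[b])
            if ra != rb:
                parent[rb] = ra  # merge the two clusters

    return {rec: find(cl) for rec, cl in assignments.items()}

# Records 1 and 2 start in different clusters; a score of 82 merges them,
# while the 40 scored between records 2 and 3 is below the threshold.
before = {1: "C1", 2: "C2", 3: "C3"}
after = merge_clusters(before, [(1, 2, 82), (2, 3, 40)], threshold=75)
```

After the merge, records 1 and 2 share one cluster and record 3 keeps its own.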

Technique #2

  • Records are loaded into the database and clustered using a key field such as Name.
  • A key index is generated for a new key field, say Address.
  • A second clustering step is run to recluster the records using the Address data. This merges clusters of Names that have matching Addresses.

A set of sample definitions demonstrating this technique can be found in test03.sdf.
Notice the following features:
  • The ClusteringID must be the same for both Clusterings.
  • Each Clustering’s Key Index is named to avoid confusion. If Key Index were omitted, the same (default) file name would be used for each clustering.
  • The second LOADIT job specifies ReIndex to rebuild the key index using the new Key-Field (Address).
  • The Reclustering step specifies that the data is PRE-LOADed and that a Merge operation with NO-ADD is to be performed.

Input from a Named Pipe

On Unix platforms the input processor can read input from a named pipe. This means that it is possible to read data from another database without the need to create large intermediate files.
The concept is identical on all Unix platforms, although the command used to create a named pipe may vary between implementations. The following example is applicable to Linux.
mkfifo $SSAPROJ/inpipe
To use the pipe, specify its name as the Physical-File parameter in the Logical-File-Definition of the input file:

Logical-File-Definition
*======================
NAME=          lf-input
PHYSICAL-FILE= "+/inpipe"
COMMENT=       "named pipe for the load step."
VIEW=          DATAIN
FORMAT=        TEXT
AUTO-ID-NAME=  Job1
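The producer side of such a pipeline can be sketched as follows. This is a hypothetical illustration: the pipe path, the sample row, and the in-process writer thread stand in for a real database-export process, and in practice the reader would be DCE's input processor rather than a Python `open`.

```python
# Sketch: feed a named pipe from a producer while a reader consumes it,
# avoiding a large intermediate file. Unix only (os.mkfifo).
import os
import tempfile
import threading

pipe = os.path.join(tempfile.mkdtemp(), "inpipe")
os.mkfifo(pipe)  # same effect as the shell command: mkfifo $SSAPROJ/inpipe

def producer():
    # Stands in for a database export writing rows to the pipe.
    # open() blocks until a reader opens the other end.
    with open(pipe, "w") as f:
        f.write("JOHN SMITH,123 MAIN ST\n")

t = threading.Thread(target=producer)
t.start()

# The reader (DCE's load step, in practice) sees the data as a plain file.
with open(pipe) as f:
    data = f.read()
t.join()
```

The reader simply opens the pipe path like any file, which is why the Physical-File parameter needs no special syntax on Unix.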

For W32 platforms:

The pipe must be of the "blocking" type, created by calling the Windows API function CreateNamedPipe before the Data Clustering Engine is instructed to read from the pipe.
To use a named pipe, specify its name in Microsoft’s format: \\server\pipe\<pipename>. That is, two backslashes, the server name (or a dot for the current machine), a backslash, the word "pipe", another backslash, and then the name of the named pipe.
The <pipename> part of the name can include any characters, including numbers, spaces and special characters, but not backslashes or colons. The entire pipe name string can be up to 256 characters long. Pipe names are not case sensitive.
If you do not specify a name starting with "\\.\pipe\", an ordinary file is assumed.
You can specify the file in the SDF. For example:
logical-file-definition
*======================
NAME=          LF-input
COMMENT=       "named pipe"
PHYSICAL-FILE= "\\.\pipe\namedpipe"
VIEW=          DATAIN
FORMAT=        TEXT
AUTO-ID-NAME=  Job1
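The naming rules above can be checked mechanically. The following is a hypothetical validator, not part of DCE; it only encodes the format constraints just described (two leading backslashes, a server name or dot, the literal word "pipe", a pipename free of backslashes and colons, and a 256-character limit).

```python
# Sketch: validate a Windows named-pipe path of the form \\server\pipe\<pipename>.

def is_valid_pipe_path(path):
    if len(path) > 256:
        return False                      # entire string limited to 256 chars
    if not path.startswith("\\\\"):
        return False                      # must begin with two backslashes
    parts = path[2:].split("\\", 2)       # -> [server, "pipe", pipename]
    if len(parts) != 3:
        return False
    server, word, pipename = parts
    if word.lower() != "pipe" or not server or not pipename:
        return False                      # names are case-insensitive
    return "\\" not in pipename and ":" not in pipename

# Anything that does not look like "\\.\pipe\..." is treated as a file name.
print(is_valid_pipe_path(r"\\.\pipe\namedpipe"))   # True
print(is_valid_pipe_path(r"C:\data\input.dat"))    # False
```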

Reformat Input Data

The PRE utility can be used as a standalone tool to reformat files. It can be used to
  • reorder fields,
  • delete fields,
  • combine fields,
  • insert text between fields.
The strength of this utility comes from its use of SSA-DB’s view processor. An input view is used to describe the layout of the input file (DATAIN11 in the example below).
The Logical-File-Definition describes the name and format of the input file; %SSAPROJ%/data/nm1k.dat and Text respectively.
PRE reads the input file using the input view and transforms the fields to match the output view specified by the Project-Definition’s FILE= parameter. The output view is normally described in the file definition section of the SDF. Under normal conditions, the output of PRE is a compressed binary file called fmt.tmp.
You can disable compression by specifying the Clustering-Definition’s Options=--Compress-Temp parameter.
You can generate Text format output by specifying the Job-Definition’s Output-Options=Trim,CR parameters.
Project-Definition
*=================
NAME=         pre-job
ID=           01
FILE=         DATA11
DEFAULT-PATH= "+"
*
Clustering-Definition
*====================
NAME=          clustering-pre
CLUSTERING-ID= aa
OPTIONS=       Format, --Compress-Temp
SCHEDULE=      job-pre
*
Job-Definition
*=============
NAME=           job-pre
TYPE=           pre
FILE=           lf-input
OUTPUT-OPTIONS= Trim, CR
*
*
Logical-File-Definition
*======================
NAME=          lf-input
PHYSICAL-FILE= "+/data/nm1k.dat"
COMMENT=       "the input file"
VIEW=          DATAIN11
FORMAT=        TEXT
*

The input and output views are used to specify how the file is to be modified:
  • Fields are reordered by changing their relative positions in the input and output views.
  • A field may be deleted by omitting it from the output view.
  • Fields can be "combined" by reordering them to be consecutive. The input view for the next phase could then treat the adjacent fields as one "large" field.
  • Fixed data can be inserted between fields by adding filler(s) to the output view.
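The four view-driven operations above can be sketched in a few lines. This is an illustration of the principle only: the field names, widths, sample record, and filler text are invented, and real layouts come from the SDF's view definitions (e.g. DATAIN11).

```python
# Sketch: reformat a fixed-width record by slicing it per an input view
# and re-emitting fields per an output view (reorder, delete, filler).

IN_VIEW = [("name", 10), ("phone", 8), ("city", 6)]           # input layout
OUT_VIEW = [("city", 6), ("filler", 3, " | "), ("name", 10)]  # phone deleted

def reformat(record):
    # Slice the fixed-width record according to the input view...
    fields, pos = {}, 0
    for fname, width in IN_VIEW:
        fields[fname] = record[pos:pos + width]
        pos += width
    # ...then emit fields in the output view's order, inserting fillers.
    out = []
    for item in OUT_VIEW:
        if item[0] == "filler":
            out.append(item[2])           # fixed text inserted between fields
        else:
            out.append(fields[item[0]])   # field reordered from the input
    return "".join(out)

row = "JOHN SMITH555-1212LONDON"
print(reformat(row))  # "LONDON | JOHN SMITH": city first, filler, name; phone dropped
```

Combining fields is the remaining case: placing two fields consecutively in the output view lets the next phase's input view read them as one larger field.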

Creating an Index and Search-Logic for any DATA field

The DCE utilities can be used to create an index for, and search on, any field in the IDT. Under normal circumstances the KEY-FIELD is used to generate name-keys using a Key-Logic module. This procedure can be modified to create an index for any field.
By defining an IDX-Definition and a Search-Definition that name the field to be indexed as the Key-Field, and by specifying a Key-Logic and Search-Logic of User, we effectively define a key index and search definition that contain the exact key-value extracted from the DATA record.
For key building (IDX-Definition):

KEY-LOGIC=    User, Field(Phone)

For search (Search-Definition or Clustering-Definition):

SEARCH-LOGIC= User, Field(Phone)
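The effect of a User Key-Logic and Search-Logic can be pictured as an exact-value lookup table, in contrast to the fuzzy name-keys a Key-Logic module would generate. The records and field values below are invented for illustration.

```python
# Sketch: KEY-LOGIC=User, Field(Phone) indexes the raw field value;
# SEARCH-LOGIC=User, Field(Phone) retrieves only exact-key matches.

records = [
    {"id": 1, "name": "JOHN SMITH", "phone": "555-1212"},
    {"id": 2, "name": "JON SMYTHE", "phone": "555-1212"},
    {"id": 3, "name": "ANN JONES",  "phone": "555-9999"},
]

# Key building: one index entry per record, keyed on the exact field value
# extracted from the DATA record (no name-key generation).
index = {}
for rec in records:
    index.setdefault(rec["phone"], []).append(rec["id"])

# Search: the same exact key is extracted from the search record,
# so only records with an identical Phone value become candidates.
def search(phone):
    return index.get(phone, [])

print(search("555-1212"))  # [1, 2]: both records share that exact Phone value
```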

Multi-clustering Data

You can use the multi-clustering functionality to combine results from several searches into one search. A multi-clustered data set may contain one or more clusters.
You can use the following techniques with multi-clustering to define a "household" clustering strategy that requires searches on both name and address:
  • Perform LOAD-IDT and build an index for Name.
    The following sample describes how to cluster by Name:
    * ---------------------------------------------------------------------
    * Clustering by NAME.
    * ---------------------------------------------------------------------
    clustering-definition
    *====================
    NAME=          clustering-name
    CLUSTERING-ID= AA
    IDX=           t3name
    SEARCH-LOGIC=  SSA, System(default), Population(usa),
                   Controls("FIELD=Person_Name SEARCH_LEVEL=Typical"),
                   Field(Name)
    SCORE-LOGIC=   SSA, System(default), Population(usa),
                   Controls("Purpose=Person_Name MATCH_LEVEL=Typical"),
                   Matching-Fields("Name:Person_Name")
    OPTIONS=       Pre-Load
    SCHEDULE=      job-loadit
    *
    job-definition
    *=============
    NAME= job-loadit
    TYPE= loadit
    FILE= lf-input
    *
  • Perform second LOAD-IDT with ReIndex to rebuild the key index using the Address key field.
    The following sample describes how to cluster by Address:
    * ---------------------------------------------------------------------
    * Clustering by Address.
    * ---------------------------------------------------------------------
    clustering-definition
    *====================
    NAME=          clustering-address
    CLUSTERING-ID= AA
    IDX=           t3addr
    SEARCH-LOGIC=  SSA, System(testpops), Population(usa),
                   Controls("FIELD=Address_Part1"),
                   Field(Addr)
    SCORE-LOGIC=   SSA, System(testpops), Population(usa),
                   Controls("Purpose=Address"),
                   Matching-Fields("Addr:Address_Part1")
    OPTIONS=       Append, Pre-Load
    SCHEDULE=      job-ca-loadit
    *
    job-definition
    *=============
    NAME=    job-ca-loadit
    TYPE=    loadit
    OPTIONS= Re-Index
    *
  • Add a multi-clustering definition to create a household cluster with Name and Address.
    The following sample describes how to create a multi-clustering definition:
    * ------------------------------------------------------
    * MULTI-CLUSTERING by Name and Address
    * ------------------------------------------------------
    MULTI-CLUSTERING-DEFINITION
    *======================
    NAME=            MULTI-CLUSTERING-nfs
    CLUSTERING-ID=   AA
    IDT-NAME=        DATA-100
    CLUSTERING-LIST= clustering-name, clustering-address
    SCHEDULE=        job-cluster, job-ca-post-plural-1,
                     job-ca-post-single-1, job-post-all-1
    *
Use the following rules when you work with multi-clustering definitions:
  • The Clustering ID must be the same for both clustering-definition and multi-clustering definition.
  • Name the key index of each Clustering to avoid confusion. If you omit a key index, the default file name is used for each clustering.
  • Ensure that the second LOAD-IDT job specifies ReIndex to rebuild the key index using the new key field, for example, Address.
  • Invoke the clustering job step from the multi-clustering definition and not from the individual clustering-definitions.
    The following definition describes how to invoke the clustering job step from the multi-clustering definition:
    MULTI-CLUSTERING-DEFINITION
    *======================
    NAME=            MULTI-CLUSTERING-nfs
    CLUSTERING-ID=   AA
    IDT-NAME=        DATA-100
    CLUSTERING-LIST= clustering-name, clustering-address
    SCHEDULE=        job-cluster, job-ca-post-plural-1,
                     job-ca-post-single-1, job-post-all-1
    *
    job-definition
    *=============
    NAME=              job-cluster
    TYPE=              cluster
    CLUSTERING-METHOD= Merge
    *
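The household result produced by combining the two clusterings can be sketched as follows. This is an illustration of the combining principle only, not the DCE algorithm; the record IDs and cluster labels are invented.

```python
# Sketch: records that share a Name cluster OR an Address cluster are
# unioned into one "household" cluster (union-find over record IDs).

def combine(*clusterings):
    """Each clustering maps record_id -> cluster label. Records linked by
    any clustering end up in the same final cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for clustering in clusterings:
        first_seen = {}
        for rec, label in clustering.items():
            if label in first_seen:
                # Union with the first record seen in this cluster.
                parent[find(rec)] = find(first_seen[label])
            else:
                first_seen[label] = rec
    return {rec: find(rec) for c in clusterings for rec in c}

by_name = {1: "N1", 2: "N1", 3: "N2"}   # records 1 and 2 share a Name cluster
by_addr = {1: "A1", 2: "A2", 3: "A1"}   # records 1 and 3 share an Address
households = combine(by_name, by_addr)
# All three records fall into a single household cluster.
```

Record 2 is linked to record 1 by Name and record 3 is linked to record 1 by Address, so the combined result places all three in one household.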
