How To

This section provides a "quick start" to the parameters required to solve some common types of business problems using the Data Clustering Engine. It also contains hints and "tricks of the trade".

How to Re-Cluster Data

The Data Clustering Engine may be used to recluster records that have already been clustered. This can be achieved with either of the following techniques:

Technique #1

  • Records are loaded into the database as preclustered.
  • The records are reclustered using the same input file by specifying the Merge Clustering-method and the NO-ADD Job Option.

The NO-ADD option prevents the input records from being re-added and re-indexed on the database. It also prevents new cluster records from being added.
The real work is performed by the Merge option. The (re)clustering process uses records from the input file to match and score against records on the database. Records that reach the scoring threshold have their clusters merged. The result is a reclustered file.
A set of sample definitions demonstrating this technique can be found in test02.sdf.
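The effect of the Merge step can be sketched as a union of clusters linked by above-threshold scores. This is a minimal illustration only, not the DCE implementation; the record IDs, cluster labels, scores, and the threshold of 75 are all invented for the example.

```python
# Sketch: clusters whose records score at or above the threshold
# against each other are merged (union-find over cluster labels).

def merge_clusters(assignments, scored_pairs, threshold):
    """assignments: {record_id: cluster_id}; scored_pairs: [(id1, id2, score)].
    Returns a new {record_id: cluster_id} with merged clusters."""
    parent = {c: c for c in set(assignments.values())}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path compression
            c = parent[c]
        return c

    for a, b, score in scored_pairs:
        if score >= threshold:
            ra, rb = find(assignments[a]), find(assignments[b])
            if ra != rb:
                parent[rb] = ra  # merge the two clusters

    return {rec: find(cl) for rec, cl in assignments.items()}

# Records 1 and 2 start in different clusters; a score of 82 merges them,
# while the 40 scored between records 2 and 3 is below the threshold.
before = {1: "C1", 2: "C2", 3: "C3"}
after = merge_clusters(before, [(1, 2, 82), (2, 3, 40)], threshold=75)
```

After the merge, records 1 and 2 share one cluster and record 3 keeps its own.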

Technique #2

  • Records are loaded into the database and clustered using a key field such as Name.
  • A key index is generated for a new key field, say Address.
  • A second clustering step is run to recluster the records using the Address data. This merges clusters of Names that have matching Addresses.

A set of sample definitions demonstrating this technique can be found in test03.sdf.
Notice the following features:
  • The ClusteringID must be the same for both Clusterings.
  • Each Clustering’s Key Index is named to avoid confusion. If Key Index were omitted, the same (default) file name would be used for each clustering.
  • The second LOADIT job specifies ReIndex to rebuild the key index using the new Key-Field (Address).
  • The Reclustering step specifies that the data is PRE-LOADed and that a Merge operation with NO-ADD is to be performed.

Input from a Named Pipe

On Unix platforms the input processor can read input from a named pipe. This means that it is possible to read data from another database without the need to create large intermediate files.
The concept is identical on all Unix platforms, although the command used to create a named pipe may vary between implementations. The following example is applicable to Linux.
mkfifo $SSAPROJ/inpipe
To use the pipe, specify its name as the Physical-File parameter in the Logical-File-Definition of the input file:

Logical-File-Definition
*======================
NAME=          lf-input
PHYSICAL-FILE= "+/inpipe"
COMMENT=       "named pipe for the load step."
VIEW=          DATAIN
FORMAT=        TEXT
AUTO-ID-NAME=  Job1
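The producer side of such a pipeline can be sketched as follows. This is a hypothetical illustration: the pipe path, the sample row, and the in-process writer thread stand in for a real database-export process, and in practice the reader would be DCE's input processor rather than a Python `open`.

```python
# Sketch: feed a named pipe from a producer while a reader consumes it,
# avoiding a large intermediate file. Unix only (os.mkfifo).
import os
import tempfile
import threading

pipe = os.path.join(tempfile.mkdtemp(), "inpipe")
os.mkfifo(pipe)  # same effect as the shell command: mkfifo $SSAPROJ/inpipe

def producer():
    # Stands in for a database export writing rows to the pipe.
    # open() blocks until a reader opens the other end.
    with open(pipe, "w") as f:
        f.write("JOHN SMITH,123 MAIN ST\n")

t = threading.Thread(target=producer)
t.start()

# The reader (DCE's load step, in practice) sees the data as a plain file.
with open(pipe) as f:
    data = f.read()
t.join()
```

The reader simply opens the pipe path like any file, which is why the Physical-File parameter needs no special syntax on Unix.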

For W32 platforms:

The pipe must be of the "blocking" type, created by calling the Windows API function CreateNamedPipe before the Data Clustering Engine is instructed to read from the pipe.
To use a named pipe, specify its name in Microsoft’s format: \\server\pipe\<pipename>. That is, two backslashes, the server name (or a dot for the current machine), a backslash, the word "pipe", another backslash, and then the name of the named pipe.
The <pipename> part of the name can include any characters, including numbers, spaces and special characters, but not backslashes or colons. The entire pipe name string can be up to 256 characters long. Pipe names are not case sensitive.
If you do not specify a name starting with "\\.\pipe\", an ordinary file is assumed.
You can specify the file in the SDF. For example:
logical-file-definition
*======================
NAME=          LF-input
COMMENT=       "named pipe"
PHYSICAL-FILE= "\\.\pipe\namedpipe"
VIEW=          DATAIN
FORMAT=        TEXT
AUTO-ID-NAME=  Job1
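The naming rules above can be checked mechanically. The following is a hypothetical validator, not part of DCE; it only encodes the format constraints just described (two leading backslashes, a server name or dot, the literal word "pipe", a pipename free of backslashes and colons, and a 256-character limit).

```python
# Sketch: validate a Windows named-pipe path of the form \\server\pipe\<pipename>.

def is_valid_pipe_path(path):
    if len(path) > 256:
        return False                      # entire string limited to 256 chars
    if not path.startswith("\\\\"):
        return False                      # must begin with two backslashes
    parts = path[2:].split("\\", 2)       # -> [server, "pipe", pipename]
    if len(parts) != 3:
        return False
    server, word, pipename = parts
    if word.lower() != "pipe" or not server or not pipename:
        return False                      # names are case-insensitive
    return "\\" not in pipename and ":" not in pipename

# Anything that does not look like "\\.\pipe\..." is treated as a file name.
print(is_valid_pipe_path(r"\\.\pipe\namedpipe"))   # True
print(is_valid_pipe_path(r"C:\data\input.dat"))    # False
```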

Reformat Input Data

The PRE utility can be used as a standalone tool to reformat files. It can be used to
  • reorder fields,
  • delete fields,
  • combine fields,
  • insert text between fields.
The strength of this utility comes from its use of SSA-DB’s view processor. An input view is used to describe the layout of the input file (DATAIN11 in the example below).
The Logical-File-Definition describes the name and format of the input file; %SSAPROJ%/data/nm1k.dat and Text respectively.
PRE reads the input file using the input view and transforms the fields to match the output view specified by the Project-Definition’s FILE= parameter. The output view is normally described in the file definition section of the SDF. Under normal conditions, the output of PRE is a compressed binary file called fmt.tmp.
You can disable compression by specifying the Clustering-Definition’s Options=--Compress-Temp parameter.
You can generate Text format output by specifying the Job-Definition’s Output-Options=Trim,CR parameters.
Project-Definition
*=================
NAME=         pre-job
ID=           01
FILE=         DATA11
DEFAULT-PATH= "+"
*
Clustering-Definition
*====================
NAME=          clustering-pre
CLUSTERING-ID= aa
OPTIONS=       Format, --Compress-Temp
SCHEDULE=      job-pre
*
Job-Definition
*=============
NAME=           job-pre
TYPE=           pre
FILE=           lf-input
OUTPUT-OPTIONS= Trim, CR
*
*
Logical-File-Definition
*======================
NAME=          lf-input
PHYSICAL-FILE= "+/data/nm1k.dat"
COMMENT=       "the input file"
VIEW=          DATAIN11
FORMAT=        TEXT
*

The input and output views are used to specify how the file is to be modified:
  • Fields are reordered by changing their relative positions in the input and output views.
  • A field may be deleted by omitting it from the output view.
  • Fields can be "combined" by reordering them to be consecutive. The input view for the next phase could then treat the adjacent fields as one "large" field.
  • Fixed data can be inserted between fields by adding filler(s) to the output view.
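The four view-driven operations above can be sketched in a few lines. This is an illustration of the principle only: the field names, widths, sample record, and filler text are invented, and real layouts come from the SDF's view definitions (e.g. DATAIN11).

```python
# Sketch: reformat a fixed-width record by slicing it per an input view
# and re-emitting fields per an output view (reorder, delete, filler).

IN_VIEW = [("name", 10), ("phone", 8), ("city", 6)]           # input layout
OUT_VIEW = [("city", 6), ("filler", 3, " | "), ("name", 10)]  # phone deleted

def reformat(record):
    # Slice the fixed-width record according to the input view...
    fields, pos = {}, 0
    for fname, width in IN_VIEW:
        fields[fname] = record[pos:pos + width]
        pos += width
    # ...then emit fields in the output view's order, inserting fillers.
    out = []
    for item in OUT_VIEW:
        if item[0] == "filler":
            out.append(item[2])           # fixed text inserted between fields
        else:
            out.append(fields[item[0]])   # field reordered from the input
    return "".join(out)

row = "JOHN SMITH555-1212LONDON"
print(reformat(row))  # "LONDON | JOHN SMITH": city first, filler, name; phone dropped
```

Combining fields is the remaining case: placing two fields consecutively in the output view lets the next phase's input view read them as one larger field.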

Creating an Index and Search-Logic for any DATA field

The DCE utilities can be used to create an index for, and search on, any field in the IDT. Under normal circumstances the KEY-FIELD is used to generate name-keys using a Key-Logic module. This procedure can be modified to create an index for any field.
By defining an IDX-Definition and a Search-Definition that name the field to be indexed as the Key-Field, and by specifying a Key-Logic and Search-Logic of User, we effectively define a key index and search definition that contain the exact key-value extracted from the DATA record.
For key building (IDX-Definition):

KEY-LOGIC=    User, Field(Phone)

For search (Search-Definition or Clustering-Definition):

SEARCH-LOGIC= User, Field(Phone)
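The effect of a User Key-Logic and Search-Logic can be pictured as an exact-value lookup table, in contrast to the fuzzy name-keys a Key-Logic module would generate. The records and field values below are invented for illustration.

```python
# Sketch: KEY-LOGIC=User, Field(Phone) indexes the raw field value;
# SEARCH-LOGIC=User, Field(Phone) retrieves only exact-key matches.

records = [
    {"id": 1, "name": "JOHN SMITH", "phone": "555-1212"},
    {"id": 2, "name": "JON SMYTHE", "phone": "555-1212"},
    {"id": 3, "name": "ANN JONES",  "phone": "555-9999"},
]

# Key building: one index entry per record, keyed on the exact field value
# extracted from the DATA record (no name-key generation).
index = {}
for rec in records:
    index.setdefault(rec["phone"], []).append(rec["id"])

# Search: the same exact key is extracted from the search record,
# so only records with an identical Phone value become candidates.
def search(phone):
    return index.get(phone, [])

print(search("555-1212"))  # [1, 2]: both records share that exact Phone value
```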

Multi-clustering Data

You can use the multi-clustering functionality to combine results from several searches into one search. A multi-clustered data set may contain one or more clusters.
You can use the following techniques with multi-clustering to define a "household" clustering strategy that requires searches on both name and address:
  • Perform LOAD-IDT and build an index for Name.
    The following sample describes how to cluster by Name:
    * ---------------------------------------------------------------------
    * Clustering by NAME.
    * ---------------------------------------------------------------------
    clustering-definition
    *====================
    NAME=          clustering-name
    CLUSTERING-ID= AA
    IDX=           t3name
    SEARCH-LOGIC=  SSA, System(default), Population(usa),
                   Controls("FIELD=Person_Name SEARCH_LEVEL=Typical"),
                   Field(Name)
    SCORE-LOGIC=   SSA, System(default), Population(usa),
                   Controls("Purpose=Person_Name MATCH_LEVEL=Typical"),
                   Matching-Fields("Name:Person_Name")
    OPTIONS=       Pre-Load
    SCHEDULE=      job-loadit
    *
    job-definition
    *=============
    NAME= job-loadit
    TYPE= loadit
    FILE= lf-input
    *
  • Perform second LOAD-IDT with ReIndex to rebuild the key index using the Address key field.
    The following sample describes how to cluster by Address:
    * ---------------------------------------------------------------------
    * Clustering by Address.
    * ---------------------------------------------------------------------
    clustering-definition
    *====================
    NAME=          clustering-address
    CLUSTERING-ID= AA
    IDX=           t3addr
    SEARCH-LOGIC=  SSA, System(testpops), Population(usa),
                   Controls("FIELD=Address_Part1"),
                   Field(Addr)
    SCORE-LOGIC=   SSA, System(testpops), Population(usa),
                   Controls("Purpose=Address"),
                   Matching-Fields("Addr:Address_Part1")
    OPTIONS=       Append, Pre-Load
    SCHEDULE=      job-ca-loadit
    *
    job-definition
    *=============
    NAME=    job-ca-loadit
    TYPE=    loadit
    OPTIONS= Re-Index
    *
  • Add a multi-clustering definition to create a household cluster with Name and Address.
    The following sample describes how to create a multi-clustering definition:
    * ------------------------------------------------------
    * MULTI-CLUSTERING by Name and Address
    * ------------------------------------------------------
    MULTI-CLUSTERING-DEFINITION
    *======================
    NAME=            MULTI-CLUSTERING-nfs
    CLUSTERING-ID=   AA
    IDT-NAME=        DATA-100
    CLUSTERING-LIST= clustering-name, clustering-address
    SCHEDULE=        job-cluster, job-ca-post-plural-1,
                     job-ca-post-single-1, job-post-all-1
    *
Use the following rules when you work with multi-clustering definitions:
  • The Clustering ID must be the same for both clustering-definition and multi-clustering definition.
  • Name the key index of each Clustering to avoid confusion. If you omit a key index, the default file name is used for each clustering.
  • Ensure that the second LOAD-IDT job specifies ReIndex to rebuild the key index using the new key field, for example, Address.
  • Invoke the clustering job step from the multi-clustering definition and not from the individual clustering-definitions.
    The following definition describes how to invoke the clustering job step from the multi-clustering definition:
    MULTI-CLUSTERING-DEFINITION
    *======================
    NAME=            MULTI-CLUSTERING-nfs
    CLUSTERING-ID=   AA
    IDT-NAME=        DATA-100
    CLUSTERING-LIST= clustering-name, clustering-address
    SCHEDULE=        job-cluster, job-ca-post-plural-1,
                     job-ca-post-single-1, job-post-all-1
    *
    job-definition
    *=============
    NAME=              job-cluster
    TYPE=              cluster
    CLUSTERING-METHOD= Merge
    *
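The household result produced by combining the two clusterings can be sketched as follows. This is an illustration of the combining principle only, not the DCE algorithm; the record IDs and cluster labels are invented.

```python
# Sketch: records that share a Name cluster OR an Address cluster are
# unioned into one "household" cluster (union-find over record IDs).

def combine(*clusterings):
    """Each clustering maps record_id -> cluster label. Records linked by
    any clustering end up in the same final cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for clustering in clusterings:
        first_seen = {}
        for rec, label in clustering.items():
            if label in first_seen:
                # Union with the first record seen in this cluster.
                parent[find(rec)] = find(first_seen[label])
            else:
                first_seen[label] = rec
    return {rec: find(rec) for c in clusterings for rec in c}

by_name = {1: "N1", 2: "N1", 3: "N2"}   # records 1 and 2 share a Name cluster
by_addr = {1: "A1", 2: "A2", 3: "A1"}   # records 1 and 3 share an Address
households = combine(by_name, by_addr)
# All three records fall into a single household cluster.
```

Record 2 is linked to record 1 by Name and record 3 is linked to record 1 by Address, so the combined result places all three in one household.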
