User Guide


Partitions

Partitioning is used to reduce the size of candidate sets.

For very large files, the key generated from the KEY-FIELD may have low selectivity because of the sheer volume of data in the file: each key value matches many records. Searching for candidates using the key alone will therefore create very large candidate sets.

If the nature of the data is well understood, it may be possible to qualify the key with additional data from the record so that the "qualified key" becomes more selective.

The PARTITION option instructs the Data Clustering Engine to build a concatenated key from the KEY-FIELD and up to five fields/sub-fields taken from the data record. The partition information forms a high-order qualifier for the key (it is prefixed to the key).
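The concatenation described above can be sketched in Python. This is only an illustration of the idea, not the Data Clustering Engine's implementation or syntax: the function name `build_qualified_key`, the `(field, start, length)` tuple used to describe a sub-field, and the sample record are all assumptions made for this example.

```python
from typing import Mapping, Optional, Sequence, Tuple

# The text allows up to five fields/sub-fields in a partition.
MAX_PARTITION_PARTS = 5

# Hypothetical description of one partition part:
# (field name, start offset, length); length None means "to the end of the field".
PartitionPart = Tuple[str, int, Optional[int]]


def build_qualified_key(
    record: Mapping[str, str],
    key: str,
    parts: Sequence[PartitionPart],
) -> str:
    """Prefix the key with partition data taken from the record."""
    if len(parts) > MAX_PARTITION_PARTS:
        raise ValueError("at most five partition fields/sub-fields are allowed")
    prefix = ""
    for name, start, length in parts:
        value = record[name]
        end = None if length is None else start + length
        prefix += value[start:end]
    # The partition information is the high-order part of the qualified key.
    return prefix + key


# Illustrative record; "SMITH" stands in for whatever key the engine derives.
record = {"name": "JOHN SMITH", "postcode": "2600", "state": "ACT"}
qualified = build_qualified_key(
    record, "SMITH", [("state", 0, None), ("postcode", 0, 2)]
)
print(qualified)  # ACT26SMITH
```

Because the partition data comes first, two records can share a qualified key only if they agree on every partition part as well as on the key itself.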

For example, an application may need to cluster all names in a telephone directory. If it is acceptable to examine only candidates from a particular region, the data can be partitioned using a post-code or some other information that groups the candidates into regions. Performance improves because candidate sets are smaller. The disadvantage is that candidates are selected only from within a region, never from outside it. If this makes sense for the "business problem" being solved, partitioning can be used.
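The effect on candidate set sizes can be seen in a small Python sketch. It assumes a toy directory, uses the surname as a stand-in for the generated key, and groups records in a dictionary; the names `candidate_sets` and `directory` are hypothetical and the grouping is not how the Data Clustering Engine stores or searches keys.

```python
from collections import defaultdict


def candidate_sets(records, key_of, partition_of=None):
    """Group records by key, optionally qualified by a partition value."""
    sets = defaultdict(list)
    for rec in records:
        key = key_of(rec)
        if partition_of is not None:
            # The partition value acts as a high-order qualifier for the key.
            key = (partition_of(rec), key)
        sets[key].append(rec)
    return dict(sets)


# Toy telephone directory (illustrative data only).
directory = [
    {"surname": "SMITH", "postcode": "2600"},
    {"surname": "SMITH", "postcode": "2600"},
    {"surname": "SMITH", "postcode": "3000"},
    {"surname": "SMYTH", "postcode": "3000"},
]

by_key = candidate_sets(directory, key_of=lambda r: r["surname"])
by_region = candidate_sets(
    directory,
    key_of=lambda r: r["surname"],
    partition_of=lambda r: r["postcode"],
)

largest_unpartitioned = max(len(s) for s in by_key.values())
largest_partitioned = max(len(s) for s in by_region.values())
print(largest_unpartitioned, largest_partitioned)  # 3 2
```

The SMITH record in post-code 3000 is no longer a candidate for the two SMITH records in post-code 2600, which is exactly the trade-off described above: smaller candidate sets, but no matches across regions.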

The Data Clustering Engine has an option to write statistics to the database log. If this option is set and you choose to cluster using a partition with many values, statistics will be written for each partition.

This overhead can become prohibitive when there are very many partitions. In this case, database logging should be turned off; otherwise the cost of logging might undo the gains made by partitioning. Refer to the Verbosity Options section.

