You are a data steward at a bank. You are concerned that multiple records in the customer account tables might contain the same information. The duplicate records might represent data entry errors, or they might represent fraudulent customer activity.
You define the following process to find the duplicate records and to identify a single preferred version of each set of records:
You ask a developer to configure one or more mappings to identify the duplicate records.
The mappings calculate a set of numeric scores that represent the levels of duplication between the data values in the records. High scores indicate duplicate records, and low scores indicate unique records. Some records have marginal scores that indicate that the duplicate status of the records is uncertain.
The developer configures an additional mapping that reads the numeric scores. The developer adds the mapping to a workflow that includes a Mapping task and a Human task.
The Mapping task runs the mapping. The Mapping task writes the records to different tables based on the scores that they contain.
The Human task distributes the records with marginal scores to tasks that you and other users can open in the Analyst tool.
You log in to the Analyst tool, and you open a task.
The Analyst tool organizes the records in a series of clusters. Each cluster contains two or more records that contain similar information. By default, the first record in a cluster is the preferred record.
You open a cluster, and you analyze the records that it contains.
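The score-based routing in the process above can be illustrated outside the product with a short sketch. It is not the Mapping task's implementation: the threshold values, the table names, and the `route_by_score` function are assumptions made for illustration only, because the scenario does not state them.

```python
# Hypothetical thresholds: the scenario does not state the actual values.
DUPLICATE_THRESHOLD = 0.9   # at or above this score, treat the record as a duplicate
UNIQUE_THRESHOLD = 0.6      # below this score, treat the record as unique


def route_by_score(records, low=UNIQUE_THRESHOLD, high=DUPLICATE_THRESHOLD):
    """Split records into three hypothetical tables based on their scores."""
    tables = {"duplicates": [], "review": [], "unique": []}
    for record in records:
        score = record["score"]
        if score >= high:
            tables["duplicates"].append(record)
        elif score < low:
            tables["unique"].append(record)
        else:
            # Marginal score: duplicate status is uncertain, so a user reviews it.
            tables["review"].append(record)
    return tables


records = [
    {"id": 1, "score": 0.97},
    {"id": 2, "score": 0.75},
    {"id": 3, "score": 0.20},
]
tables = route_by_score(records)
```

In this sketch, only the records in the `review` list correspond to the records that the Human task distributes to users in the Analyst tool.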
You perform the following actions in each cluster:
You examine the data values in each column of record data. You select the most accurate value in each column and promote the value to the preferred record.
You can edit the values that you select, and you can search for records that contain common values in other clusters.
If a record does not belong in the current cluster, you move it to another cluster or you create a cluster for the record.
You update the cluster status to indicate that you reviewed the cluster. You complete the task when you verify the current preferred record in every cluster.
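The effect of promoting values to the preferred record can be modeled with a minimal sketch. The record layout, the sample values, and the `promote` function are hypothetical and only mirror the actions described above; the Analyst tool performs these actions through its user interface.

```python
def promote(cluster, choices):
    """Return a copy of the preferred record with the chosen values promoted.

    cluster: list of records (dicts); the first record is the preferred record.
    choices: maps a column name to the index of the record whose value is
             promoted to the preferred record.
    """
    preferred = dict(cluster[0])  # copy so that the original record is unchanged
    for column, record_index in choices.items():
        preferred[column] = cluster[record_index][column]
    return preferred


# Hypothetical cluster of two records that contain similar information.
cluster = [
    {"name": "Jon Smith", "phone": "555-0100", "city": "Bostn"},
    {"name": "John Smith", "phone": "555-0100", "city": "Boston"},
]

# Promote the name and city values from the second record.
preferred = promote(cluster, {"name": 1, "city": 1})
```

Columns that are not listed in `choices` keep the value from the default preferred record.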
Before you update a record, verify that the task is open in edit mode. To enter edit mode, click the Edit button in the open task.
When you finish work on all of the clusters in the task, you update the task status. The task status indicates that the records are ready for the next stage in the data quality process.
The next stage for the data depends on the configuration of the Human task. For example, the Human task might include additional steps that assign the clusters to other users for review.
When the Human task completes, the next stage of the workflow begins.