Matching Projects in DQS

In this, the sixth installment of the DQS series we are going to move to the next level and demonstrate the ins and outs of a matching project using the matching policy we created in the last post. For brevity, we will skip over most of the general info associated with Data Quality Projects since it  was covered in prior posts. If you have not been following along you can catch up below.

DQS Blog Series Index

Part 1: Getting Started with Data Quality Services (DQS) 2012

Part 2: Building Out a Knowledge Base

Part 3: Knowledge Discovery in DQS

Part 4: Data Cleansing in DQS

Part 5 : Building a Matching Policy in DQS

Part 6: Matching Projects in DQS

Part 7: Activity Monitoring, Configuration & Security in DQS

Prerequisites

The prerequisite before undertaking a matching project is that the data set you are working with has been cleansed. Cleansed in this broad sense means that you have taken the time to ensure that the data is as complete as it came be,  checked for both validity and accuracy and finally, standardized. In fact the cleaner the data the better the matching results will be. That said the cleansing activity in DQS was design to do exactly these things and should be utilized before you start matching.

The Matching Project

To kick the project off, launch the Data Quality Client and connect to the DQS server. When you select “New Data Quality Project” from the Data Quality Client, the two options for activity types you have are Cleansing and Matching. We will obviously be dialing in on marching. Before we dive in, it’s worth pointing out that the Matching activity option may be greyed-out indicating the activity type is not available. This happens when the knowledge base that is selected does not have a published matching policy, which in our case will not be an issue.

2-20-2012-3-45-02-PM_thumb

Create Matching Project

With that, go ahead enter “Demo Matching” as the project name on the New Data Quality Project page and ensure the Sports Teams knowledge base and the Cleansing activity are selected, then click “Next” to launch the three step wizard.

The first step of the wizard is the set-up in which you map the knowledge base to your source of data. You have the same two options for sourcing your data as in the knowledge discovery and cleansing process (SQL Server and Excel file). In fact, the mapping screens are nearly identical between the three activities. The difference here is above the knowledge base details on the right-hand side of the screen. Provided for you is a list of domains which are defined in the matching policy rules and are required to perform a matching project.

2-20-2012-4-05-55-PM_thumb

Mapping the Match Data Source

Using the “SportsTeamsMatching.xls” provided for you in the Resources section, choose “Excel File” at the Data Source prompt, select the local copy of the demo Excel file and then choose the ?SportsTeams$? worksheet. The first row of the worksheet has headers, so leave the “Use first row as header” option checked.

Notice that by default, mapping are done for you if the DQS Client can determine the appropriate mappings based on the Source Column name. The next thing to point out is that only domains included in matching policy rules are mapped. Meaning that even though we have domains for City and Stadium Capacity they are not mapped to Source Columns. Once you have finished mapping your source data, click “Next”.

On the Matching step, the paradigm is again the same as we have seen with prior activities. Clicking ?Start? loads the data from the Excel file to the DQS server for processing. Once the processing is complete, you can preview and modify the results by rejecting matches if necessary. The results are filtered either by matching clusters or by unmatched records and the Filter prompt will allow you to switch between the two views.

2-20-2012-4-09-50-PM_thumb2

DQS Matching Preprocessing

It is at this point that you have the option of using either overlapping or non-overlapping clusters. Overlapping clusters will result in potential more clusters and as the name implies there may be some overlap in that a single record can belong to multiple clusters. Non-overlapping clusters group records into clusters yet keeps the records distinct across the data set so that there is no overlap. To switch between the two matching options, you must restart the matching process by click the “Restart” option. Finally, before we move one are the profiling tabs.

2-20-2012-4-23-48-PM_thumb
Matching Results Overlapping Clusters
2-20-2012-4-24-06-PM_thumb
Matching Results Non-Overlapping Clusters

As in other activities the Profiler tab provides you with additional info about the data set that was processed. The two most relevant tabs to this activity are the Matching Rules and Matching Results tabs which are the same as we saw when we put together the matching rules for our knowledge base.

2-20-2012-4-30-04-PM_thumb
Matching Rules Tab
The Matching Rules tab allows you to browse the matching rules that were used and provides a square color key that can be used to determine which rule was used for each match record to build the groups.
2-20-2012-4-30-11-PM_thumb
Matching Results Tab
The Matching Results tab provides statistical information about the matching activity such as percentage matched and unmatched, number of clusters as well as average, minimum and maximum cluster size.

Click “Next” to move to the Export step which is the final step of the wizard.

The export step offers you two formats to get the matched data out of your project. The first is SQL Server which has the same limitations as we noted several times in that you must chose a database that resides on the same server as the instance of DQS. The DQS_Staging database would be a good option if this is the route you choose. The second option and the one we will use for the purpose of this demo is to export to a CSV file.

2-20-2012-4-42-32-PM_thumb1

Matching Export

 

As part of the export process, there are two types of exports that are available. The first is the matching results which contains the records grouped by cluster, the cluster and matching information and the matched records. The second is a survivorship export which in essence de-duplicates the records and returns  only the leading record for each cluster as well as any unmatched records. The leading record that is returned is configured based on one of the four options:

1. Pivot Record (a randomly record in the cluster identified by DQS)

2. Most complete and longest record

3. Most complete record

4. Longest record

When multiple records in a cluster meet the specified survivorship qualification, DQS randomly picks one of the records as the leading record.

2-20-2012-4-43-53-PM_thumb
Matching Export CSV
2-20-2012-4-44-08-PM_thumb
Survivorship Export CSV

With you export complete, click “Finish” to complete the project.

Wrap-Up

As you can see, the DQS matching activity can be a very important and powerful activity. Whether you are taking on a customer de-duplication project or doing householding for a marketing campaign, the matching functionality provided by DQS is a powerful and useful tool in helping you successfully complete the task at hand.

The next blog post will be the final in this series. We will review the monitoring, security and configuration features available in DQS.

Resources

1. SportsTeamMatching.xls

2. Complete DQS Sports Team Knowledge Base (Rename extension from .txt to .dqs)

Till next time!!

Chris

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s