Over the course of the last four blog posts we have explored the features and functionality of Data Quality Services 2012, created a knowledge base, trained it through the knowledge discovery process and finally used it to perform data cleansing. In this blog post we will take the next step by building out a matching policy. Before we dive in let’s put some context around what matching in DQS means.
DQS Blog Series Index
Part 2: Building Out a Knowledge Base
Part 3: Knowledge Discovery in DQS
Part 4: Data Cleansing in DQS
Part 5 : Building a Matching Policy in DQS
Part 6: Matching Projects in DQS
One of the five major tenants of measuring and ensuring data quality is uniqueness (the other four being standardization, completeness, accuracy and validity). It makes sense then, that DQS provide as a core functionality the ability to perform matching.
Matching within DQS is policy-driven or rules-based and as a result provides a measure of the degree of uniqueness within a single set of data. When a record is matched by the DQS algorithm to one or more other records within the set of data it can take one of two forms. The first and more straight-forward is a exact match. This is pretty much self-explanatory. The second form is called an approximate match. In this case DQS found similarities between multiple records and a weighed-probability is returned to reflect the degree to which the records are similar.
Matches are grouped or arranged by DQS into clusters. These clusters can be set as either overlapping or non-overlapping. For overlapping clusters, a single record may participate in multiple match clusters. Non-overlapping then is the inverse of that in that a record only participates in the cluster identified as a best match.
Now that we have a solid foundation of understanding let’s add a matching policy to our knowledge base.
The Matching Policy
To start building the matching policy, click the arrow to the right of the knowledge base to bring up the context menu. On the context menu, select “Matching Policy”. This launches a three-step wizard which walks you the creating and testing a matching policy.
Matching Policy on Context Menu
The first step of wizard sets-up a sample data set used to evaluate the rules created. It should look very familiar if you have been following along with my previous blog posts. On the left you have prompts for specifying and mapping the source of your sample data. On the right, you have a preview of the knowledge, including domains, composite domains and matching rules.
Match Policy Wizard – Map Data Source
We will work with an Excel file, so make the appropriate selection at the data source prompt, then navigate to your local copy of the SportsTeamsBadData.xls file at the Excel File prompt. Select “SportsTeams$” as the Worksheet prompt and finally, leave the “”Use first row as header” option selected.
The mapping sections is the same as we saw previously as well. You have the options to add or remove mapping rows as well as an option to preview the data source. It should be pointed out that at this step you have the option to add a new domain if required. For the demo we are going to create a rule around the State column in the data set. Since we have not previously worked with this column we will need to create a new domain.
Create Domain Dialog
View/Select Composite Domain
To create a new domain, click the “Create a domain” icon. On the Create Domain dialog, enter “State” at the Domain Name prompt and accept the defaults for all other fields. Click “OK” to finish creating the domain. Now that we have all the required domains, map the Team, League, Stadium, Capacity and State columns to the appropriate domains. Before we move on we need to de-select the composite domain because we want to work with single domains when creating our rules. You can make this selection by clicking the “View/Select Composite Domains” button and then unselecting the SportsTreams composite domain. When you are finished, click “Next” to move to the next step in the wizard.
Match Policy Wizard – Create Matching Policy
Before we jump in and start creating the matching rules, let’s talk about the two rules we need to create for our matching policy. Since we are working with sports teams where team name along does not make them distinct we have to build our rules in such a way that we don’t create incorrect matches. For example, we wouldn’t want to match a record for the baseball Giants to the football Giants or the football Cardinals to the baseball Cardinals.
To prevent these we will build rules that consider more than just a single field. The first rule will look at league and team name. We will set the rule up so that the league is a prerequisite match and the we can use a more fuzzy-type match for the team name. Our second rule will be similar to the first except we will use the state domain and team name. We will use this rule as a follow-through for the first just in case the league is blank or null.
To create the first rule, click the “Create a matching rule” icon. A new rule is created with a default name and a red “X” icon indicating the rule is not yet valid. The rule details and rule editor on the right is where the set-up, configuration and testing occurs.
Start by naming the first rule, “League and Team Name”. We will leave the minimum matching score at 80% for the time being. If you don’t notice it at first, the floor is set at 80% and you can’t set the score lower than that. If you need to change that value to a lower score the minimum record score property in the general setting configuration section will allow to alter the floor value.
Admin General Settings
To define the first part of our rule, click the “Add a new domain element” icon above the rule editor. A new row is added to the rule editor grid. For the domain, select “League” and then check the Prerequisite checkbox. Notice that the similarity field changed to Exact and is locked. This is the behavior for a prerequisite match as its make the domain match all or nothing.
Matching Policy ? Rule #1
Click the “Add a new domain element” icon again to add the second part to our first rule. This time we want the Team domain. Leave “Similar” for the similarity field and change the weight to 100%. The weight column is useful if you are matching on multiple fields and need to have each contribute a different (larger or smaller) ratio of the matching score. In this case we are only matching on one field so the weight must be 100%.
Finally, we are ready to test the rule. Click “Start” to run the test.
Once the processing completes you should notice three clusters (if you used the SportsTeamsBadData,xls file). The grid contains the raw record data as well as:
1. A Record Id, this id is added by DQS.
2. A Cluster column that identifies the record that the matches cluster around.
3. A Score, which is the match percent to the cluster record.
You can evaluate the clusters to determine if your rule is behaving as you expect. There’s a couple things I want to point out that is important. First, is the option to switch between matched and unmatched records. This functionality is pretty much self-explanatory.
The next is an important one and something that is much more subtle. Notice in the “Tampa Bay Buccaneers” cluster that both Bucs and Buccaneers match score is 100%. Why do you think that is? The answer is actually fairly simple. When we created and trained our knowledge base we created both of those as synonyms to Tampa Bay Buccaneers. That’s the power of DQS and more specifically the knowledge base.
Now that the we are satisfied that the first rule is working as we planned, we can move to the second rule. For this rule we are going to take a different route to illustrate how the weightings work. Go ahead and create the new rule and name it “State and Team Name”.
Matching Policy – Rule #2
Add a domain element to the Rule Editor and select the “State” domain. For the similarity, select “Exact” and set the weight to 25%. Next add a domain element for Team, leave similar as the similarity type and set the weight to 75%. What we have done is effectively recreate what we did in the first rule when we click the Prerequisite option.
There’s danger in doing it this way however. Doing it this way works as long as the minimum matching score is greater than 75%. If the minimum matching score goes lower, it would be possible for the Team domain to fulfill the minimum score potentially leading to bad matches. Needless to say you wouldn’t want to ever do this in the real-world and it was only done here for illustrative purposes.
Once, you have created the rule, start to the test and make sure the results are acceptable. At this point, I want to call your attention to an option we have ignored until now. That would be radio button list that by default is set to “Execute on previous data”. The second option is “Reload data from source”. This is a very useful feature if you need to modify your sample data to test a specific scenario. For the demo, you can tweak either State or Team values to see how it impacts the matching results. One your testing is complete we are ready to move to the next step, click “Next” to move to the final step of the wizard.
The final step of the wizard, uses the rules and policy you have define to match against the sample data set. Click start to process the sample data against the matching policy. Once the process is finished, the results which are very similar to those that you saw during testing. You have many of the same options as far as reloading data and filtering matched versus unmatched records. This is also the perfect place to call your attention to overlapping and non-overlapping clusters.
Matching Policy – Matching Results
For the first run, we used overlapping clusters. Look carefully at the matches and you will notice that record 1000065 “Tampa Bay Bucaneers” is participating in multiple clusters. This may or may not be the behavior we expect. If it is not, change the drop-down to non-overlapping clusters and restart the process. When you do, the number of clusters drops from three to two and there is no overlap.
Matching Results ? Overlapping Clusters
Matching Results ? Non-Overlapping Clusters
The last thing to discuss before we wrap-up is the profiler. There are three tabs that contain useful profiling information. The first, is one that you should be fairly familiar with is the Profiler tab. This tab contains the various source statistics both at the aggregate and domain level. Because we added a new domain for State you will see a notification bubble telling us there is a significant number of new domain values for the domain. This could be a case where we need to run the knowledge discovery process again to add knowledge to our knowledge base.
The second tab, Matching Rules, give you an overview of the rules that built into the knowledge base and the ability to interrogate the rules to see which domains are included and how they participate.
Matching Rules Tab
The third and final tab is the Matching Results tab. This tab contains the matching statistics such as percent matched and percent unmatched as well as number of clusters and the minimum, maximum and average size of the clusters.
Matching Results Tab
To review, we took our Sports Teams knowledge base and developed a matching policy with a couple of simple rules that will allow a data steward or IT Pro to use our knowledge to perform meaningful matching. The rules we built contain relevant domain knowledge (i.e. Team Names alone are not unique) that is required for any matching activity within our domain data.
In the next blog post, we will take our matching policy and put it to work in matching Data Quality Project.
Till next time!!