Knowledge Discovery in DQS

In my last post, we set-up and configured a DQS knowledge base. In this post we are going to explore the DQS knowledge discovery process.

Before we get started let’s recap our demo scenario. We are working with set of data that contains all MLB and NFL teams, the league they play in, their stadium names and capacities. Our knowledge base was built using a domain for each data element (Team, League, Stadium, Capacity). We also constructed a composite domain to allow for cross-domain validation and then built-in a series of business rules to ensure accuracy of our data. I have exported a copy of the knowledge base built during the last post and it is available in the Resources section below.

DQS Blog Series Index

Part 1: Getting Started with Data Quality Services (DQS) 2012

Part 2: Building Out a Knowledge Base

Part 3: Knowledge Discovery in DQS

Part 4: Data Cleansing in DQS

Part 5 : Building a Matching Policy in DQS

Part 6: Matching Projects in DQS

Part 7: Activity Monitoring, Configuration & Security in DQS

Knowledge Discovery

The power of DQS is in knowledge that it maintains about your data. Acquiring this knowledge occurs through the Knowledge Discovery process. In DQS knowledge discovery is split into two parts. This first part is a computer-assisted process in which you map a sample dataset to be analyzed by algorithms built into DQS. The second is an interactive process which engages the data steward to add to and correct the knowledge within the knowledge base. With that let’s run through the knowledge discovery process for our demo.

To start, launch the Data Quality Client and connect to the DQS server. Next locate the SportsTeamsKB knowledge base under the Knowledge Base Management section. Click the arrow to the right of the knowledge base name to bring up the contact menu and then choose “Knowledge Discovery”.

1-26-2012 5-01-27 PM

Launch Knowledge Discovery

 

The Knowledge Discovery Wizard will launch and you will notice three distinct steps (Map, Discover, and Manage Domain Values). In the mapping step you have two options for your Data Source, either SQL Server or Excel File. Although these options are limited, with a little imagination you have some flexibility in that you can use the resources available in SQL Server to connect to other data sources (i.e. Linked Servers and Views).

Let’s first look at the SQL Server data source option. The first and most glaring thing you should notice is that you are prompted only for the a database and table/view. There is no option for either server nor security credentials. This means Windows Authentication to the DQS Server is what you are left to work with.

1-26-2012 5-24-43 PM

KB Discovery Map ? SQL Server Source

The second option and the one we will use throughout the demo is the Excel File. Navigate at the”Excel File” prompt to your copy of the SportsTeamsDemo.xls (available for download down in the resources section). Next, select the “SportTeams$” worksheet from the worksheet drop-down list. Our dataset has headers so leave the “User first row as header” checked.

1-26-2012 5-24-32 PM

KB Discovery Map ? Excel File

Before we start mapping our data set to the domains in our knowledge base, I want to point out the icons that are available over the Source Column/Domain grid. First is the fairly standard Add/Remove Column Mapping options which provide you additional rows to map your columns. Next, you have the option to create both new single and composite domains from within the discovery process. This is a useful feature if you find your dataset has useful columns that were not previously defined as domains. You can also preview your data set which can be useful if you don’t have meaningful column names. Finally, I’ll point out the Profiling tab which will become more meaningful in the next step.

Now we are ready to do the mapping. For the purposes of this demo the process is straight-forward. Map the Team, League, Stadium and Capacity domains to their respective columns in the Excel data set.

Based on the columns available in your dataset, the View/Select Composite Domain button is enabled. This button only becomes active if a composite domain is eligible for use, meaning you have mapped all the domains within the composite domain to dataset columns. Within the dialog, you have the ability to deselect the composite domain if you choose not to use it.

1-26-2012 5-50-53 PM

View/Select Composite Domains Dialog

With that we are ready to move onward! Clicking the “Next” button takes you to the Discovery step. To kick-off the discovery process click the “Start” button. Once the process starts you will notice that it loads the data to the DQS server during the pre-processing, then applies domain rules and finally runs the discovery process.

1-26-2012 5-53-24 PM
Knowledge Discovery ? Start
1-26-2012 5-56-09 PM
Knowledge Discovery ? Complete

Our demo has only 62 records so the process occurs fairly fast. When it finishes you will notice that Profiler window is available and has been populated with statistics based on your data sample. We are able to view the number of new and unique values both at the aggregated dataset level and at the domain level. Also available is a visual representation of the completeness which would show the number of missing values for the records processed. After you review the profiling results, click “Next” to move to the step that allows you to interactively manage domain values.

Once on the Manage Domain Values screen several things should be obvious. The first is the list of domains on the left along with their respective count of distinct values. On the right-hand side you will see a list the list of distinct domain values there were found in the sample data as well as any other domain values that exist from any prior knowledge discovery. You can click on the domains to navigate through the selections.

1-27-2012 8-43-03 AM

Manage Domain Values

The panel on the right is titled with the name of the domain and the domain statistics. Viewing the domain values you will notice that there is a star icon next to each value. This icon indicates that the value is new. While it is not as useful during the initial training of the knowledge base it certainly is more valuable in subsequent iterations. You may also notice red-wiggly lines underneath values. These are used to highlight issue that were identified by the DQS speller.

1-27-2012 8-44-27 AM

DQS Speller Issues

The next thing to point option are the tools that are available for working with the domain values. Starting from left-to-right, you have a Find tool that will search the list of domain values (and allow you to navigate to the matches). Next, you can filter by value type and/or restrict the view to only new values. Following that, you have icons that allow you to add and delete values, link and unlink synonyms, set leading values (more on synonyms and leading values shortly), and expand/collapse values. The final tools that are available on a context menu. These allow you to change the value type (Valid, Invalid, Error), Show/Hide the History Panel and finally Enable/Disable the Speller.

1-27-2012 9-33-53 AM

Manage Domain Values Tools/Features

Now that we have done an overview of what’s available let’s work with out domain values. Let’s start work in our Team domain. To do so, click the Team domain to make it active. Use the Find feature to find the “Los Angeles Angels”. Once you find the value, note that the formal name is “Los Angeles Angels of Anaheim”. While this is the valid proper name we want to use an alternative value that shortens the name.

Start by clicking the “Define a new domain value” icon. This adds a new row to the domain values. In the new row, enter “Los Angeles Angels” as the value and hit the enter key. Unless you’ve change the default sorting, the values are alpha sorted. Scroll and find both the initial value that was profiled and the new value that you just created. Next select both row by using Ctrl-Click sequence. We can define the values as synonyms by next clicking the “Set selected domain values as synonyms” icon.

When you click the icon, notice that “Los Angeles Angels” now appears in the Correct To column. This happens because the value we created is considered the leading value. You can change the leading value by selecting any value in the synonym group that you choose and then clicking the “Set selected domain value as a leading value of its group” icon.

1-27-2012 10-32-30 AM

Team Domain Synonym Groups

As an exercise before we wrap up try creating the following new values and then link them in synonym groups as define:

1-27-2012 10-32-51 AM

Additional Team Domain Synonym Groups

Value Synonym Leading Value
Buccaneers Tampa Bay Buccaneers Tampa Bay Buccaneers
Bucs Tampa Bay Buccaneers Tampa Bay Buccaneers
Ray Tampa Bay Rays Tampa Bay Rays

The same process we used to define synonyms can be used to handle other scenarios that frequently occur. Common misspelling can be entered, marked as an error type and then you can specify a correct to value. The same can be done for invalid values that may typical appear.

Before we finish the knowledge discovery process, its important to understand the value types. The three types for values and their meaning in the DQS world are as follows:

1. Correct (Green Check)? pretty much self-explanatory, a valid domain value

2. Error (Red ?X’) ? the value belongs to the domain by there is a technical or syntax problem with the value. Such as a misspelling.

3. Invalid (Yellow Triangle)? the value does not belong to the domain

1-27-2012 8-43-19 AM

We are now ready to move on. Click “Finish” and then “Publish” to complete the knowledge discovery process.

To briefly recap, we walked through both the computer-assisted and interactive portions of the DQS knowledge discovery process. We explored the tools and features that are available and finally worked with the domain values that were found through the discovery process.

In my next post, we will walk through using the SportsTeamsKB to do cleansing both through the Data Quality Client and with the  DQS Cleansing Transform available in SSIS.

Resrouces

1. SportsTeamsDemo.xls – Data file to use for Knowledge Discovery

Till next time!

Chris

Advertisements

One thought on “Knowledge Discovery in DQS

  1. Hi,

    I have one field (CustomTargeting) in my input file( 3 sample fields as below).

    1)CustomTargeting=(location=omp_subheaderads;brand=golds_gym;categoryid=1078244;login=no;price=15to25;ptype=product_page;sku=0001871359124;visitorid=65540641651)
    2)CustomTargeting=(location=bottomleft;login=no;ptype=search;search=tinted_sunscreen;visitorid=29345565300)
    3)CustomTargeting=(location=bottomleft;categoryid=1030320;login=yes;ptype=category;visitorid=60395502519)

    input file all fields are not present in all the records( even order also not in sequence, it is varying record to record)

    Output should be as below all call columns should be present. if one column missing then it should be null. bez it is going to create table. I m looking in pig.

    location value
    brand
    customerld
    last
    category_id
    login
    price
    page_type
    sku
    vid
    serch
    tsource

    excepted output:

    omp_subheaderads,golds_gym,,,1078244,no,15to25,product_page,0001871359124,65540641651,,,,
    bottomleft,,,,,no,,search,,29345565300,tinted_sunscreen,,,
    bottomleft,,,,1030320,yes,,category,,60395502519,,,,

    Could you please help on the how to write pig script or UDF

    Thanks in advance.

    Thanks & Regards,
    Siva

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s