Over the last three blog post we have walked through setting up, building and then using the knowledge discovery process to train our demo knowledge base. In this blog post we will actually put our knowledge base to use by using it to do data cleansing. We will look at two ways DQS can perform the cleansing task. The first is through a Data Quality Project within the Data Quality Client and the second is through the DQS Cleansing transform available in SSIS on SQL Server 2012.
If you need a recap of what we have walked through so far I will point you to my prior blog post.
So let’s dive in and do some data cleansing…
DQS Blog Series Index
Part 2: Building Out a Knowledge Base
Part 3: Knowledge Discovery in DQS
Part 4: Data Cleansing in DQS
Part 5 : Building a Matching Policy in DQS
Part 6: Matching Projects in DQS
Data Cleansing in the Data Quality Client
Before we get started launch the Data Quality Client and connect to the DQS Server. The process we will walk through consists of setting up a project and then completing a four step wizard. The wizard walks you through mapping, server cleansing, interactive cleansing and then exporting the results. It should be noted that cleansing in DQS is not an intrusive process. This means that the source data is never changed. It is left up to either the data steward or IT Pro to implement the changes identified by DQS.
Our focus for the remainder of this section will be on the Data Quality Projects section within the Data Quality Client. To get started click the “New Data Quality Project” button. On the Data Quality Projects screen you will see prompts for both Name, Description and the Knowledge Base to use for this project. Name the project “SportsTeamCleansingProject” and select “SportsTeamsKB” from the Knowledge Base drop-down.
Start a new Data Quality Project
Data Quality Project Definition
Notice that when you select the knowledge base, the details that include the domains and composite domains are displayed for you to review. I also want to point out that there is an option to select an activity below the knowledge base details. Cleansing is selected by default and the Matching option has been disabled. The Matching option is disabled because we have not defined a matching policy for our knowledge base (this will be addressed in subsequent blog posts). Once you have define the project parameters, click “Next” to begin the cleansing wizard.
The first step wizard is nearly identical to what we saw in the knowledge discovery process. You have the same two options for bring data in (SQL Server and Excel File). You also have the same restrictions that were discussed (For more info go here). The only difference is that you do not have the ability to add domains and composite domains to the knowledge base.
Data Cleansing – Select a Source
Let us go ahead and Map the data we will be cleansing for the purposes of the demo. Select “Excel File” at the Data Source prompt and the select the “SportsTeamsBadData.xls” (The sample file is available in the Resources section). Once you have specified the sample file, select “SportsTeams$” as the Worksheet. The sample file again has a header row so leave the “Use first row as header” option checked.
The options you have available are again similar to the knowledge discover process. You can add and remove rows for the purposes of mapping and preview your sample data as required. Go ahead and map the Team, League, Stadium and Capacity domains to the appropriate source columns in the sample data.
Once you are finished, if you have met the requirements any composite domains that are within the knowledge base the “View/Select Composite Domain” button will become active. Not to be repetitive, but the same rules as we saw in the knowledge discovery process apply. This means that in order for a composite domain to be used, all domains within the composite domain must be mapped to source columns in the your sample data.
View/Select Composite Domain
To move to the next step in the wizard, click “Next”. On the Cleansing step, click the “Start” button to load the data to the DQS server and kick-off the cleansing process. Once the processing has been completed, the Profiler tab provides you with the results of the DQS cleansing. The source statistics includes an aggregated count of: records processed, correct, corrected, suggested and invalid records. Additionally, you can view the results broken down to either the composite or single domain level as applicable. After you have reviewed the results of the cleansing process, click “Next” to move to the interactive cleansing step of the wizard.
Cleanse Step – Start
Cleanse Step – Finish
On the “Perform interactive data cleansing” screen, you start with a list of both single and composite domains on the left and the results for the selected domain on the right. Starting with the SportsTeams composite domain, you will notice that we have five tabs: Suggested, New, Invalid, Corrected and Correct.
Interactive Cleansing – Suggested Tab
The Suggested tab contains records identified by the DQS Server algorithms that fell below the confidence interval configured for automatic correction but was above the minimum threshold for significance that requires a data steward to review the record. In the case of our demo, we have one correct that affects two records that have been suggested based on the DQS Speller. As we can see, DQS proposes we change “Bucsaneers” to “Tampa Bay Buccaneers” and that the confidence in the match is at 75%.
We are also given some additional information which is useful when working in the project. For each record we can see the reason DQS proposed the correction as well as the raw source record. If multiple records existed with the same value being suggested to change, they would be grouped and visible in the lower grid.
You have two options for approving or rejecting the suggestions. The first is to approve the entire group or records (upper grid) the second is to approve individual records (lower grid). If you approve the record, the record will be removed from the Suggested tab to the Corrected tab. If you reject the suggestion, the record will be moved to the Invalid tab.
Interactive Cleansing – New Tab
The New tab contains records which represent new values in the domain. Looking at the results of our demo we see one new value is proposed. Similar to the suggestion tab, you can approve or reject these changes.
Interactive Cleansing – Invalid Tab
The Invalid tab highlights records that failed either domain or composite rule validation and displays the reason for the failure. If a record fails a single domain rule, then composite domain rules are not evaluated. In the demo we see that several records failed the capacity domain rule and the league acronym rule. We can override these failures and mark the records as correct if applicable.
Interactive Cleansing – Corrected Tab
The Corrected tab displays records that DQS were able to automatically correct according to the configuration and algorithms used by the DQS Server. In addition records that are accepted interactively are shown here. Our demo file has seven records that were corrected either as a result of the DQS speller or because of domain rules. The reason column provides info on why each record was corrected. The functionality is also similar to the Invalid tab in that we have an override option in which we can we can reject the corrected record.
Interactive Cleansing – Correct Tab
The final tab contains Correct records. These records are those who pass all applicable rules contained in the knowledge and whose domain values are known. If necessary, you can reject the match in the same way as records on the Corrected tab.
For the purposes of the demo, accept the suggested changes and decline the proposed new record. When you have finished interactively managing the records, click “Next” to advance to the final step of the wizard.
The final step to the cleansing process is exporting the results. In much the same way you have for loading data in, the two options you have for exporting is SQL Server and CSV file. Notice that you can preview the output data on the left. The preview contains the source data, the cleaned data and the cleansing info. Once you select the target for your export there are a couple of other options you have available.
Cleansing Project Export
The first option allows you to standardize the output. This will make use of the standardization options that you configured when you set-up your knowledge base (Capitalization, Number Format, etc.). The second allows you to export either the data only or the data and cleansing info as required.
To illustrate what happens, lets export our results to a csv file called “SportsTeamsClean.csv”. For the destination type, select “CSV File” from the drop-down. Next, navigate to your temp directory (or another directory of your choosing) and create the export file. Leave the “Standardize Output” option checked and make sure that the “Data and Cleansing Info” option is selected. Click the “Export” button to build the export file.
Let’s take a second to look over the export file. Prior to building the exported we select the option to include “Data and Cleansing Info”. The output as a result of this option contains five columns for each of our mapped domains. The columns are:
1. XXX_Source – The original source domain/column value
2. XXX_Output – The clean and standardized domain/column value
3. XXX_Reason – The reason the domain/column value was either valid or invalid
4. XXX_Confidence – The domain/column confidence percentage returned by the DQS server algorithms
5. XXX_Status – The domain/column processing status (i.e. Correct, New, Invalid, etc.)
In addition to the detailed processing info, a record status is returned which indicates the status of the record as a whole. If this level of data is not required for your purposes, you can select the “Data Only” option on the export screen. This option returns only the valid and cleansed records and leaves out the reference back to the source data as well as the DQS processing information.
Once you have completed your review of the export, click “Finish” to complete the Data Quality Project.
Before we look at DQS Cleansing in SSIS, let’s look at the configuration settings that have a role in cleansing process. Under the “Administration” section, select “Configuration”. On the General Setting tab, you will notice settings that allow you to control the values used for both the minimum score for suggestions and for auto correction. These are global settings and changing them impacts the behavior of all your cleansing projects.
DQS Admin Configuration
Data Cleansing in SSIS
The DQS Cleansing transform is provide as a standard component in SSIS 2012. The transform provides access to all the same functionality as is available within the Data Quality Project. Configuring the component is straight-forward if you are familiar with SSIS in general.
The first-step is to specify a connection to the DQS server. This transform requires a DQS connection manager. The only requirement to set this up is the DQS Server name. Once you have established the connection, you select the Data Quality Knowledge Base to use during the cleansing exercise. For demo purposes, select “SportsTeamsKB”. After you select the knowledge base, you will be able to preview the available domains.
DQS Cleanse Transform – Connection Manager
The Mapping tab is where you wire up the input columns to the domains in the knowledge. Select the from the available inputs (in our case, Team, League, Stadium and Capacity) and then map each column to the appropriate domain. You will notice that unlike in the Data Quality Project you have the option to rename the output columns as your needs and requirements may dictate.
DQS Cleanse Transform – Mapping
Finally, the Advanced tab has options which allow you to Standardize the output according the domain configuration, control whether DQS processing information is passed to the output and enable record level appended data when using a reference data provider.
DQS Cleanse Transform – Advanced
The DQS Cleansing transform has a single output. You can use a conditional split on the Record Status column to handle each status independently or as you see fit. The final point to make before we wrap up is to understand how this transform actually works.
Sample Implementation of DQS Cleanse Component
When you use this component, your data is being batched and sent off to the DQS server for processing. If you are in turn, using a reference data provider in the Azure Data Market, your data would then been sent off to the cloud. Understanding this is important so that your performance expectations are in line. You CAN NOT compare the performance of this transform to other transforms.
SportsTeamsBadData.xls – Sample dirty data to use for cleansing
To summarize, we covered the cleansing function that is available in both a DQS Data Quality Project and in the SSIS DQS Cleansing Transform. Continuing, we will spend time over the next couple of blog posts to set-up a matching policy for our knowledge and then build out a Data Quality Project to highlight the matching capabilities within Data Quality Services.
Till next time!