Data Quality Services Best Practices

Microsoft recently released the Data Quality Services Performance Best Practices white paper (get it HERE), and it contains a good deal of useful information to be aware of and to take into consideration when planning your implementation of DQS. This post is essentially an overview of what the white paper discusses. I encourage everyone to take the time to download it and read through it thoroughly.

Expectations are Everything

If you have spent any time playing with DQS, you should at a minimum have a reasonable set of expectations regarding the performance profile of DQS. Microsoft has officially published the following baseline when using their hardware recommendations:

| Data Size | Discovery    | Cleansing     | SSIS Cleansing | Matching     |
|-----------|--------------|---------------|----------------|--------------|
| 10K       | 1 minute     | 1 minute      | 2-5 minutes    | 1 minute     |
| 100K      | 5-10 minutes | 2-7 minutes   | 40-75 minutes  | 5-10 minutes |
| 1M        | 1.5-3 hours  | 0.5-1.5 hours | 7-12 hours     | 1-4 hours    |

* Reproduced from Data Quality Services Performance Best Practices (March 2012)

** Environment – multi-core processor, 8-16 GB memory, 1-2 physical disks, 1 Gb LAN, 64-bit Windows Server 2008 R2

The numbers are pretty eye-opening, especially when the SSIS Cleansing performance is considered. That being said, let's define some baseline expectations.

First, and the most straightforward expectation to set, is that everything in DQS scales linearly, both with the number of rows you are processing and with the number of domains you have in your knowledge base.

Getting that out of the way allows us to move to what is probably the biggest misconception in terms of expectations that I've come across in my interactions with folks at various community events. This misconception stems mostly from organizations dealing with data quality problems. If you have a data quality problem, throwing all the data in your data warehouse at DQS is not only NOT going to solve the problem, it's probably a bad idea.

The crux of the issue is that data quality is not just a technology problem. It has to involve people and process if the problem is going to be addressed and corrected. That being said, if you have a significant data quality problem in your data warehouse, implementing an iterative approach to cleaning it up will get you further, but again, you're simply addressing the symptoms and not fighting the disease.

The last expectation to set is, in my opinion, the most important. When you select DQS as a tool in your data quality toolset, you get out of it what you put into it. Simply put, the more time you spend developing your knowledge base, the better it will perform and the more it will serve you. A well-developed knowledge base is said to be mature, and a mature knowledge base is one that credibly covers a significant portion of the population of values it is validating. This is critical to DQS performance: the more records DQS can find an exact match for in the knowledge base (both correct values and common errors), the fewer of the more expensive fuzzy-type matches it must perform to do its work.

With a nod made to setting expectations, let's move on to some actual best practices.

Hardware Best Practices

The most important thing to understand when it comes to hardware recommendations is that DQS is processor- and memory-intensive and that it can take advantage of as many cores as are available to SQL Server. It was designed to scale up, and the scale-up factor published by Microsoft is approximately 1.6 (for the scalable operations: discovery, project-based cleansing and matching).

The specific recommendations are:

  • Minimum specs are a multi-core processor, 8-16 GB of memory, 1-2 physical disks and a 1 Gb LAN
  • If you are working with 1 million records or more, the recommendation is essentially to double the processor, memory and disk specifications.
  • Enable Hyper-Threading since it will improve scalable operations by a factor of 1.25
  • Disks should be configured as either RAID-0 or RAID-5, and using solid state drives will double the speed of operations compared to serial-attached disks.
  • DQS uses the SQL CLR extensively. The DQS AppDomain can be unloaded if your server is under memory pressure. Watch the SQL Server ERRORLOG for DQS AppDomain unloads and reloads, and add memory to the system to resolve the issue (a query sketch follows this list).
  • DQS Matching makes extensive use of tempdb. The recommended size for tempdb when running a match against one million records or more is 200GB (also shown in the sketch below).
  • Be aware that in some situations DQS can move large amounts of data across the network. This typically happens when using the SSIS DQS Cleanse transform, on large cleansing or matching projects, and when importing or exporting data.
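To make the last few items actionable, here is a minimal T-SQL sketch. It assumes the default tempdb logical file name (tempdev) and assumes the DQS AppDomain messages contain the strings being filtered on; adjust the names, sizes and filters to your own instance.

```sql
-- Scan the current SQL Server error log for AppDomain unload/reload messages,
-- a sign of memory pressure. Parameters: 0 = current log, 1 = SQL Server log,
-- followed by two filter strings (the filter strings are assumptions).
EXEC master.dbo.xp_readerrorlog 0, 1, N'AppDomain', N'unload';

-- Pre-grow tempdb before a large matching run so it is not auto-growing in
-- small increments mid-project. 'tempdev' is the default logical name of the
-- primary tempdb data file; yours may differ.
ALTER DATABASE tempdb
MODIFY FILE (NAME = N'tempdev', SIZE = 200GB, FILEGROWTH = 1GB);
```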

Design and Implementation Best Practices

While hardware plays a large role in the performance you will get out of DQS, there are other recommendations that are equally important and worthy of consideration as you dive into DQS.

First, in terms of overall knowledge base design, Composite Domains should only be used when necessary and should only include domains that are relevant, to limit the amount of redundant information that is defined. Consider language- and geography-specific domains: splitting out domains can result in smaller domains and simpler validation rules. While it makes working with the knowledge base more complicated, it can have a fairly profound impact on performance.

Domain rules are also an important consideration. Simple business rules are implemented using T-SQL queries and have a minimal impact. Domain rules that use regular expression patterns bring the SQL CLR into play, and although it looks like Microsoft tried to minimize the performance hit, it's still worth considering as you design your knowledge base. A rough illustration of the difference follows below.
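To illustrate the difference (this is an illustration of the principle, not DQS's internal implementation, and the table and column names are made up), a list-based rule amounts to a cheap set-based lookup, while a pattern rule has no native T-SQL equivalent and falls back to the CLR:

```sql
-- Hypothetical example: a "value is in the domain list" check is an ordinary
-- set-based lookup and stays cheap even at volume.
SELECT s.StateCode
FROM dbo.StagingCustomers AS s                 -- hypothetical staging table
WHERE s.StateCode NOT IN (SELECT d.DomainValue
                          FROM dbo.ApprovedStateCodes AS d);  -- hypothetical domain list

-- A pattern rule such as "value matches ^\d{5}(-\d{4})?$" (a US ZIP code) has
-- no native T-SQL equivalent, so DQS evaluates it through the SQL CLR, which
-- is where the extra cost comes from.
```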

Secondary to that, make sure that your knowledge bases are mature and robust, in that they not only contain a majority of the possible correct domain values but also the common errors and synonyms which may be found during the validation process. The benefits of a mature knowledge base are two-fold in that it impacts both cleansing and matching activities.

In both activities, if DQS is not able to find an exact match in the knowledge base, it uses a more expensive fuzzy-type matching algorithm to try to find similar matches. Exact matches are cheap, and the more exact matches DQS finds, the better it performs and the happier you are.

To help with the maturation process mentioned above, don’t forget to enrich your knowledge base by importing the new and error values identified as part of interactive cleansing projects. You do this in the knowledge management activity, using the Import project values option.

In terms of matching, optimization is primarily accomplished through matching rule design. Microsoft highly recommends the inclusion of either a prerequisite or "Exact" match field within each matching policy rule. Designing your rules in this way allows the DQS matching engine to work on smaller chunks of data. It's also worth pointing out that, for both best results and best performance, you should run your data set through a cleansing activity before any matching takes place; the matching should then be done against the clean, standardized output.

The last point to consider is how the implementation of the SSIS DQS Cleanse transform affects your package design. When you run a cleansing project in the Data Quality Client, execution occurs in parallel and multiple processors are used to handle the processing. By design, this is not the case in SSIS. The DQS Cleanse component is designed to run synchronously in the SSIS data flow and uses only a single processor on the DQS server.

Before you grab your torches and pitchforks and start heading up to Redmond, this was done by design and for an arguably okay reason: batch cleansing is considered a background task, and it should minimally impact the higher-priority interactive cleansing activities. As noble as that seems, Microsoft clearly understands the issue this presents and provides a work-around in the whitepaper.

The work-around to this built-in limitation is to split the data flow, using either a Conditional Split, a Script Component or a component like the Balanced Data Distributor, to feed multiple DQS Cleanse transformations, up to the number of processors on the box you are working on. You can then use a Union All to join the flows back together and continue your processing. Microsoft documents the performance gain from this work-around as a 2.5-3x improvement. A rough sketch of one way to drive the split follows below.
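As a sketch of one possible approach (not the whitepaper's exact recipe; the table and column names are hypothetical), you can compute a bucket number in the SSIS source query and have the Conditional Split route on it, with one DQS Cleanse transform per branch and a Union All at the end:

```sql
-- Source query for the SSIS data flow. NTILE(4) assigns each row to one of
-- four roughly equal buckets; the Conditional Split then routes on
-- Bucket == 1, Bucket == 2, and so on, with one DQS Cleanse transform per branch.
SELECT c.CustomerID,                      -- hypothetical source columns
       c.AddressLine1,
       c.City,
       c.StateCode,
       c.PostalCode,
       NTILE(4) OVER (ORDER BY c.CustomerID) AS Bucket  -- 4 = number of branches
FROM dbo.StagingCustomers AS c;           -- hypothetical staging table
```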

Parting Thoughts

Taking the time to set up your DQS implementation will go a long way towards maximizing the performance and utility provided by DQS. I hope this overview of the performance best practices developed by Microsoft was useful and helps get you on your way.

References

Data Quality Services Performance Best Practices – SQL Server Technical Article, Ittai Gilat, March 2012

Till next time!!

Chris
