Making the Case for Statistical Semantic Search

As we sit on the cusp of SQL Server 2014, it seems a little odd to be writing a blog whose objective is to introduce File Table and Semantic Search. While both of these feature were new in SQL Server 2012 and both received quite a bit of attention I have found recently that they are still either misunderstood or simply overlooked. Regardless, either case is unfortunate because both features are very powerful and open up a number of possibilities in handling what is commonly referred to as unstructured data within SQL Server.

Semantic Shhhhmantics – What’s the deal?

If you’ve heard it once, you’ve heard it a thousand times, the term “unstructured” data is thrown around pretty liberally these days. Stepping back from the applicability of the term “unstructured” to most data, the fact that only about 20% of data is contained in relational databases means that whether unstructured or structured most data exists outside SQL Server (or your choice of RDMS) and in the wild.

So how do you manage this data? How do you incorporate it into your line of business applications? What about search? The common answer to these questions is the implementation of an enterprise search appliance which is no small feat depending on the expertise and resources within your organization.

A better solution and one that was delivered with SQL Server 2012, is the expansion of existing features (namely FileStream and Full-Text) to allow this data to be proactively managed and integrated for rich client experience all.

The File Table

While this post is not explicitly about the new FileTable feature, I would be remiss to not mention it since it’s an integral part to what makes Semantic Search so powerful. FileTable is essentially a table of files and/or directories that is built on top of FileStream. What makes this unique is that the FileTable supports Win32 APIs for file and directory maintenance and handle the database storage transparently. More simply put, your applications can drop files into an network share and the files are persisted (and managed) within a SQL Server table which is not only much easier that dealing with BLOBs or VARBINARY(MAX) columns it opens up a whole realm of possibilities to extract meaningful information from data that was previously just taking up space in your database..

Semantic Search vs. Full-Text Search

So now we have a way to easily store everything from pictures and videos, to PowerPoint, PDF and Word Documents within a SQL Server table. How do we go about making this new capability meaningful? The first and most obvious answer answer is to apply full-text index to make is searchable and while this is a good first step this is where most of the confusion that I encounter comes in, what is the difference between a Full-Text Search and (Statistical) Semantic Search.

Let’s dig into this. The core function of full-text search is to find key-words within a document or corpus of documents. Semantic search actually builds on this by adding the ability to extract meaning. Specifically there are three scenarios that are now possible:

  1. Extract key-phrases from a document
  2. Find similar or related documents
  3. Identify the key phrases that make two documents similar

For the remainder of this posts we will dig into examples of each.

Examples

In the following examples. I have not only enabled FileStream and created a FileTable, I have also set-up and configured my machine for Semantic Search (Office and Adobe IFilters). I have covered the steps required to get Semantic Search configured in a previous blog post (HERE). My table is simply called Documents and I’ve done nothing more than load a bunch of technical presentations (PowerPoint) and papers (Word & PDF).

Extracting Key-Phrases

In this first example, we can use text mining algorithms in semantic search to extract out a list of key phrases or relevant terms. While full-text allows us to search for terms (i.e. find all documents that contain the word multidimensional), semantic search will allow us to find all document where that term is not only present but is a key-term in document.

To demonstrate this, I can run a full-text search using CONTAINS and I find that in my corpus there are 8 documents that have the word multidimensional somewhere within their text. While this is certainly useful in many scenarios, it didn’t necessarily return documents about multidimensional models or modeling, only those that contained the word. Semantic Search takes it one step further by extracting keyword with a relevance score. Using the semantickeyphrasetable and filtering on the confidence score returned by the search, I find that of those 8 documents only 2 are truly relevant to multidimensional.

image image

Beyond searching for a document on a specific keyword, we can use the same semantickeyphrasetable function to return the set of meaningful keywords for one or more documents. In essence this gives us something of a tag cloud capability for classifying documents. In the query below, I used the function to extract the top 15 terms that describe the Securing the Tabular BI Semantic Model whitepaper.

image

Finding Similar/Related Documents

In the previous example, we used semantic search to find documents based on some keyword or search term. In this example we take it a step further by searching the corpus to find documents that are similar or are related in terms of content.

The semanticsimilaritytable function accepts a stream id for the source document and returns a list of of matched document stream ids with a score showing the strength of the match. In the query below, I searched my library of technical material to find documents that where similar to Identifying and Resolving MDX Bottlenecks SSAS whitepaper. The search found 10 matches that ranged in score from approximately a .40 to .15 with a higher score indicating a better match.

image

Identifying Key Similarities

Continuing the example above, if we were interested in knowing why our MDX whitepaper was matched to the SSAS Distinct Count whitepaper we could use the built in semanticsimilaritydetailstable function to return the set of key phrases that are common across both documents with their relevance scores.

In the example below, the query result in the identification of 36 common key phrases.

image

Conclusion

In this post, we looked at Statistical Semantic Search for SQL Server 2012 starting with the business case  before focusing in on three primary uses cases that were all made possible by FileStream and the Full-Text Index features you were most likely already knew about.

If this is your first introduction to Semantic Search, I hope it proved meaningful in helping you to understand the strength and power available in this out-of-the-box functionality. If you were already familiar with it and the benefit of Semantic Search was unclear, I hope know it is more clear.

Till next time!

Chris

One thought on “Making the Case for Statistical Semantic Search

  1. Chris, Thank you very much for this blog post. I referenced your SS setup blog and this blog to help me get a test up and running. I seem to have found some issues with the Semantic Search, and I would love your insight into them if possible. . It appears that even though some technical key words (ie. F16 or F-16) get full text indexed, and return results using FT search with CONTAINSTABLE, however, those technical key words are no where to be found when looking into the SEMANTICKEYPHRASETABLE for the same records. There is an editable STOPWORD table, is there also an editable KEEPWORD table? It is though semantic search only indexes words that it deem important….and leaves out a lot of technical terms….it would be nice to be able to train it. Lastly, is there a way to search the SEMANTICKEYPHRASETABLE for multiple key words, inflection, thesaurus, etc as you can in the Full Text side?

Leave a comment