In Part 1 of this blog series we built a foundation by introducing the various techniques that can be used to generate recommendations for products or items to your users. In this post, we begin looking at Mahout as a platform for building a recommender, including setting up a data model, common methods for calculating similarity and, finally, the algorithms used to generate recommendations.
Understanding Recommendations in Mahout
Mahout is a library of machine learning algorithms whose recommender component grew out of a project called Taste. It supports both non-distributed, real-time processing (without Hadoop) and distributed, batch processing (on Hadoop). According to the documentation on the Mahout website, four primary use-cases are supported:
- Collaborative Filtering (i.e. Recommendation mining using user behavior)
- Clustering (i.e. grouping similar documents)
- Classification (i.e. assigning uncategorized documents to predefined categories)
- Frequent Itemset Mining (i.e. Market Basket Analysis)
We will focus strictly on the first use-case, Collaborative Filtering, in this blog series and will start by generating recommendations based on beer ratings from the BeerAdvocate data set (http://snap.stanford.edu/data/web-BeerAdvocate.html).
Collaborative Filtering in Mahout uses a very simple data model referred to simply as preferences. Both user-to-user and item-to-item recommendations are generated using this same basic data model, which consists of a User ID, an Item ID and a Preference Score.
- User ID – an integer key value for the User
- Item ID – an integer key value for the Item
- Preference Score – a decimal value, where higher is better, that represents the user’s implicit or explicit preference for the item
These three data points are loaded into a DataModel object. The DataModel object is a collection of Users, each of which has a collection of Preferences. Mahout supports a number of different built-in methods for loading the data model, including file-based and JDBC. For our purposes, data values are loaded by way of a CSV data file.
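To make the shape of the data concrete, here is a minimal Python sketch of the same idea: a collection of users, each holding a collection of item preferences. This is not Mahout's DataModel API; the function name `load_preferences` and the sample rows are made up for illustration, and the sketch assumes a header-less CSV with one `userID,itemID,preference` triple per line.

```python
import csv
import io
from collections import defaultdict


def load_preferences(lines):
    """Parse userID,itemID,preference rows into {user_id: {item_id: score}}."""
    model = defaultdict(dict)
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        user_id, item_id, score = int(row[0]), int(row[1]), float(row[2])
        model[user_id][item_id] = score
    return dict(model)


# Hypothetical sample rows, not taken from the BeerAdvocate data set.
sample = io.StringIO("1,101,4.5\n1,102,3.0\n2,101,4.0\n")
model = load_preferences(sample)
print(model)
```

Each user maps to a dictionary of item preferences, which mirrors the "collection of Users, each with a collection of Preferences" description above.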
Preparing the data requires a quick utility app, which I wrote in C# to process the dataset and extract a list of Beers, a list of Users and a matrix of user ratings (preferences). The dataset contains a unique integer beer id, which can be used as the Item ID. Users, however, are identified by username, so it is necessary to create and assign a unique User ID to each user. A full walkthrough of the utility is beyond the scope of this blog, so the source code is provided below.
Collaborative filtering treats each item as a black box, meaning that it knows nothing about the attributes of an item or a user. The result of this is that recommendations are generated based solely on the user’s preference for the item. For this to work, we must determine how similar two users or two items are to one another in order to generate meaningful recommendations.
Luckily for us, Mahout supports multiple statistical similarity metrics out of the box. At a high level, similarity metrics determine either the distance between users or items or the likelihood that their preference scores will be similar. For now, we will spare you the math behind these calculations and introduce only the options that are available in Mahout.
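The user ID assignment step can be sketched in a few lines. The snippet below is written in Python rather than C# and is not the utility described above; the function name `assign_user_ids`, the second username and the rating tuples are all hypothetical. It simply hands out sequential integers in first-seen order and emits rows in the User ID, Item ID, Preference Score layout.

```python
def assign_user_ids(usernames):
    """Map each distinct username to a sequential integer ID, in first-seen order."""
    ids = {}
    for name in usernames:
        if name not in ids:
            ids[name] = len(ids) + 1
    return ids


# Hypothetical (username, beer_id, rating) tuples standing in for parsed reviews.
ratings = [("ActualAir", 101, 4.5), ("hopfan", 101, 4.0), ("ActualAir", 102, 3.0)]

user_ids = assign_user_ids(name for name, _, _ in ratings)
rows = [f"{user_ids[name]},{beer_id},{rating}" for name, beer_id, rating in ratings]
print(rows)
```

Keeping the username-to-ID mapping around is worthwhile, since it is needed later to translate recommendation results back into usernames.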
|Similarity Metric|Description|
|---|---|
|Pearson Correlation|Measures the tendency for the preference values of two users or items to move proportionally together.|
|Euclidean Distance|Measures the distance between two users or items.|
|Spearman Correlation|Variant of the Pearson Correlation that uses the relative rank of the preference values.|
|Cosine|Measures the angle between the preference vectors of two users or items.|
|Tanimoto Coefficient|Ignores preference values and looks at the overlap (intersection relative to union) of the items rated by two users.|
|Log-Likelihood|Similar to Tanimoto in that preference values are ignored. It measures how unlikely it is that the overlap between two users is due to chance.|
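For readers who want a feel for what a few of these metrics compute, here is a rough Python sketch of three of them over dictionaries of `{item_id: preference}`. These are textbook formulations, not Mahout's implementations; in particular, turning a Euclidean distance into a similarity with `1 / (1 + d)` is just one common convention and is an assumption here. The sample users are invented.

```python
import math


def pearson(a, b):
    """Pearson correlation over the items both users have rated."""
    common = sorted(set(a) & set(b))
    n = len(common)
    if n < 2:
        return 0.0
    xs = [a[i] for i in common]
    ys = [b[i] for i in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined when one side never varies
    return cov / math.sqrt(var_x * var_y)


def euclidean_similarity(a, b):
    """Distance over co-rated items, mapped to (0, 1] so that closer means higher."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    distance = math.sqrt(sum((a[i] - b[i]) ** 2 for i in common))
    return 1.0 / (1.0 + distance)


def tanimoto(a, b):
    """Share of rated items the two users have in common; values are ignored."""
    union = set(a) | set(b)
    if not union:
        return 0.0
    return len(set(a) & set(b)) / len(union)


# Hypothetical preferences for two users.
u = {101: 5.0, 102: 3.0, 103: 4.0}
v = {101: 4.0, 102: 2.0, 104: 5.0}
print(pearson(u, v), euclidean_similarity(u, v), tanimoto(u, v))
```

Note how the two users correlate perfectly under Pearson (their ratings move together) while Tanimoto only sees that they share two of four rated items.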
A Beautiful Day in the Neighborhood
When you are working with a user-to-user recommender, the concept of a neighborhood is introduced. These neighborhoods group together similar users (as identified by your selected similarity metric) using a nearest-neighbor or threshold strategy. The practical effect is that, instead of looking at the entire population of users to generate recommendations, the algorithm looks only at the user's neighborhood. This significantly improves performance in a user-to-user scenario.
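The two neighborhood strategies can be sketched as follows. This is an illustration in Python, not Mahout's neighborhood classes; the function names `nearest_n` and `threshold_neighborhood`, the choice of a Tanimoto-style similarity and the sample data are all assumptions made for the example.

```python
def tanimoto(a, b):
    """Overlap of rated items, used here only as a stand-in similarity metric."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


def nearest_n(user, model, similarity, n):
    """The n users most similar to `user`, most similar first."""
    scored = [
        (similarity(model[user], prefs), other)
        for other, prefs in model.items()
        if other != user
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by user ID
    return [other for _, other in scored[:n]]


def threshold_neighborhood(user, model, similarity, minimum):
    """Every user whose similarity to `user` is at least `minimum`."""
    return sorted(
        other
        for other, prefs in model.items()
        if other != user and similarity(model[user], prefs) >= minimum
    )


# Hypothetical preference data: {user_id: {item_id: score}}.
model = {
    1: {101: 5.0, 102: 3.0, 103: 4.0},
    2: {101: 4.0, 102: 2.0},
    3: {104: 5.0},
    4: {101: 3.0, 104: 4.0},
}
print(nearest_n(1, model, tanimoto, 2))
print(threshold_neighborhood(1, model, tanimoto, 0.5))
```

A nearest-N neighborhood always has a fixed size, while a threshold neighborhood can be empty or very large depending on how the cut-off compares with the similarity scores in your data.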
Collaborative Filtering Algorithm
The process of generating user-to-user recommendations is best illustrated in the following pseudo-code:
for each item i that user u has no preference for yet
    for each user v that has a preference for i
        compute similarity s between u and v
        update a running average of v's preference for i, weighted by s
return the top-ranked items i (by weighted average)
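The pseudo-code translates fairly directly into Python. The sketch below is an illustration only, not Mahout's recommender: it iterates over users in the outer loop rather than items (which produces the same weighted averages), plugs in a Tanimoto-style similarity as a stand-in metric, and uses invented data. The names `recommend` and `tanimoto` are hypothetical.

```python
def tanimoto(a, b):
    """Overlap of rated items, used here only as a stand-in similarity metric."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


def recommend(user, model, similarity, top_n=10):
    """Estimate preferences for unseen items as similarity-weighted averages."""
    seen = model[user]
    totals, weights = {}, {}
    for other, prefs in model.items():
        if other == user:
            continue
        s = similarity(seen, prefs)
        if s <= 0:
            continue  # dissimilar users contribute nothing
        for item, score in prefs.items():
            if item in seen:
                continue  # only items u has no preference for yet
            totals[item] = totals.get(item, 0.0) + s * score
            weights[item] = weights.get(item, 0.0) + s
    ranked = sorted(
        ((totals[item] / weights[item], item) for item in totals),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [(item, estimate) for estimate, item in ranked[:top_n]]


# Hypothetical preference data: {user_id: {item_id: score}}.
model = {
    1: {101: 5.0, 102: 3.0},
    2: {101: 4.0, 102: 2.0, 103: 4.0},
    3: {101: 5.0, 103: 2.0, 104: 2.0},
}
print(recommend(1, model, tanimoto, top_n=2))
```

In this toy data, user 2 is far more similar to user 1 than user 3 is, so user 2's rating of item 103 dominates its estimated score.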
For small datasets this algorithm may be fine, but the glaring issue is that each user is compared to every other user, which can be problematic in datasets consisting of millions of users. This is where the concept of the neighborhood becomes useful: instead of looking at all users, only the subset of similar users in the neighborhood is considered.
Generating Recommendations with Mahout
While it’s possible to work with Mahout in non-distributed mode using a bit of Java code, we will focus strictly on using Mahout in distributed mode on HDInsight. Before getting started, you need to move the beerratings.csv file into HDFS and create an output directory for the batch job.
The recommendation job uses the built-in recommender jobs found in mahout-core-0.7-job.jar. To run the job, you will need to specify the following parameters:
- -s: Your chosen similarity metric
- --input: HDFS path to the input file containing the user/item/preference data
- --output: HDFS path for the job results
Using the most common recommender job (org.apache.mahout.cf.taste.hadoop.item.RecommenderJob), the command to kick off the job is below:
hadoop jar C:\Hadoop\mahout-0.7\mahout-core-0.7-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_PEARSON_CORRELATION --input=/user/Administrator/Beer/Input/beerratings.csv --output=/user/Administrator/Beer/Output
Once the job starts, a series of MapReduce jobs is run to produce a series of intermediate outputs, including similarity and preference matrices, among others. You can view the outputs in your default HDFS temp directory, which for me happens to be /user/Administrator/temp.
When the job finishes, which on my machine (single-node, 4-core HDInsight instance in a VM) took about 45 minutes, the output of the job is the top 10 beer recommendations for each user in the data set. You can control which users you generate recommendations for by specifying a users file (the --usersFile option) when kicking off the job.
You can download the results from HDFS using the copyToLocal option. After downloading the results, we can analyze the entire dataset and the output to find that, for a user named ActualAir who rated the beers on the left, the beers on the right were recommended based on his or her similarity to other users.
That’s it! You’ve created your first simple recommendation engine. To review, we looked at Mahout, its data requirements, its similarity metrics and how the collaborative filtering user-to-user algorithm works to produce meaningful recommendations. In future posts we will look at how you can evaluate and fine-tune Mahout algorithms, as well as techniques for integrating a Mahout recommendation engine into an existing system.
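Once the results are on your local disk, a short script makes them easier to analyze. The Python sketch below assumes each output line has the form `userID<TAB>[itemID:score,itemID:score,...]`, which is my understanding of what the item-based RecommenderJob writes; verify this against your own output before relying on it. The function name `parse_recommendations` and the sample line are hypothetical.

```python
def parse_recommendations(line):
    """Split one output line into (user_id, [(item_id, score), ...])."""
    user, _, items = line.strip().partition("\t")
    pairs = []
    for entry in items.strip("[]").split(","):
        if not entry:
            continue  # user with an empty recommendation list
        item, _, score = entry.partition(":")
        pairs.append((int(item), float(score)))
    return int(user), pairs


# Hypothetical line in the assumed output format.
sample = "7\t[1904:4.75,276:4.5]"
user_id, recommended = parse_recommendations(sample)
print(user_id, recommended)
```

From here, joining the user and item IDs back to the username and beer lists produced during data preparation gives a readable view of each user's recommendations.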
Till next time!