Installing Mahout for HDInsight on Windows Server

I am passionate about analytics, data mining, and machine learning, and I think most organizations do too little in this arena. That’s why one of my favorite parts of the Hadoop ecosystem is Mahout.

Mahout is a scalable machine learning library that ships with multiple out-of-the-box machine learning and data mining algorithms, including clustering, classification, collaborative filtering, and frequent pattern mining.

If you are using HDInsight in the cloud, Mahout comes pre-installed for your use. Unfortunately, if you are running a local HDInsight instance on Windows Server, you must deploy Mahout on your own.

While this may sound like a daunting task, fortunately underneath the covers HDInsight is a standard instance of Hadoop. Let’s take a look at what it takes to get Mahout up and running.

Step-by-Step

1. Download the zipped Mahout 0.7 distribution from the Apache website: http://www.apache.org/dyn/closer.cgi/mahout/

2. Extract the contents of the zip file to c:\Hadoop and, for simplicity, rename the folder to mahout-0.7

3. Now we are going to test the installation using the Simple Recommendation Engine demo: http://www.windowsazure.com/en-us/manage/services/hdinsight/recommendation-engine-using-mahout/

4. Follow the lab to generate the required files, or for expediency you can download them here:

mInput.txt

users.txt

5. Once you have downloaded the files, place them in the c:\temp\ directory on your HDInsight instance.
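For reference, RecommenderJob expects the ratings file as comma-separated userID,itemID,preference triples, and the users file as one user ID per line. Here is a minimal Python sketch that writes files of that shape; the sample values are invented for illustration and are not the lab’s actual data:

```python
# Write a tiny ratings file in the userID,itemID,preference format that
# Mahout's RecommenderJob consumes, plus a users file listing the user
# IDs we want recommendations for. All values here are made up.
ratings = [
    (1, 101, 5.0),
    (1, 102, 3.0),
    (2, 101, 2.0),
    (2, 103, 5.0),
    (3, 102, 4.5),
]

with open("mInput.txt", "w") as f:
    for user, item, pref in ratings:
        f.write("%d,%d,%.1f\n" % (user, item, pref))

with open("users.txt", "w") as f:
    for user in sorted({u for u, _, _ in ratings}):
        f.write("%d\n" % user)
```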

6. Open the Hadoop Command Line console by clicking the link found either on the desktop or on the Start menu.

7. The first step, as directed by the lab, is to copy the test files from the local file system into HDFS. Use the following commands to copy both text files to HDFS:

hadoop dfs -copyFromLocal c:\temp\mInput.txt input\mInput.txt

hadoop dfs -copyFromLocal c:\temp\users.txt input\users.txt

8. Browse HDFS and verify that the files now exist:

hadoop fs -ls input/


9. I won’t explain what the sample job is doing, since the lab referenced above does a good job of explaining that. We will simply use the sample job to verify that the Mahout distribution is configured and ready for use:

hadoop jar C:\Hadoop\mahout-0.7\mahout-core-0.7-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE --input=input/mInput.txt --output=output --usersFile=input/users.txt
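The -s SIMILARITY_COOCCURRENCE switch tells the job to score item pairs by how often the same user expresses a preference for both. As a rough conceptual illustration only (not Mahout’s actual implementation), the co-occurrence count behind that similarity can be sketched in Python with invented preference data:

```python
from collections import defaultdict
from itertools import combinations

# Toy preference data: userID -> set of items that user has expressed a
# preference for. Values are invented for illustration only.
prefs = {
    1: {101, 102, 103},
    2: {101, 103},
    3: {102, 103},
}

# Count how many users preferred both items of each pair; this count is
# the essence of the co-occurrence similarity used by RecommenderJob.
cooccur = defaultdict(int)
for items in prefs.values():
    for a, b in combinations(sorted(items), 2):
        cooccur[(a, b)] += 1

print(cooccur[(101, 103)])  # items 101 and 103 are co-preferred by users 1 and 2
```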


10. The job will take several minutes to run to completion. When the job completes, let’s dump the results to a text file in the temp directory:

hadoop fs -copyToLocal output/part-r-00000 c:\temp\output.txt
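Each line of the part-r-00000 output pairs a user ID with a bracketed list of itemID:score recommendations. A small Python sketch of parsing that layout follows; the sample line is invented, not actual job output:

```python
def parse_recommendation_line(line):
    """Parse one line of RecommenderJob output, which takes the form
    '<userID>\t[item:score,item:score,...]', into (userID, [(item, score), ...])."""
    user_part, rec_part = line.strip().split("\t")
    recs = []
    for pair in rec_part.strip("[]").split(","):
        item, score = pair.split(":")
        recs.append((int(item), float(score)))
    return int(user_part), recs

# Invented sample line in the output format described above
user, recs = parse_recommendation_line("3\t[103:4.5,102:4.0]")
print(user, recs)
```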


11. Optionally, to clean up the files used for the test, use the following commands to remove the output and temp directories from HDFS:

hadoop fs -rmr -skipTrash temp

hadoop fs -rmr -skipTrash output

That’s it. Your Hadoop instance now has Mahout support!

Till next time!

Chris
