In this article I’m going to show how to set up a machine learning algorithm to classify documents using a Big Data analytics cluster running in Windows Azure (HDInsight).
Azure HDInsight is an easy and flexible way to provision an analytics cluster in the cloud for Big Data analysis. A cluster consists of 1 main node, 1 job tracker node and any number of compute nodes (workers) that do the heavy lifting. The cluster is backed by the open source Big Data platform Hadoop. Hadoop is written in Java and is an extensible, open platform for analyzing large data sets in a fast and resilient fashion. It is built on two pieces: the Hadoop Distributed File System (HDFS), a distributed file system similar to Google’s GFS, and the MapReduce programming model, which distributes (maps) work among any number of cluster nodes and collects (reduces) the results into a single output (usually stored in HDFS).
Mahout is an extension library built on top of Hadoop which enables Machine Learning algorithms to run in the cluster.
I’m assuming you have a Windows Azure subscription. If not, you can get free access to Azure with an MSDN subscription (which you should be able to obtain if you are a student), or you can get a 1-month free trial here.
Setting up the Windows Azure HDInsight cluster
As of August 2013 you need to sign up for the preview features of Azure, because HDInsight is not part of the standard Azure package. This may change in the near future when HDInsight is completed. The steps to do so are outlined here, but I’ll repeat them:
· Try to create an HDInsight data service in the Windows Azure Management Portal. It will be grayed out, and instead you should see a link to ‘preview features’. Follow that link and sign up for HDInsight.
After you sign up and get a completed notification, you are ready to provision an HDInsight cluster, but first we need to set up a dedicated storage account (which is not geo-replicated).
· Create a new storage account, making sure to uncheck the geo-replication box.
Now we provision our cluster
· Start provisioning a new service and select HDInsight cluster under the data services category
· Pick the storage account you created in the previous step
· Select a size for the cluster (small will do just fine) and a strong password
· Click create
A note here: I’ve encountered an error in this step, where provisioning fails on the second operation. I just wait a couple of minutes and retry, and things work out fine. I think this happens because I try to provision the cluster too soon after enabling the preview feature, but I’m not certain.
Logging onto the cluster
After the cluster is created you can find it in its own category in the Windows Azure Management Portal under HDInsight. Select your cluster and click ‘Manage Cluster’. This will bring you into an HDInsight-specific portal which will look something like this:
Click the ‘Remote Desktop’ tile which will open an *.rdp file to connect to your main node.
Running Hadoop commands
Once you log onto the main cluster node you should find a shortcut on the desktop by the name ‘Hadoop Command Line’. Click it to start a Hadoop prompt, which will show you some high-level help:
Hadoop is the engine behind our machine learning example, so it makes sense to explore it a bit. However, I won’t go into details because my knowledge is limited at this point.
The fundamental functionality of Hadoop has to do with interacting with the file system. Keep in mind that the cluster has a completely separate file system (HDFS) which is not visible to Windows (it’s backed by the storage account you created before provisioning the cluster). Hadoop exposes commands for working with HDFS, which you invoke using:
Hadoop fs -<command> <args>
File system commands have Unix names (like ls, rm, cat, etc). So to upload a file from the local disk to HDFS, you simply execute:
Hadoop fs -put local_folder\local_file hdfs_folder/hdfs_file
This will upload a local file (or folder) to HDFS and the data is now immediately available to the compute nodes. For more info about file system commands check the documentation.
Note: The path formats differ between local paths and HDFS paths. HDFS uses the Unix path format, while the local file system of course uses the Windows path format.
Another thing to keep in mind is that, for performance reasons, files in HDFS are immutable: once data is placed on the file system you cannot edit it. If you did something wrong, you need to delete the files and then recreate them.
The last thing we need to know about Hadoop is that you run jobs using the ‘jar’ command, which runs functionality exposed by Java packages in Hadoop. One example is a word counting job that counts the occurrences of words in a text file. Create a new test file named mytextfile.txt and type something into it, then run the following steps:
Hadoop fs -put mytextfile.txt test/input.txt
Hadoop jar hadoop-examples.jar wordcount test/input.txt output
Hadoop fs -cat output/*
This will create a job that runs in the cluster, scans the uploaded copy of ‘mytextfile.txt’ for all words, and returns a list of words and their respective number of occurrences in the file. In my case the input and output look like this:
The way to interpret the output is that the word ‘a’ occurred once, ‘hello’ occurred twice, etc. The output of this command is a perfect example of a ‘vector’ where each word is a dimension in the vector and the occurrence count is the value in that dimension of the vector. As we shall see later, vectors are a recurring theme in Hadoop – large data sets are almost always represented as vectors and matrices.
This is of course a trivial example, and it is complete overkill to use a data analytics cluster to run this job. However, the same principles apply to Big Data analysis. It just takes a lot longer, and that’s why it’s nice to have a cluster helping you out.
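To make the map and reduce steps behind the word count more concrete, here is a minimal local sketch in Python. It is not how Hadoop implements the job; the function names (map_phase, shuffle, reduce_phase) and the sample lines are made up for illustration, and everything runs in a single process instead of on a cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group the emitted values by key, between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    """Reduce: collapse all the values for one word into a single total."""
    return word, sum(counts)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())

# Hypothetical input, one string per line of the text file.
counts = word_count(["hello world", "hello this is a test"])

# Reading the result as a vector: one dimension per word, the count as value.
vocabulary = sorted(counts)
vector = [counts[word] for word in vocabulary]
print(vocabulary)
print(vector)
```

On a real cluster the map calls would run on different workers, each over its own slice of the input, and the framework would take care of the grouping step.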
Machine Learning in Azure
Time to move on to Machine Learning! The first step is to install Mahout onto the cluster. Go here, then choose the closest mirror to you and open the folder 0.8. You should see the following
Download 'mahout-distribution-0.8.zip', copy it to the cluster’s main node using RDP (Windows copy/paste over remote desktop works), unzip it and place the files under 'C:\apps\dist\mahout-0.8'.
Now that Mahout is in place, we can start using it for Machine Learning. The example I’m going to run through here is a ‘classification’ model. This is a model that takes incoming documents (think of them in general terms: a document could be an email, a log file, a web request, etc.) and tries to predict something about each document. What this ‘something’ is, is entirely up to us. For emails the classic example is to predict whether an email is spam or not (we call that spam or ham). For web requests we could try to classify whether the request is legitimate or a denial-of-service (DoS) request sent from an attacker. Whatever it is we are trying to predict, the general idea is that we must provide the algorithm with a training data set in which we have told it, for each document, what the outcome was (e.g. spam or ham), so that the algorithm can learn from it.
But what do we get out of that exactly? One document on its own is not going to be very helpful, but the premise of machine learning is that you feed the model tons of training data (this is where machine learning ties into Big Data), and by churning through all this data, the model will detect hidden patterns in the data that can be used to predict the outcome of future documents (without knowing if they are actually spam or not ahead of time).
So how does this work? Let’s try to break it down. For reference I have used link  in the bottom to get started on this, but that introduction is based on Mahout 0.5 which is outdated by now, so I would not suggest following it.
We need to perform the following steps for making a classification algorithm:
1. Upload the input data that we want to categorize to HDFS
2. Prepare the data by converting it to input vectors (more on that later)
3. Train the machine learning algorithm
4. Test the machine learning algorithm
Creating the data set
We need a data set to train on. You could try using your own, but I would suggest running through this exercise with the sample provided by Spam Assassin, as I’m doing here.
On your local machine, create a new directory named ‘dataset’. Download the following files: ham and spam, and extract them under ‘dataset\ham’ and ‘dataset\spam’ respectively. (These are ‘*.tar.bz2’ files, so unfortunately you need a tool like WinZip or 7-Zip to extract them.) Now zip up the entire ‘dataset’ folder and copy it onto the main cluster node (use RDP again). Remote into the main node, create a ‘work’ folder under ‘c:\apps\dist\mahout-0.8\’ and extract your data files there.
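If you happen to have Python on your local machine, the standard library can extract ‘*.tar.bz2’ archives without a separate tool. This is only a sketch of an alternative to WinZip or 7-Zip; the archive file names in the usage comment are hypothetical, so substitute whatever your downloads are called.

```python
import os
import tarfile

def extract_archive(archive_path, target_dir):
    """Extract a .tar.bz2 archive into target_dir and return the member names."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:bz2") as archive:
        names = archive.getnames()
        archive.extractall(target_dir)
    return names

# Example usage (hypothetical file names):
# extract_archive("ham.tar.bz2", os.path.join("dataset", "ham"))
# extract_archive("spam.tar.bz2", os.path.join("dataset", "spam"))
```

Only extract archives from sources you trust, since a tar file can contain paths that point outside the target folder.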
Now we need to upload the data set to HDFS. Start a ‘Hadoop Command Line’ prompt and enter:
Hadoop fs -put dataset dataset
This can take a while. After it completes, your data is now stored in HDFS in a folder named ‘dataset’.
Converting data set to input vectors
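Now we need to take all of these input text files and convert them to something Mahout’s training algorithms can use: they need to be ‘vectorized’. The commands in the rest of this article assume an environment variable named MahoutDir that points to the Mahout folder (presumably ‘C:\apps\dist\mahout-0.8’ from the install step); if it is not defined on your node, set it first. In the ‘Hadoop Command Line’ prompt execute the following commands: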
hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seqdirectory -i /user/admin/dataset -o /user/admin/dataset-seq -ow
hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seq2sparse -i /user/admin/dataset-seq -o /user/admin/dataset-vectors -lnorm -nv -wt tfidf
This converts the input text files to Mahout vectors. So what happened here? We are telling Hadoop to load the Mahout package called ‘mahout-examples-0.8-job.jar’ (this holds the MapReduce jobs required for running the examples packaged with Mahout). Then we tell it to make use of the MahoutDriver Java class, which sets up everything to run Mahout commands (this is my interpretation, because I really don’t know what goes on behind the scenes…). Finally we call two Mahout commands with parameters pointing to our dataset: seqdirectory, which converts the directory of text files into Hadoop sequence files, and seq2sparse, which turns those sequence files into sparse vectors (here weighted by TF-IDF, as requested with ‘-wt tfidf’).
The next part is splitting the resulting vectors into two sets: one we use for training the algorithm, the other for testing. We hide the testing vectors from the model, and only after it has been fully trained do we let it try to classify the testing vectors without telling it what is ham or spam (don’t worry, Mahout will take care of this for us later on).
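To give a feel for what ‘vectorizing’ with TF-IDF means, here is a small Python sketch. It uses the textbook weight (term count times log of N divided by document frequency) on a few made-up documents; Mahout’s exact weighting and normalization are different, so treat this as an illustration of the idea and not as what seq2sparse computes.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Turn tokenized documents into sparse {term: weight} dictionaries."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        term_freq = Counter(doc)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors

# Hypothetical, already tokenized documents.
docs = [
    "win money now".split(),
    "meeting notes for monday".split(),
    "win a free prize now".split(),
]
for vec in tfidf_vectors(docs):
    print(vec)
```

Words that occur in few documents get a high weight and words that occur almost everywhere get a weight close to zero, which is why this representation is useful for telling documents apart. The vectors are ‘sparse’ because each one only stores the words that actually occur in that document.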
Splitting is done using:
hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver split -i /user/admin/dataset-vectors/tfidf-vectors --trainingOutput /user/admin/train-vectors --testOutput /user/admin/test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
This splits 40% into testing vectors and the remaining 60% into training vectors. This concludes the preparation step.
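The idea behind the split can be sketched in a few lines of Python. This is an assumption-laden illustration, not Mahout’s selection logic: the function name, the fixed seed and the use of plain integers as stand-ins for vectors are all made up for the example.

```python
import random

def split_vectors(vectors, test_pct=40, seed=42):
    """Randomly hold out about test_pct percent of the items for testing."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(vectors)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_pct / 100)
    # Everything after the held-out slice is used for training.
    return shuffled[n_test:], shuffled[:n_test]

# Ten stand-in "vectors" (just the numbers 0..9).
train, test = split_vectors(range(10))
print(len(train), len(test))
```

The important property is that no item ends up in both sets, so the test really measures how the model handles documents it has never seen.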
Training the Machine Learning model
This is probably the easiest step. To train a machine learning model using the Naïve Bayes algorithm, run the following command:
hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver trainnb -i /user/admin/train-vectors -el -o /user/admin/model -li /user/admin/labelindex -ow
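If you are curious what training a Naïve Bayes model boils down to, the following Python sketch shows the textbook multinomial variant on raw word counts with add-one smoothing. It is not Mahout’s implementation (trainnb works on the weighted vectors produced earlier and its internals may differ); the function names and the four tiny training documents are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Count words per label; labeled_docs is a list of (label, tokens)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, tokens in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary

def classify(model, tokens):
    """Return the label with the highest log-probability for the tokens."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # log prior of the label
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Add-one smoothing so unseen words do not zero out the score.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: (label, tokens) pairs.
training = [
    ("spam", "win money now".split()),
    ("spam", "free prize win".split()),
    ("ham", "meeting notes for monday".split()),
    ("ham", "lunch on monday".split()),
]
model = train_naive_bayes(training)
print(classify(model, "win a free prize".split()))
print(classify(model, "monday meeting".split()))
```

‘Training’ here is nothing more than counting, which is why this step is cheap and parallelizes well across a cluster.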
Testing the Machine Learning model
Also an easy step, run:
hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver testnb -i /user/admin/test-vectors -m /user/admin/model -l /user/admin/labelindex -ow -o /user/admin/testing
After the job completes it will print the result of the test, indicating how accurate the model was in predicting spam, along with details on classification errors.
There are a couple of gotchas when working with Hadoop/Mahout. One that I have found is that you need to provide absolute paths (like /user/admin/dataset) for the input and output data. Otherwise you will run into problems where Mahout stores intermediate files under the wrong user directory, because it is impersonating another user (hdp).
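The kind of summary such a test produces can be sketched in Python as follows: an overall accuracy plus a confusion matrix that counts how often each actual label was predicted as each label. The labels and predictions below are made up, and the output format is not the one Mahout prints.

```python
from collections import Counter

def evaluate(actual, predicted):
    """Return overall accuracy and a confusion matrix keyed by (actual, predicted)."""
    confusion = Counter(zip(actual, predicted))
    correct = sum(n for (a, p), n in confusion.items() if a == p)
    return correct / len(actual), confusion

# Hypothetical true labels and model predictions for five test documents.
actual = ["spam", "spam", "ham", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham", "spam"]

accuracy, confusion = evaluate(actual, predicted)
print(f"accuracy: {accuracy:.0%}")
for (a, p), n in sorted(confusion.items()):
    print(f"actual={a:<4} predicted={p:<4} count={n}")
```

The off-diagonal entries (actual spam predicted as ham, and the reverse) are the classification errors, and for a spam filter the two kinds of error usually do not cost the same.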
I’ve also found that if your input documents are very large in size (10+ MB), you can run into ’heap space’ exceptions. It should not be a problem with the sample I provided here, but if you train on your own data, it could be an issue. I simply deleted the input samples that were very large to avoid this problem.
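If you want to spot the oversized input files before uploading, a short Python script can list them. This is a sketch under the assumption that Python is available where the ‘dataset’ folder lives; the 10 MB threshold simply mirrors the rough figure mentioned above and is not a documented limit.

```python
import os

def find_large_files(root, limit_bytes=10 * 1024 * 1024):
    """Yield (path, size) for every file under root larger than limit_bytes."""
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(folder, name)
            size = os.path.getsize(path)
            if size > limit_bytes:
                yield path, size

# List offenders in the local 'dataset' folder (prints nothing if none exist).
for path, size in find_large_files("dataset"):
    print(f"{size / 1024 / 1024:.1f} MB  {path}")
```

The script only reports the files; whether to delete or split them is left to you.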
Finally, the algorithm I’ve been using is Naïve Bayes. It is a very simple Machine Learning algorithm that usually does not reach high prediction accuracy (say 95+%). The field of Machine Learning has developed better algorithms, but unfortunately Mahout is missing support for some very powerful ones, such as Support Vector Machines (SVM), which are very popular in Machine Learning.
This example demonstrated how to apply Machine Learning to classify spam, but as I mentioned previously, you can apply it to any text documents just by replacing the data set with your own. As long as you can provide sufficient data samples and accurately categorize the training data, you should be able to train a machine learning model.