In this article
I’m going to show how to set up a machine learning algorithm to classify documents
using a Big Data analytics cluster running in Windows Azure (HDInsight).
Azure HDInsight is
an easy and flexible way to provision an analytics cluster in the cloud for
doing Big Data analysis. A cluster consists of one main node, one job tracker node and any number of compute nodes (workers) that do the heavy lifting. The cluster is backed by the Open Source Big Data platform called Hadoop. Hadoop is written in Java and is an extensible open platform for analyzing large sets of data in a fast and resilient fashion.
It’s built on top of the Hadoop Distributed File System (HDFS), a distributed file system similar to Google’s GFS, and the so-called MapReduce algorithm, which distributes (map) work among any number of cluster nodes and collects (reduce) the results into a single output (usually stored in HDFS).
Mahout is an extension library built on top of Hadoop which enables
Machine Learning algorithms to run in the cluster.
I’m assuming you have a Windows Azure subscription. If not, it is possible to get free access to Azure with an MSDN subscription (which you should be able to obtain if you are a student), or you can get a 1-month free trial here.
Setting up the Windows Azure HDInsight cluster
As of August 2013 you need to sign up for the preview features of Azure, as HDInsight is not part of the standard Azure package; this may change in the near future when HDInsight is completed. The steps to do so are outlined here, but I’ll repeat them:
· Try to create an HDInsight data service in the Windows Azure Management Portal; it will be grayed out and instead you should see a link to ‘preview features’. Follow that link and sign up for HDInsight.
After you sign up and get a completion notification, you are ready to provision an HDInsight cluster, but first we need to set up a dedicated storage account (one that is not geo-replicated):
· Create a new storage account, making sure to uncheck the geo-replication box
Now we provision our cluster:
· Start a new provisioning and select HDInsight cluster under the Data Services category
· Pick the storage account you created in the previous step
· Select a size for the cluster (small will do just fine) and a strong password
· Click create
A note here: I’ve encountered an error when trying this step, where it fails on the second operation. I just wait a couple of minutes, retry, and things work out fine. I think this happens because I provision the cluster too soon after enabling the preview feature, but I’m not certain.
Logging onto the cluster
After the cluster is created you can find it in its own category in the Windows Azure Management Portal under HDInsight. Select your cluster and click ‘Manage Cluster’; this will bring you into an HDInsight-specific management portal.
Click the ‘Remote Desktop’ tile, which will open an *.rdp file to connect to your main node.
Running Hadoop commands
Once you log onto the main cluster node you should find a shortcut on the desktop named ‘Hadoop Command Line’. Click it to start a Hadoop prompt, which will show you some high-level help.
Hadoop is the engine behind our machine learning example, so it makes sense to explore it a bit; however, I won’t go into details because my knowledge is limited at this point.
The fundamental functionality of Hadoop has to do with interacting with the file system. Keep in mind that in the cluster you have a completely separate file system (HDFS) which is not visible to Windows (it’s backed by the storage account you created before provisioning the cluster). Hadoop exposes commands to work with the HDFS file system, which you invoke using:
hadoop fs -<command> <args>
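You can get the full list of file system commands and the arguments they take with:
hadoop fs -help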
File system commands have Unix names (like ls, rm, cat, etc.). So to upload a file from the local disk to HDFS, you simply execute:
hadoop fs -put local_folder\local_file hdfs_folder/hdfs_file
This will upload a
local file (or folder) to HDFS and the data is now immediately available to the
compute nodes. For more info about file system commands check the documentation.
Notice: the path formats differ between local paths and HDFS paths. HDFS paths use the Unix format while the local file system of course uses the Windows path format.
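To make the difference concrete, here is what an upload looks like with both styles spelled out (the folder and file names below are just placeholders):
hadoop fs -put C:\somefolder\somefile.txt somefolder/somefile.txt
hadoop fs -ls somefolder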
Another thing to keep in mind is that for performance reasons the HDFS file system is immutable; that means once data is placed on the file system you cannot edit it. If you did something wrong, you need to delete the files and then recreate them.
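So a “fix the contents of an uploaded file” workflow looks like this, using the file names from the word count example below (-rm removes a single file, -rmr removes a whole folder):
hadoop fs -rm test/input.txt
hadoop fs -put mytextfile.txt test/input.txt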
The last thing we need to know about Hadoop is that you run jobs using the ‘jar’ command, which runs functionality exposed by Java packages in Hadoop. One example could be to run a word counting job to count occurrences of words in a text file. Create a new test file named mytextfile.txt and type something into it, then run the following steps:
hadoop fs -put mytextfile.txt test/input.txt
hadoop jar hadoop-examples.jar wordcount test/input.txt output
hadoop fs -cat output/*
This will create a job that runs in the cluster, scanning ‘mytextfile.txt’ for all words and returning a list of words and their respective number of occurrences in the file.
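The exact words will of course depend on what you typed into the test file, but the output printed by the final cat command is a list of tab-separated word/count pairs along these lines:
a       1
hello   2
world   1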
The way to
interpret the output is that the word ‘a’ occurred once, ‘hello’ occurred
twice, etc. The output of this command is a perfect example of a ‘vector’ where
each word is a dimension in the vector and the occurrence count is the value in
that dimension of the vector. As we shall see later, vectors are a recurring
theme in Hadoop – large data sets are almost always represented as vectors and
matrices.
This is of course a trivial example, and it is complete overkill to use a data analytics cluster to run this job. However, the same principles apply to Big Data analysis; it just takes a lot longer, and that’s why it’s nice to have a cluster helping you out.
Machine Learning in Azure
Time to move on to Machine Learning! The first step is to install Mahout onto the cluster. Go here, then choose the closest mirror to you and open the folder 0.8. Download 'mahout-distribution-0.8.zip', copy it to the cluster's main node using RDP (Windows copy/paste over remote desktop works), unzip it and place the files under 'C:\apps\dist\mahout-0.8'.
Now that this is in place, we can start using Mahout for doing Machine Learning. The example I’m going to run through here is that of a ‘classification’ model. This is a model that takes incoming documents (think of these in general terms: a document could be an email, a log file, a web request, etc.) and tries to predict something about each document. Now this ‘something’ is entirely up to us to decide upon. For emails the classic example is to predict if an email is spam or not (we call that spam or ham). For web requests we could try to classify whether the request is legitimate or a denial-of-service (DoS) request sent from an attacker. Whatever it is we are trying to predict, the general idea is that we must provide the algorithm with a training data set in which we have labeled, for each document, what the outcome was (e.g. spam or ham), so that the algorithm can learn from these examples.
But what do we get
out of that exactly? One document on its own is not going to be very helpful,
but the premise of machine learning is that you feed the model tons of training data (this is where
machine learning ties into Big Data), and by churning through all this data,
the model will detect hidden patterns in the data that can be used to predict
the outcome of future documents (without knowing if they are actually spam or
not ahead of time).
So how does this work? Let’s try to break it down. For reference I have used link [6] at the bottom to get started on this, but that introduction is based on Mahout 0.5, which is outdated by now, so I would not suggest following it.
We need to perform the following steps to build a classification model:
1. Upload the input data that we want to categorize to HDFS
2. Prepare the data by converting it to input vectors (more on that later)
3. Train the machine learning algorithm
4. Test the machine learning algorithm
Creating the data set
We need a data set to train on. You could try using your own, but I would suggest running through this exercise with the sample provided by SpamAssassin, as I’m doing here.
On your local
machine, create a new directory named ‘dataset’. Download the following files: ham and spam and unzip them
under ‘dataset\ham’ and ‘dataset\spam’ respectively (these are ‘*.tar.bz2’ files, so unfortunately you need an extraction tool like WinZip or 7-Zip). Now zip up the entire ‘dataset’ folder and copy it onto the
main cluster node (use RDP again). Remote into the main node and create a
‘work’ folder under ‘c:\apps\dist\mahout-0.8\’ and extract your data files
there.
Now we need to
upload the data set to HDFS. Start a ‘Hadoop Command Line’ prompt and enter:
cd c:\apps\dist\mahout-0.8\work
hadoop fs -put dataset dataset
This can take a
while. After it completes, your data is now stored in HDFS in a folder named
‘dataset’.
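To sanity-check the upload you can list the HDFS folder; it should show the ‘ham’ and ‘spam’ sub-folders you created locally:
hadoop fs -ls dataset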
Converting the data set to input vectors
Now we need to
take all of these input text files and convert them to something Mahout’s
training algorithms can use – they need to be ‘vectorized’. In the ‘Hadoop
Command Line’ prompt execute the following commands:
SET MahoutDir=c:\apps\dist\mahout-0.8
hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seqdirectory -i /user/admin/dataset -o /user/admin/dataset-seq -ow
hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seq2sparse -i /user/admin/dataset-seq -o /user/admin/dataset-vectors -lnorm -nv -wt tfidf
This converts the input text files to vectors. So what happened here? We are telling Hadoop to load the Mahout package called ‘mahout-examples-0.8-job.jar’ (this holds the MapReduce jobs required for running the examples packaged with Mahout), then we tell it to use the MahoutDriver Java class, which sets up everything needed to run Mahout commands (this is my interpretation because I really don’t know what goes on behind the scenes…), and finally we call two Mahout commands, seqdirectory and seq2sparse, with parameters pointing to our dataset.
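If you are curious about what seq2sparse produced, list its output folder; among other things you should find a ‘tfidf-vectors’ sub-folder, which is what we feed to the next step:
hadoop fs -ls /user/admin/dataset-vectors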
Now the next part is splitting the resulting vectors into two sets: one we use for training the algorithm, the other for testing. We hide the testing vectors from the model, and only after it has been fully trained do we let it try to classify the testing vectors without telling it which are ham and which are spam (don’t worry, Mahout will take care of this for us later on).
Splitting is done using:
hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver split -i /user/admin/dataset-vectors/tfidf-vectors --trainingOutput /user/admin/train-vectors --testOutput /user/admin/test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
This splits 40%
into testing vectors and the remaining 60% into training vectors. This
concludes the preparation step.
Training the Machine Learning model
This is probably the easiest step. To train a machine learning model using the Naïve Bayes algorithm, run the following command:
hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver trainnb -i /user/admin/train-vectors -el -o /user/admin/model -li /user/admin/labelindex -ow
Testing the Machine Learning model
Also an easy step,
run:
hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver testnb -i /user/admin/test-vectors -m /user/admin/model -l /user/admin/labelindex -ow -o /user/admin/testing
After the job completes it will spit out the result of the test, indicating how accurate the model was at predicting spam, along with details on classification errors.
Closing remarks
There are a couple of gotchas when working with Hadoop/Mahout. One that I have found is that you need to provide absolute paths (like /user/admin/dataset) for the input and output data; otherwise you run into problems where Mahout stores intermediate files under the wrong user directory, because it is impersonating another user (hdp).
I’ve also found
that if your input documents are very large in size (10+ MB), you can run into
‘heap space’ exceptions. It should not be a problem with the sample I provided
here, but if you train on your own data, it could be an issue. I simply deleted
the input samples that were very large to avoid this problem.
Finally, the algorithm I’ve been using is Naïve Bayes, which is a very simple Machine Learning algorithm and usually does not give very high prediction accuracy (say 95+%). The field of Machine Learning has developed better algorithms, but unfortunately Mahout is missing support for some very powerful ones like Support Vector Machines (SVM), which are very popular in Machine Learning.
This example demonstrated how to apply Machine Learning to classify spam, but as I mentioned previously you can apply it to any text documents just by replacing the data set with your own. As long as you can provide sufficient data samples and accurately categorize the training data, you should be able to train a machine learning model.
Links