Applied Model Based Testing - Experiences & Examples

So I was reading up on PowerShell arrays on this TechNet site:

http://technet.microsoft.com/en-us/library/ee692797.aspx

It gives a brief introduction to PowerShell arrays and illustrates how powerful they are. However, arrays in PowerShell are a curious beast indeed, and in some cases they can deceive you, so I thought I'd share some additional insights in the hope of saving you from struggling with them for too long.

Let's start with the basics. One thing you can do with arrays is concatenation:

$a = 1,2,3

$a += 4

Now $a is 1,2,3,4. Neat.

But wait a minute… let's try changing this example a bit

$a = CommandThatReturnsIntegers

$a += 4

If CommandThatReturnsIntegers returns 1,2,3 then everything works as expected. But what if it just returns 1?

$a = 1

$a += 4

Now $a is 5… oops? What just happened? In essence, we expected an array from CommandThatReturnsIntegers, but instead it gave us a single integer, so += performed integer addition instead of appending an element. We could fix the command so that it always returns @(1) instead of 1; then we get:

$a = @(1)

$a += 4

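Now $a is 1,4. But what if we don't control the output of CommandThatReturnsIntegers? Then we have to deal with the result on our side, for example by wrapping the call itself as @(CommandThatReturnsIntegers), which always yields an array.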

An alternative is to use the comma operator, like this:

$a = 1

$a = $a,4

$a now is 1,4 as we would like it to be.

Let's try that with a starting array

$a = 1,2,3

$a = $a,4

Again, when evaluated $a is 1,2,3,4 as expected. Everything looks fine right?

Now try this:

$a.count

2? Wait a minute, it evaluated to 1,2,3,4, yet the number of elements in the array is 2?

Let's see what we find when we look under the hood…

$a | % { "$_" }

Here I'm telling PowerShell to produce a string representation of the content of every element in the array. The output is:

1 2 3

4

So, as you may have guessed already, under the hood $a is a nested array: its first element is the array 1,2,3 and its second element is 4.

This is a common source of trouble when piping input to other commands. We would expect the command to be evaluated on each element, not on each nested array.

As an example we might want to sum the values using:

$a | measure -Sum

However, this raises the exception:

Input object "System.Object[]" is not numeric.

But then why did $a look like 1,2,3,4 when we first evaluated it? The reason is that PowerShell invokes black magic to unroll the nested array on our behalf when it displays the result!

That's pretty bad and causes much deception in PowerShell-land. But there's a plus side to this: we can leverage (abuse?) PowerShell's black magic to unroll the array for us, using the following syntax:

$b = $a | % { $_ }

Now $b is a truly flattened array. We can test this using:

$a | % { $_ } | measure -Sum

Which now gives us the expected value of 10.

The benefit of this syntax is that it works regardless of whether $a is a single element, an array or an array nested one level deep.

In conclusion, when working with arrays, take care and try to understand what goes on under the hood. If you end up in a situation where the array is not behaving as expected, do not rely on evaluating it in PowerShell to inspect its structure. Instead, use $a | % { "$_" } to see what it actually holds. If you are working with arrays and concatenation, make your command robust by unrolling the input arrays using $a | % { $_ }.

In this article I’m going to show how to set up a machine learning algorithm to classify documents using a Big Data analytics cluster running in Windows Azure (HDInsight).

Azure HDInsight is an easy and flexible way to provision an analytics cluster in the cloud for doing Big Data analysis. A cluster consists of 1 main node, 1 job tracker node and any number of compute nodes (workers) that do the heavy lifting. The cluster is backed by the Open Source Big Data platform called Hadoop. Hadoop is written in Java, and it is an extensible open platform for analyzing large sets of data in a fast and resilient fashion. It’s built on top of the Hadoop Distributed File System (HDFS), which is a distributed file system similar to Google’s GFS, and the so-called MapReduce programming model, which distributes (map) work among any number of cluster nodes and collects (reduce) the results into a single output (usually stored in HDFS).
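To make the map and reduce steps a bit more concrete, here is a minimal single-machine sketch in Python. It only mimics the idea and is not how Hadoop is implemented; the function names (map_phase, shuffle, reduce_phase) and the sample text are made up for illustration.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce step: collapse the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}

# Each chunk stands in for the part of the input handled by one worker node.
chunks = ["hello big data", "hello cluster"]
emitted = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(emitted))
print(counts)
```

In a real cluster the chunks live on different nodes and the map calls run in parallel; only the grouping and summing have to bring the data together.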

Mahout is an extension library built on top of Hadoop which enables Machine Learning algorithms to run in the cluster.

I’m assuming you have a Windows Azure subscription. If not, it is possible to get free access to Azure with an MSDN subscription (which you should be able to obtain if you are a student), or you can get a 1-month free trial here.

Setting up the Windows Azure HDInsight cluster

As of August 2013 you need to sign up for the preview features of Azure, as HDInsight is not part of the standard Azure package. This may change in the near future when HDInsight is completed. The steps to do so are outlined here, but I’ll repeat them:

· Try to create an HDInsight data service in the Windows Azure Management Portal. It will be grayed out, and instead you should see a link to ‘preview features’. Follow that link and sign up for HDInsight.

After you sign up and get a completed notification, you are ready to provision an HDInsight cluster, but first we need to set up a dedicated storage account (which is not geo-replicated):

· Create a new storage account, making sure to uncheck the geo-replicated box

Now we provision our cluster:

· Make a new provisioning and select HDInsight cluster under the data services category

· Pick the storage account you created in the previous step

· Select a size for the cluster (small will do just fine) and a strong password

· Click create

A note here: I’ve encountered an error when trying this step, where it fails on the second operation. Waiting a couple of minutes and retrying worked fine for me. I think this happens when too little time has passed between enabling the preview feature and provisioning the cluster, but I’m not certain.

Logging onto the cluster

After the cluster is created you can find it in its own category in the Windows Azure Management Portal under HDInsight. Select your cluster and click ‘Manage Cluster’. This will bring you into an HDInsight-specific portal, which will look something like this:

Click the ‘Remote Desktop’ tile which will open an *.rdp file to connect to your main node.

Running Hadoop commands

Once you log onto the main cluster node you should find a shortcut on the desktop by the name ‘Hadoop Command Line’. Click it to start a Hadoop prompt, which will show you some high level help:

Hadoop is the engine behind our machine learning example, so it makes sense to explore it a bit. However, I won’t go into details, because my knowledge of it is limited at this point.

The fundamental functionality of Hadoop has to do with interacting with the file system. Keep in mind that in the cluster you have a completely separate file system (HDFS), which is not visible to Windows (it’s backed by the storage account you created before provisioning the cluster). Hadoop exposes commands for working with the HDFS file system, which you invoke using:

Hadoop fs -<command> <args>

File system commands have Unix names (like ls, rm, cat, etc). So to upload a file from the local disk to HDFS, you simply execute:

Hadoop fs -put local_folder\local_file hdfs_folder/hdfs_file

This will upload a local file (or folder) to HDFS and the data is now immediately available to the compute nodes. For more info about file system commands check the documentation.

Notice: The path formats differ between local path and HDFS path. The HDFS format is Unix while the local file system is of course Windows path format.

Another thing to keep in mind is that, for performance reasons, the HDFS file system is immutable. That means once data is placed on the file system, you cannot edit it. If you did something wrong, you need to delete the files and then recreate them.

The last thing we need to know about Hadoop is that you run jobs using the ‘jar’ command, which runs functionality exposed by Java packages in Hadoop. One example could be to run a word counting job to count occurrences of words in a text file. Create a new text file named mytextfile.txt and type something into it, then run the following steps:

Hadoop fs -put mytextfile.txt test/input.txt

Hadoop jar hadoop-examples.jar wordcount test/input.txt output

Hadoop fs -cat output/*

This will create a job that runs in the cluster, scans the uploaded copy of ‘mytextfile.txt’ for all words and returns a list of words and their respective number of occurrences in the file. In my case the input and output look like this:

The way to interpret the output is that the word ‘a’ occurred once, ‘hello’ occurred twice, etc. The output of this command is a perfect example of a ‘vector’ where each word is a dimension in the vector and the occurrence count is the value in that dimension of the vector. As we shall see later, vectors are a recurring theme in Hadoop – large data sets are almost always represented as vectors and matrices.
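As a small illustration of word counts as vectors, here is a sketch in plain Python. It has nothing to do with how Hadoop stores its data; the two sample texts and the helper names (count_words, to_vector) are invented for the example.

```python
from collections import Counter

def count_words(text):
    """Count how often each word occurs in a piece of text."""
    return Counter(text.lower().split())

def to_vector(counts, vocabulary):
    """Lay the counts out in a fixed order: one dimension per vocabulary word."""
    return [counts.get(word, 0) for word in vocabulary]

doc_a = "hello world hello"
doc_b = "a big world"

counts_a = count_words(doc_a)
counts_b = count_words(doc_b)

# Both documents must share the same dimensions to be comparable.
vocabulary = sorted(set(counts_a) | set(counts_b))
vector_a = to_vector(counts_a, vocabulary)
vector_b = to_vector(counts_b, vocabulary)

print(vocabulary)
print(vector_a)
print(vector_b)
```

Stacking one such vector per document gives a matrix with a row per document and a column per word, which is the kind of structure the later steps operate on.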

This is of course a trivial example, and it is complete overkill to use a data analytics cluster to run this job. However, the same principles apply to Big Data analysis; it just takes a lot longer, and that’s why it’s nice to have a cluster helping you out.

Machine Learning in Azure

Time to move on to Machine Learning! The first step is to install Mahout onto the cluster. Go here, then choose the closest mirror to you and open the folder 0.8. You should see the following

Download 'mahout-distribution-0.8.zip', copy it to the cluster's main node using RDP (Windows copy/paste over remote desktop works), unzip it and place the files under 'C:\apps\dist\mahout-0.8'.

Now that Mahout is in place, we can start using it for doing Machine Learning. The example I’m going to run through here is that of a ‘classification’ model. This is a model that takes incoming documents (think of this in general terms: a document could be an email, a log file, a web-request, etc.) and tries to predict something about each document. Now this ‘something’ is entirely up to us to decide upon. For emails the classic example is to predict if an email is spam or not (we call that spam or ham). For web-requests we could try to classify if the request is legitimate or a denial-of-service (DoS) request sent from an attacker. Whatever it is we are trying to predict, the general idea is that we must provide the algorithm with a training data set in which we have told it, for each document, what the outcome was (e.g. spam or ham), such that the algorithm can learn from this fact.

But what do we get out of that exactly? One document on its own is not going to be very helpful, but the premise of machine learning is that you feed the model tons of training data (this is where machine learning ties into Big Data), and by churning through all this data, the model will detect hidden patterns in the data that can be used to predict the outcome of future documents (without knowing if they are actually spam or not ahead of time).

So how does this work? Let’s try to break it down. For reference I have used link [6] at the bottom to get started on this, but that introduction is based on Mahout 0.5, which is outdated by now, so I would not suggest following it.

We need to perform the following steps for making a classification algorithm:

1. Upload the input data that we want to categorize to HDFS

2. Prepare the data by converting it to input vectors (more on that later)

3. Train the machine learning algorithm

4. Test the machine learning algorithm

Creating the data set

We need a data set to train on. You could try using your own, but I would suggest running through this exercise with the sample provided by Spam Assassin, as I’m doing here.

On your local machine, create a new directory named ‘dataset’. Download the following files: ham and spam and extract them under ‘dataset\ham’ and ‘dataset\spam’ respectively (unfortunately these are ‘*.tar.bz2’ files, so you need a tool like WinZip or 7-Zip to extract them). Now zip up the entire ‘dataset’ folder and copy it onto the main cluster node (use RDP again). Remote into the main node, create a ‘work’ folder under ‘c:\apps\dist\mahout-0.8\’ and extract your data files there.

Now we need to upload the data set to HDFS. Start a ‘Hadoop Command Line’ prompt and enter:

cd c:\apps\dist\mahout-0.8\work

Hadoop fs -put dataset dataset

This can take a while. After it completes, your data is now stored in HDFS in a folder named ‘dataset’.

Converting data set to input vectors

Now we need to take all of these input text files and convert them to something Mahout’s training algorithms can use – they need to be ‘vectorized’. In the ‘Hadoop Command Line’ prompt execute the following commands:

SET MahoutDir=c:\apps\dist\mahout-0.8

hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seqdirectory -i /user/admin/dataset -o /user/admin/dataset-seq -ow

hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seq2sparse -i /user/admin/dataset-seq -o /user/admin/dataset-vectors -lnorm -nv -wt tfidf

This converts the input text files to Hadoop vectors. So what happened here? We are telling Hadoop to load the Mahout package called ‘mahout-examples-0.8-job.jar’ (this holds the MapReduce jobs required for running the examples packaged with Mahout), then we tell it to use the MahoutDriver Java class, which sets up everything needed to run Mahout commands (this is my interpretation, because I really don’t know what goes on behind the scenes…). Finally, we call two Mahout commands, seqdirectory and seq2sparse, with some parameters pointing to our dataset.
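The ‘-wt tfidf’ option asks for TF-IDF weighting: a word counts for more when it is frequent in a document but rare across the data set. The sketch below shows the textbook form of that idea in Python, assuming the simple weight tf * log(N / df). Mahout's exact formula and normalization may well differ, and the sample documents and the function name tfidf_vectors are made up.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Turn raw texts into TF-IDF weighted vectors over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append([
            term_freq[word] * math.log(n_docs / doc_freq[word])
            for word in vocabulary
        ])
    return vocabulary, vectors

docs = [
    "buy cheap pills now",
    "meeting notes for now",
    "buy cheap cheap pills now",
]
vocabulary, vectors = tfidf_vectors(docs)
for doc, vector in zip(docs, vectors):
    print(doc, [round(weight, 2) for weight in vector])
```

Note how ‘now’, which occurs in every sample document, ends up with weight zero: a word that is everywhere tells the model nothing about spam versus ham.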

Now the next part is splitting the resulting vectors into two sets, one we use for training the algorithm, the other one is for testing. We hide the testing vectors from the model and only after it has been fully trained, we let it try to classify the testing vectors without telling it what is ham or spam (don’t worry Mahout will take care of this for us later on).

Splitting is done using:

hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver split -i /user/admin/dataset-vectors/tfidf-vectors --trainingOutput /user/admin/train-vectors --testOutput /user/admin/test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

This splits 40% into testing vectors and the remaining 60% into training vectors. This concludes the preparation step.
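Conceptually the split is nothing more than a random partition of the vectors. A minimal Python sketch of a 40/60 split follows, assuming a shuffle-and-cut strategy; I don't know how Mahout selects the test items internally, and split_vectors and its parameters are hypothetical names.

```python
import random

def split_vectors(items, test_pct=40, seed=0):
    """Randomly partition items into (training, testing) lists."""
    shuffled = list(items)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = len(shuffled) * test_pct // 100
    return shuffled[n_test:], shuffled[:n_test]

# Integers stand in for the document vectors.
train, test = split_vectors(range(10), test_pct=40)
print(len(train), "training items,", len(test), "testing items")
```

The important property is that the two sets never overlap, so the model is tested only on documents it has not seen during training.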

Training the Machine Learning model

This is probably the easiest step. To train a machine learning model using the Naïve Bayes algorithm, run the following command:

hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver trainnb -i /user/admin/train-vectors -el -o /user/admin/model -li /user/admin/labelindex -ow
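To give a feeling for what Naïve Bayes training boils down to, here is a tiny multinomial Naïve Bayes sketch in Python, working on raw word counts with add-one smoothing. This is a textbook version under my own assumptions, not Mahout's trainnb implementation, and the training sentences and function names are made up.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Collect per-label document and word counts from (label, text) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in labeled_docs:
        words = text.lower().split()
        doc_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return doc_counts, word_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Prior: how common is this label in the training data?
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training carry no information
            count = word_counts[label][word]
            # Add-one smoothing avoids zero probabilities for unseen pairs.
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    ("spam", "buy cheap pills now"),
    ("spam", "cheap pills cheap offer"),
    ("ham", "meeting notes attached"),
    ("ham", "lunch meeting tomorrow"),
]
model = train_naive_bayes(training)
print(classify(model, "cheap offer now"))
print(classify(model, "notes from the meeting"))
```

The ‘model’ is just a set of counts, which is why training parallelizes so well: counting is exactly the kind of work MapReduce is good at.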

Testing the Machine Learning model

This is also an easy step. Run:

hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver testnb -i /user/admin/test-vectors -m /user/admin/model -l /user/admin/labelindex -ow -o /user/admin/testing

After the job completes, it will spit out the result of the test, indicating how accurate the model was in predicting spam, along with details on classification errors.
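The kind of summary a test run produces can be sketched in a few lines of Python: compare the predicted label to the actual label for every test document, then report accuracy and a confusion matrix. The labels below are invented toy data, not results from the Spam Assassin run, and evaluate is a hypothetical helper rather than anything from Mahout.

```python
from collections import Counter

def evaluate(actual, predicted):
    """Return (accuracy, confusion), where confusion counts (actual, predicted) pairs."""
    confusion = Counter(zip(actual, predicted))
    correct = sum(n for (a, p), n in confusion.items() if a == p)
    return correct / len(actual), confusion

# Invented labels for five test documents.
actual = ["spam", "spam", "ham", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham", "spam"]

accuracy, confusion = evaluate(actual, predicted)
print("accuracy:", accuracy)
for (a, p), n in sorted(confusion.items()):
    print(f"actual={a} predicted={p}: {n}")
```

The off-diagonal entries are the interesting ones: ham classified as spam is usually a more costly mistake than spam classified as ham.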

Closing remarks

There are a couple of gotchas when working with Hadoop/Mahout. One that I have found is that you need to provide absolute paths (like /user/admin/dataset) to the input and output data. Otherwise you will run into problems where Mahout stores intermediate files under the wrong user directory, because it is impersonating another user (hdp).

I’ve also found that if your input documents are very large in size (10+ MB), you can run into ‘heap space’ exceptions. It should not be a problem with the sample I provided here, but if you train on your own data, it could be an issue. I simply deleted the input samples that were very large to avoid this problem.

Finally, the algorithm I’ve been using is Naïve Bayes, which is a very simple Machine Learning algorithm that usually does not reach very high prediction accuracy (say 95+%). The field of Machine Learning has developed better algorithms, but unfortunately Mahout is missing support for some very powerful ones, like Support Vector Machines (SVM), which are very popular in Machine Learning.

This example demonstrated how to apply Machine Learning to classify spam, but as I mentioned previously, you can apply it to any text documents just by replacing the data set with your own. As long as you can provide sufficient data samples and accurately categorize the training data, you should be able to train a machine learning model.

Links

[1] Official Hadoop website

[2] Official Mahout website

[3] Windows Azure free-trial

[4] Getting Started with Windows Azure HDInsight Service

[5] Simple recommendation engine using Apache Mahout

[6] Introduction to building a spam filter with Mahout (outdated)

[7] Using Mahout to classify tweets

[8] Hadoop Shell Commands

Assembly	From	To
Microsoft.ServiceBus	2.1.0.0	2.2.0.0
Microsoft.WindowsAzure.Diagnostics	2.1.0.0	2.2.0.0
Microsoft.WindowsAzure.ServiceRuntime	2.1.0.0	2.2.0.0
Microsoft.WindowsAzure.Storage	2.0.0.0	2.1.0.0

Thursday, October 9, 2014

PowerShell and arrays a deceptive combination

Friday, November 1, 2013

Upgrading Windows Azure Service projects to Visual Studio 2013 does not work out-of-the-box

Monday, September 2, 2013

Large Scale Machine Learning powered by Windows Azure HDInsight