In this article
I'm going to show how to set up a machine learning algorithm to classify documents
using a Big Data analytics cluster running in Windows Azure (HDInsight).
Azure HDInsight is
an easy and flexible way to provision an analytics cluster in the cloud for
doing Big Data analysis. A cluster consists of one main node, one job tracker node,
and any number of compute nodes (workers) that do the heavy lifting. The
cluster is backed by the open-source Big Data platform Hadoop. Hadoop is written in Java and is an extensible
open platform for analyzing large sets of data in a fast and resilient fashion.
It's built on top of the Hadoop Distributed File System (HDFS), a
distributed file system similar to Google's GFS, and the so-called MapReduce
algorithm, which distributes (map) work among any number of cluster nodes and
collects (reduce) the results into a single output (usually stored in HDFS).
Mahout is an extension library built on top of Hadoop which enables
Machine Learning algorithms to run in the cluster.
I'm assuming you have a Windows Azure subscription. If not, it is possible to get free access to
Azure with an MSDN subscription (which you should be able to obtain if you
are a student), or you can get a 1-month free trial here.
Setting up the Windows Azure HDInsight cluster
As of August 2013 you need to sign up for the preview features of Azure, as HDInsight is not part
of the standard Azure package; this may change in the near future when
HDInsight is completed. The steps to do so are outlined here, but I'll repeat
them:
· Try to create an HDInsight data service in the Windows Azure Management Portal; it will be grayed out, and instead you should see a link to 'preview features'. Follow that link and sign up for HDInsight.
After you sign up and get a completed notification, you are ready to provision an HDInsight
cluster, but first we need to set up a dedicated storage account (one that is not geo-replicated).
· Create a new storage account, making sure to uncheck the geo-replicated box
Now we provision our cluster:
· Make a new provisioning and select HDInsight Cluster under the Data Services category
· Pick the storage account you created in the previous step
· Select a size for the cluster (small will do just fine) and a strong password
· Click Create
A note here: I've encountered an error when trying this step, where it fails on the second
operation. I just wait a couple of minutes, retry, and things work out fine.
I think this is because I provision the cluster too soon after enabling the
preview feature, but I'm not certain.
Logging onto the cluster
After the cluster is created you can find it in its own category in the Windows Azure Management
Portal under HDInsight. Select your cluster and click 'Manage Cluster'; this
will bring you into an HDInsight-specific portal.
Click the ‘Remote
Desktop’ tile which will open an *.rdp file to connect to your main node.
Running Hadoop commands
Once you log onto the main cluster node you should find a shortcut on the desktop named
'Hadoop Command Line'. Click it to start a Hadoop prompt, which greets you with
some high-level help.
Hadoop is the engine behind our machine learning example, so it makes sense to explore it a
bit. However, I won't go into details with it because my knowledge is limited
at this point.
The fundamental
functionality of Hadoop has to do with interacting with the file system. Keep
in mind that in the cluster you have a completely separate file system (HDFS)
which is not visible to Windows (it’s backed by the storage account you created
before provisioning the cluster). Hadoop exposes functionality to work with the
HDFS file system, which you invoke using:
hadoop fs -<command> <args>
File system
commands have Unix names (like ls, rm, cat, etc). So to upload a file from the
local disk to HDFS, you simply execute:
hadoop fs -put local_folder\local_file hdfs_folder/hdfs_file
This will upload a
local file (or folder) to HDFS and the data is now immediately available to the
compute nodes. For more info about file system commands check the documentation.
Notice: the path formats differ between local paths and HDFS paths. HDFS paths use the Unix
format, while local paths are of course in Windows format.
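Other file system commands follow the same pattern; for example, to list an HDFS folder or download a file back to the local disk (the paths here are just placeholders):
hadoop fs -ls some_hdfs_folder
hadoop fs -get some_hdfs_folder/some_file local_file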
Another thing to keep in mind is that for performance reasons the HDFS file system is immutable;
once data is placed on the file system you cannot edit it. If you
did something wrong, you need to delete the files and then recreate them.
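For example, to replace a file you have already uploaded, a delete followed by a fresh upload does the trick (again, the paths are placeholders):
hadoop fs -rm some_hdfs_folder/some_file
hadoop fs -put local_file some_hdfs_folder/some_file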
The last thing we need to know about Hadoop is that you run jobs using the 'jar' command,
which runs functionality exposed by Java packages in Hadoop. One example could
be to run a word counting job to count occurrences of words in a text file.
Create a new test file named mytextfile.txt and type something into it, then
run the following steps:
hadoop fs -put mytextfile.txt test/input.txt
hadoop jar hadoop-examples.jar wordcount test/input.txt output
hadoop fs -cat output/*
This will create a job that runs in the cluster, scanning 'mytextfile.txt' for all words and
returning a list of words and their respective number of occurrences in the file.
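As an illustration, assuming the test file contained something like 'hello a hello world', the output of the -cat command would look roughly like this, with each word followed by its count:
a       1
hello   2
world   1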
The way to
interpret the output is that the word ‘a’ occurred once, ‘hello’ occurred
twice, etc. The output of this command is a perfect example of a 'vector', where
each word is a dimension and the occurrence count is the value in that dimension.
As we shall see later, vectors are a recurring theme in Mahout: large data sets
are almost always represented as vectors and matrices.
This is of course
a trivial example and it is complete overkill to use a data analytics cluster
to run this job. However, the same principles apply to Big Data analysis; it
just takes a lot longer, and that's why it's nice to have a cluster helping you
out.
Machine Learning in Azure
Time to move on to
Machine Learning! The first step is to install Mahout onto the cluster. Go here, then choose the
mirror closest to you and open the 0.8 folder. Download 'mahout-distribution-0.8.zip',
copy it to the cluster's main node using RDP (Windows copy/paste over remote
desktop works), unzip it, and place the files under 'C:\apps\dist\mahout-0.8'.
Now that this is in place, we can start using Mahout for Machine Learning. The example I'm
going to run through here is a 'classification' model. This is a model
that takes incoming documents (think of documents in a general sense: a document could be an
email, a log file, a web request, etc.) and tries to predict something about each document. Now this
‘something’ is entirely up to us to decide upon. For emails the classic example
is to predict if an email is spam or not (we call that spam or ham). For
web-requests we could try to classify if the request is legitimate or a
denial-of-service (DoS) request sent from an attacker. Whatever it is we are
trying to predict, the general idea is that we must provide the algorithm with a
training data set in which we have told it for each document what the outcome
was (e.g. spam or ham), such that the algorithm can learn from this fact.
But what do we get
out of that exactly? One document on its own is not going to be very helpful,
but the premise of machine learning is that you feed the model tons of training data (this is where
machine learning ties into Big Data), and by churning through all this data,
the model will detect hidden patterns in the data that can be used to predict
the outcome of future documents (without knowing if they are actually spam or
not ahead of time).
So how does this
work? Let's try to break it down. For reference I have used link [6] at the
bottom to get started on this, but that introduction is based on Mahout 0.5,
which is outdated by now, so I would not suggest following it.
We need to perform the following steps to build a classification model:
1. Upload the input data that we want to classify to HDFS
2. Prepare the data by converting it to input vectors (more on that later)
3. Train the machine learning algorithm
4. Test the machine learning algorithm
Creating the data set
We need a data set to train on. You could try using your own, but I would suggest running through
this exercise with the sample provided by SpamAssassin, as I'm doing here.
On your local machine, create a new directory named 'dataset'. Download the following files: ham and spam, and unzip them
under 'dataset\ham' and 'dataset\spam' respectively (these are '*.tar.bz2'
files, so unfortunately you need a tool like WinZip or 7-Zip to extract them).
Now zip up the entire 'dataset' folder and copy it onto the
main cluster node (use RDP again). Remote into the main node, create a
'work' folder under 'c:\apps\dist\mahout-0.8\', and extract your data files
there.
Now we need to
upload the data set to HDFS. Start a ‘Hadoop Command Line’ prompt and enter:
cd c:\apps\dist\mahout-0.8\work
hadoop fs -put dataset dataset
This can take a
while. After it completes, your data is now stored in HDFS in a folder named
‘dataset’.
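If you want to sanity-check the upload, a listing should show the ham and spam subfolders we created earlier:
hadoop fs -ls dataset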
Converting the data set to input vectors
Now we need to
take all of these input text files and convert them to something Mahout’s
training algorithms can use – they need to be ‘vectorized’. In the ‘Hadoop
Command Line’ prompt execute the following commands:
SET MahoutDir=c:\apps\dist\mahout-0.8
hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seqdirectory -i /user/admin/dataset -o /user/admin/dataset-seq -ow
hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seq2sparse -i /user/admin/dataset-seq -o /user/admin/dataset-vectors -lnorm -nv -wt tfidf
This converts the input text files to vectors. So what happened here? We are telling
Hadoop to load the Mahout package called 'mahout-examples-0.8-job.jar' (this
holds the MapReduce jobs required for running the examples packaged with Mahout),
and then to use the MahoutDriver Java class, which sets up everything needed to
run Mahout commands (this is my interpretation, because I really don't know what
goes on behind the scenes). Finally we call two Mahout commands, seqdirectory and
seq2sparse, with parameters pointing to our data set: seqdirectory packs the
directory of text files into Hadoop SequenceFiles, and seq2sparse converts those
into sparse TF-IDF term vectors.
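If you are curious about what these two commands produced, you can list the output folder; among other things it should contain the tfidf-vectors folder that the next step reads from:
hadoop fs -ls /user/admin/dataset-vectors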
Now the next part is splitting the resulting vectors into two sets: one we use for training the
algorithm, the other for testing. We hide the testing vectors from the
model, and only after it has been fully trained do we let it try to classify the
testing vectors, without telling it what is ham or spam (don't worry, Mahout will
take care of this for us later on).
Splitting is done
using:
hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver split -i /user/admin/dataset-vectors/tfidf-vectors --trainingOutput /user/admin/train-vectors --testOutput /user/admin/test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
This splits 40%
into testing vectors and the remaining 60% into training vectors. This
concludes the preparation step.
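Before moving on, you can verify that both output folders exist:
hadoop fs -ls /user/admin/train-vectors
hadoop fs -ls /user/admin/test-vectors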
Training the Machine Learning model
This is probably the easiest step. To train a machine learning model using the Naïve Bayes algorithm,
run the following command:
hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver trainnb -i /user/admin/train-vectors -el -o /user/admin/model -li /user/admin/labelindex -ow
Testing the Machine Learning model
Also an easy step,
run:
hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver testnb -i /user/admin/test-vectors -m /user/admin/model -l /user/admin/labelindex -ow -o /user/admin/testing
After the job completes it will spit out the result of the test, indicating how accurate the
model was at predicting spam, along with details on the classification errors.
Closing remarks
There are a couple of gotchas when working with Hadoop/Mahout. One that I have found is that you
need to provide absolute paths (like /user/admin/dataset) for the input and
output data; otherwise you will run into problems where Mahout stores
intermediate files under the wrong user directory, because it is impersonating
another user (hdp).
I’ve also found
that if your input documents are very large in size (10+ MB), you can run into
’heap space’ exceptions. It should not be a problem with the sample I provided
here, but if you train on your own data, it could be an issue. I simply deleted
the input samples that were very large to avoid this problem.
Finally, the algorithm I've been using is Naïve Bayes, which is a very simple Machine
Learning algorithm that usually does not give high accuracy on predictions
(say 95+%). The field of Machine Learning has developed better algorithms, but unfortunately
Mahout is missing support for some very powerful ones like Support Vector
Machines (SVM), which are very popular in Machine Learning.
This example demonstrated how to apply Machine Learning to classify spam, but as I mentioned
previously you can apply it to any text documents simply by replacing the data
set with your own. As long as you can provide sufficient data samples and
accurately categorize the training data, you should be able to train a machine
learning model.
Links