How to convert the KDD Cup dataset (track1) to GraphLab format

The instructions are on Danny Bickson’s blog:

http://bickson.blogspot.com/2011/04/yahoo-kdd-cup-using-graphlab.html

Download the KDD Cup Yahoo! Music dataset (track1) from http://kddcup.yahoo.com

 

Method 1: Using Matlab
First install Matlab or QtOctave on Ubuntu 11.04.
1. Download the file save_c_gl4a.m to your local directory:

$ wget http://www.graphlab.ml.cmu.edu/save_c_gl4a.m

2. Use the Matlab script (dataconvert.m) to convert the text dataset to binary GraphLab format:
Note that you need to run the script three times: with runmode=1 (training data), runmode=2 (validation data), and runmode=3 (test data). You also need to set your dataset directory in the script.

 

Method 2: Using Python (scripts written by Sanmi Koyejo)

1. Convert the dataset with readKdddata_1.py:

$ python readKdddata_1.py -i trainIdx1.txt -o kddcup
$ python readKdddata_1.py -i validationIdx1.txt -o kddcupe
$ python readKdddata_1.py -i testIdx1.txt -o kddcupt -t

 

2. Convert the dataset with readKddData_2.py (uses less memory and runs faster):

$ python readKddData_2.py -i trainIdx1.txt -o kddcup -f 1
$ python readKddData_2.py -i validationIdx1.txt -o kddcupe -f 2
$ python readKddData_2.py -i testIdx1.txt -o kddcupt -f 3

The warning “Warning: input parameters differ from output parameters, graphlab pmf may not run correctly !!!” can be ignored. The validation set and test set do not contain every item.
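The three readKddData_2.py runs above can also be driven from one small Python script. This is only a sketch: the helper names (JOBS, build_command, convert_all) are made up for illustration, and it assumes that `python` on your PATH is an interpreter the conversion script supports and that the script and the input files are in the current directory. The flags -i, -o and -f are the ones used in the commands above.

```python
import subprocess

# (input file, output file, value passed to -f), copied from the commands above
JOBS = [
    ("trainIdx1.txt", "kddcup", 1),
    ("validationIdx1.txt", "kddcupe", 2),
    ("testIdx1.txt", "kddcupt", 3),
]


def build_command(infile, outfile, mode,
                  script="readKddData_2.py", python="python"):
    """Return the argument list for one conversion run."""
    return [python, script, "-i", infile, "-o", outfile, "-f", str(mode)]


def convert_all(jobs=JOBS):
    """Run each conversion in turn and stop at the first failure."""
    for infile, outfile, mode in jobs:
        # check=True raises CalledProcessError if the script exits non-zero
        subprocess.run(build_command(infile, outfile, mode), check=True)
```

Calling convert_all() runs the training, validation and test conversions one after another, so a long run can be left unattended.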

$ ls -l kddcup*
-rw-r--r-- 1 sil sil 4044804416 2011-06-27 18:18 kddcup
-rw-r--r-- 1 sil sil   64063376 2011-06-28 12:51 kddcupe
-rw-r--r-- 1 sil sil   96095056 2011-06-28 16:22 kddcupt
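A quick first check is to compare the file sizes with the listing above. The sketch below does that in Python; the function name size_mismatches is hypothetical, and the expected byte counts are simply copied from the ls output above. It assumes the three output files are in the directory you pass in.

```python
import os

# Expected sizes in bytes, copied from the ls -l listing above
EXPECTED_SIZES = {
    "kddcup": 4044804416,
    "kddcupe": 64063376,
    "kddcupt": 96095056,
}


def size_mismatches(directory=".", expected=EXPECTED_SIZES):
    """Return {name: (expected, actual)} for files whose size differs.

    actual is None when the file does not exist.
    """
    bad = {}
    for name, want in expected.items():
        path = os.path.join(directory, name)
        got = os.path.getsize(path) if os.path.exists(path) else None
        if got != want:
            bad[name] = (want, got)
    return bad
```

An empty result means all three sizes match. A matching size does not prove the contents are right, so it is still worth checking the MD5 sums.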

 

3. Check the MD5 sums to verify that the input files were created correctly.

$ md5sum kddcupe
aa76bb1d0e6e897e270ed65d021ed1d8  kddcupe
$ md5sum kddcupt
917599ce7f715890a2705dc04851ac12  kddcupt
$ md5sum kddcup
345b168a208757b3098c6674b2fb653a  kddcup

If you get different output, check carefully that the command-line arguments match the ones given above.
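If md5sum is not available, the same check can be sketched in Python with the standard hashlib module. The function names md5_of_file and verify are made up for illustration, and the expected sums are the ones listed above. The file is read in chunks because kddcup is about 4 GB and may not fit in memory.

```python
import hashlib

# Expected MD5 sums, copied from the md5sum output above
EXPECTED_MD5 = {
    "kddcupe": "aa76bb1d0e6e897e270ed65d021ed1d8",
    "kddcupt": "917599ce7f715890a2705dc04851ac12",
    "kddcup": "345b168a208757b3098c6674b2fb653a",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(expected=EXPECTED_MD5):
    """Return {file name: True/False} for files in the current directory."""
    return {name: md5_of_file(name) == want for name, want in expected.items()}
```

Running verify() in the output directory should return True for all three files if the conversion matched the one described here.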

 

The Matlab script takes much longer than the Python scripts. Converting validationIdx1.txt and testIdx1.txt takes 5-6 hours with the Matlab script, while Python needs less than 20 minutes. Even with Python, trainIdx1.txt may take a couple of hours, depending on your machine (for me it took 4 hours).
