Saturday, February 2, 2013

Case study: million songs dataset

A couple of days ago I wrote about the million songs dataset. Our man in London, Clive Cox from Rummble Labs, suggested we should implement rankings based on item similarity.

Thanks to Clive's suggestion, we now have an implementation of Fabio Aiolli's cost function, as explained in the paper: A Preliminary Study for a Recommender System for the Million Songs Dataset, which describes the winning method in this contest.

Below are detailed instructions on how to use the GraphChi CF toolkit on the million songs dataset, for computing user recommendations out of item similarities.

Instructions for computing item-to-item similarities:

1) To obtain the dataset, download and extract this zip file.

2) Run createTrain.sh to download the million songs dataset and convert it to a GraphChi-compatible format.
$ sh createTrain.sh
Note: this operation may take an hour or so to prepare the data.
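For reference, the GraphChi CF toolkit reads its input as a sparse matrix in Matrix Market format, so the prepared train file should look roughly like the snippet below (the header counts and the user/song/count triplets here are made-up placeholders, not the real dataset values):

%%MatrixMarket matrix coordinate real general
<number of users> <number of songs> <number of non-zero entries>
1 3 1
1 251 2
2 18 5

Each data line is a user id, a song id, and that user's play count (or rating) for the song.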

3) Run GraphChi item-based collaborative filtering to find the top 500 similar items for each item:

./toolkits/collaborative_filtering/itemcf --training=train --K=500 --asym_cosine_alpha=0.15 --distance=3 --min_allowed_intersection=5
Explanation: --training points to the training file. --K=500 means we compute the top 500 similar items for each item.
--asym_cosine_alpha=0.15 sets the asymmetry parameter of the similarity measure. --distance=3 selects Aiolli's metric. --min_allowed_intersection=5 means we take into account only item pairs that were rated together by at least 5 users.

Note: this operation requires around 20GB of memory and may take a few hours...
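To give some intuition about what --distance=3 computes, here is a small Python sketch of Aiolli's asymmetric cosine similarity as I understand it from the paper. This is an illustration only, not the actual GraphChi code; users_of is a hypothetical mapping from an item to the set of users who listened to it:

# Sketch of the asymmetric cosine similarity from Aiolli's paper
# (illustration only, not the GraphChi implementation).
# users_of: dict mapping item id -> set of user ids that listened to the item.
def asym_cosine(i, j, users_of, alpha=0.15, min_intersection=5):
    common = len(users_of[i] & users_of[j])
    if common < min_intersection:   # mirrors --min_allowed_intersection=5
        return 0.0
    # number of common users divided by |U(i)|^alpha * |U(j)|^(1-alpha);
    # alpha=0.5 reduces to the plain cosine similarity on binary data
    return common / (len(users_of[i]) ** alpha * len(users_of[j]) ** (1 - alpha))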

Create user recommendations based on item similarities:

1) Run itemsim2rating to compute recommendations based on item similarities
$ rm -fR train.* train-topk.*
$ ./toolkits/collaborative_filtering/itemsim2rating --training=train --similarity=train-topk --K=500 membudget_mb 50000 --nshards=1 --max_iter=2 --Q=3 --clean_cache=1
Note: this operation may require around 20GB of RAM and may take a couple of hours, depending on your computer configuration.

Output file is: train-rec
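Conceptually, itemsim2rating scores each candidate item for a user by summing, over the items the user already listened to, the similarity of the candidate to each of them raised to the power Q (the --Q=3 flag above), and then keeps the top-ranked candidates for each user. Here is a rough Python sketch with hypothetical data structures (sim holds, for every item, its top similar items and weights as computed by itemcf; profile is the set of items the user listened to):

# Rough sketch of turning item-item similarities into user recommendations
# (illustration only, not the itemsim2rating code).
# sim: dict mapping item id -> {similar item id: similarity weight}
# profile: set of item ids the user already listened to
def recommend(profile, sim, Q=3, topn=500):
    scores = {}
    for j in profile:
        for i, w in sim.get(j, {}).items():
            if i in profile:              # skip items the user already has
                continue
            scores[i] = scores.get(i, 0.0) + w ** Q
    return sorted(scores.items(), key=lambda kv: -kv[1])[:topn]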

Evaluating the result

1) Prepare test data:
./toolkits/parsers/topk --training=test --K=500

Output file is: test.ids

2) Prepare training recommendations: 
./toolkits/parsers/topk --training=train-rec --K=500

Output file is: train-rec.ids

3) Compute mean average precision @ 500:
./toolkits/collaborative_filtering/metric_eval --training=train-rec.ids --test=test.ids --K=500
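As a reminder of what this metric measures, mean average precision @ 500 averages, over all test users, the precision at each rank (up to 500) where a recommended item hits the user's held-out test items. A rough Python sketch for a single user (illustration only, not the metric_eval source):

# Average precision @ K for one user (illustration only, not metric_eval.cpp).
# recommended: list of item ids ordered by predicted score (best first)
# relevant:    set of the user's held-out test items
def average_precision_at_k(recommended, relevant, K=500):
    hits, sum_precision = 0, 0.0
    for rank, item in enumerate(recommended[:K], start=1):
        if item in relevant:
            hits += 1
            sum_precision += hits / rank   # precision at this rank
    denom = min(len(relevant), K)
    return sum_precision / denom if denom else 0.0

The reported AP@500 number is the mean of this quantity over the 100000 evaluated users.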

About performance: 

With the settings --min_allowed_intersection=5, K=500, Q=1, alpha=0.15 we get:
INFO:     metric_eval.cpp(eval_metrics:114): 7.48179 Finished evaluating 100000 instances. 
INFO:     metric_eval.cpp(eval_metrics:117): Computed AP@500 metric: 0.151431

With --min_allowed_intersection=1, K=2500, Q=1, alpha=0.15 we get:

INFO:     metric_eval.cpp(eval_metrics:114): 6.0811 Finished evaluating 100000 instances.
INFO:     metric_eval.cpp(eval_metrics:117): Computed AP@500 metric: 0.167994


Acknowledgements:

  • Clive Cox, RummbleLabs.com, for proposing to implement item-based recommendations in GraphChi, and for his support throughout the implementation of this method.
  • Fabio Aiolli, University of Padova, winner of the million songs dataset contest, for his great support regarding the implementation of his metric.
