Tuesday, October 30, 2012

Raspberry Pi

A super cool micro computer running Linux for 25$ :

For all the students looking for cool projects, why not implement some ML methods on top?

Today I met Tal Anker, who told me that Raspberry pi is for amateurs and the serious
guys use Stratus, which comes assembled with some additional features.

Thursday, October 25, 2012

Convex Belief Propagation

Recently an interesting paper named distributed convex BP was published in CVPR 2011. I asked Alex Schwing (ETH Zurich) to give a brief overview of convex BP and his paper.

What is convex belief propagation?
The convex variant of belief propagation, named "convex belief propagation," is a message passing algorithm which naturally extends a variety of approaches (loopy belief propagation, tree-reweighted message passing) by respectively choosing counting numbers. For the original loopy belief propagation algorithm those counting numbers are chosen such that the resulting distribution is exact for a graphical model corresponding to a tree. For loopy graphical models the true distribution is commonly approximated by using counting numbers that correspond to a tree-structured graphical model. It turns out that the resulting cost function which is optimized by the message passing algorithm is non-convex, and loopy belief propagation is not guaranteed to converge. Instead of approximating the true distribution with counting numbers originating from a tree-like model, "convex belief propagation" employs counting numbers that ensure a convex cost function. Hence there is a global optimum and the respective message passing algorithm is guaranteed to converge.
What are counting numbers?
Counting numbers specify how often entropy terms for different subsets of variables are counted in the entropy approximation used in the energy functional. 
If I need to read one paper about convex BP what should I read?
One of the first papers introducing convex belief propagation is T. Hazan and A. Shashua, "Convergent message-passing algorithms for inference over general graphs with convex free energy," UAI 2008. It is also explained within the related work section of the CVPR 2011 work on distributed convex belief propagation.

What are the applications for convex BP?
Convex belief propagation is applicable to any problem that can be addressed by loopy belief propagation. Its computational and memory complexity are identical to the loopy variant and recent applications that benefit from (distributed) convex belief propagation are:

  • M. Salzmann and R. Urtasun; "Beyond Feature Points: Structured Prediction for Monocular Non-rigid 3D Reconstruction" ECCV 2012 

  • K. Yamaguchi, T. Hazan, D. McAllester and R. Urtasun; "Continuous Markov Random Fields for Robust Stereo Estimation"; ECCV 2012
  • J. Yao, S. Fidler and R. Urtasun; "Describing the Scene as a Whole: Joint Object Detection, Scene Classification and Semantic Segmentation"; CVPR 2012
  • A.G. Schwing, T. Hazan, M. Pollefeys and R. Urtasun; "Efficient Structured Prediction for 3D Indoor Scene Understanding"; CVPR 2012

Efficient implementation for larger models
Distributed convex belief propagation (dcBP) is an inference algorithm that allows the user to partition the task (graph) and solve the sub-problems independently and in parallel on distributed memory compute environments. Efficiency is improved by partitioning the computation onto multiple cluster nodes/machines connected, e.g., by a standard LAN. Most importantly, convergence guarantees of convex belief propagation are maintained by occasionally exchanging information between different machines and efficiency is improved due to additional computation power. Within the paper we compare the distributed computation to a non-distributed version and we observe an improvement almost equivalent to the number of additional machines (as expected). We also show that it is important to reduce communication overhead between machines, even when working on a very common 4-connected grid-like graph. We also compare the algorithm to loopy belief propagation and other message passing implementations.

Wednesday, October 24, 2012

Deploying a GraphLab Cluster using MPI

Note: the MPI section of this toturial is based on this excellent tutorial.


  •  Mpi should be installed

Step 0: Install GraphLab on one of your cluster nodes. 

Using the instructions here on your master node (one of your cluster machines)

Step 1: start MPI

a) Create a file called .mpd.conf in your home directory (only once)
This file should contain a secret password using the format:

b) Verify that MPI is working by running the daemon (only once)
$ mpd &       # run the MPI daemon
$ mpdtrace   # lists all active MPI nodes
$ mpdallexit # kill MPI

c) Spawn the cluster. Create a file named machines with the list of machines you like to deploy.
$ mpd -f machines -n XX  # where XX is the number of MPI nodes you like to spawn.

d) Copy GraphLab files to all machines.
On the node you installed GraphLab on, run the following commands to copy GraphLab files to the rest of the machines:

d1) Verify you have the machines files from section 1c) in your root folder.

d2) Copy the GraphLab files
 cd ~/graphlabapi/release/toolkits;  ~/graphlabapi/scripts/mpirsync 
  cd ~/graphlabapi/deps/local; ~/graphlabapi/scripts/mpirsync 

Step 2: Run GraphLab ALS

This step runs ALS (alternating least squares) in a cluster using small netflix susbset.
It first downloads the data from the web: http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.train and http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.validate, and runs 5 alternating least squares iterations. After the run is completed, you can login into any of the nodes and view the output files in the folder ~/graphlabapi/release/toolkits/collaborative_filtering/ 
The algorithm operation is explained in detail here

cd /some/ns/folder/
mkdir smallnetflix 
cd smallnetflix/ 
wget http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.train 
wget http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.validate 

Now run GraphLab:

mpiexec -n 2 /path/to/als --graph /some/ns/folder/ --max_iter=3 --ncpus=1

Where -n is the number of MPI nodes, and --ncpus is the number of deployed cores on each MPI node.

Note: this section assumes you have a network storage (ns) folder where the input can be stored.
Alternatively, you can split the input into several disjoint files, and store the subsets on the cluster machines.

Note: Don't forget to change /path/to/als and /some/ns/folder to your actual folder path!

Sunday, October 21, 2012

Deploying a GraphLab cluster on EC2

I got the following instructions from my collaborator Jay (Haijie Gu)who spent some time learning Spark cluster deployment and adapted those useful scripts to be used in GraphLab.
This tutorial will help you spawn a GraphLab distributed cluster, run alternating least squares task, collect the results and shutdown the cluster.

Note: the latest version of this tutorial has moved to here: http://graphlab.org/tutorials-2/graphlab-on-ec2-cluster-quick-start/

Step 0: Requirements

1) You should have Amazon EC2 account eligible to run on us-east-1a zone.
2) Find out using the Amazon AWS console your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
3) Download your private/public key pair (called here graphlab.pem)
4) Download Graphlab 2.1 using the instructions here.

Step 1: Environment setup

Edit your .bashrc or .bash_profile (remember to source it after editing)
export AWS_ACCESS_KEY_ID=[ Your access key ]
export AWS_SECRET_ACCESS_KEY=[ Your access key secret ]

Step 2: Start the cluster

$ cd ~/graphlabapi/scripts/ec2 $ ./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey -z us-east-1a -s 1 launch launchtest
 (In the above command, we created a 2-node cluster in us-east-1a zone. -s is the number of slaves, and launch is the action, and launchtest is the name of the cluster) only once when starting the image.

Step 2.2: Start Hadoop (mandatory when using HDFS)

This operation is needed when you want to work with HDFS
$  ./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey start-hadoop launchtest

Step 3: Run alternating least squares demo

This step runs ALS (alternating least squares) in a cluster using small netflix susbset.
It first downloads the data from the web: http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.train and http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.validate, copy it into HDFS, and run 5 alternating least squares iterations:

./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey als_demo launchtest 

After the run is completed, you can login into the master node and view the output files in the folder ~/graphlabapi/release/toolkits/collaborative_filtering/ The algorithm and exact format is explained here.

Step 4: shutdown the cluster

$ ./gl-ec2 -i ~/.ssh/graphlab.pem -k grpahlabkey destroy launchtest

Advanced functionality:

Step 5: Login into the master node

$ ./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey login launchtest

Step 6: Manual building of GraphLab code

On the master:

 cd ~/graphlabapi/release/toolkits hg pull; hg update;
/* Sync the binary folder to slaves */ 
 cd ~/graphlabapi/release/toolkits;  ~/graphlabapi/scripts/mpirsync 

 /* Sync the local dependency folder to slaves */ cd ~/graphlabapi/deps/local; ~/graphlabapi/scripts/mpirsync

Manual run of ALS demo

 Login into the master node
cd graphlabapi/release/toolkits/collaborative_filtering/ 
mkdir smallnetflix 
cd smallnetflix/ 
wget http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.train 
wget http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.validate 
cd .. 
hadoop fs -copyFromLocal smallnetflix/ / 
mpiexec -n 2 ./als --matrix hdfs://`hostname`/smallnetflix --max_iter=3 --ncpus=1


Known Errors:
Starting the dfs: namenode running as process 1302. Stop it first. 
localhost: datanode running as process 1435. Stop it first. 
ip-10-4-51-142: secondarynamenode running as process 1568. Stop it first. 
Starting map reduce: jobtracker running as process 1647. Stop it first. 
localhost: tasktracker running as process 1774. Stop it first. 

 Solution: Kill hadoop and restart it again using the commands:

./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey stop-hadoop launchtest

./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey start-hadoop launchtest

12/10/20 13:37:18 INFO ipc.Client: Retrying connect to server: domU-12-31-39-16-86-CC/ Already tried 0 time(s).

Solution: run jps to verify that one of the Hadoop nodes failed.

./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey jps launchtest

> jps
1669 TaskTracker
2087 Jps
1464 SecondaryNameNode
1329 DataNode
1542 JobTracker
In the above example, NameNode is missing (not running).  Stop hadoop execution using stop-hadoop command line.

mpiexec was unable to launch the specified application as it could not access
or execute an executable:

Executable: /home/ubuntu/graphlabapi/release/toolkits/graph_analytics/pagerank
Node: domU-12-31-39-0E-C8-D2

while attempting to start process rank 0.

Executable is missing. Run update:


# A fatal error has been detected by the Java Runtime Environment:
#  SIGILL (0x4) at pc=0x000000000056c0be, pid=1638, tid=140316305243104
# JRE version: 6.0_26-b03
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [als+0x16c0be]  graphlab::distributed_ingress_base<vertex_data, edge_data>::finalize()+0xe0e
# An error report file with more information is saved as:
# /home/ubuntu/graphlabapi/release/toolkits/collaborative_filtering/hs_err_pid1638.log
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.

1) Update the code:

$./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey update launchtest

2) If the problem still persists submit a bug report to GraphLab users list.

bickson@thrust:~/graphlab2.1/graphlabapi/scripts/ec2$ ./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey login launchtest
ERROR: The environment variable AWS_ACCESS_KEY_ID must be set

Need to set environment variables, as explained in step 1.

Friday, October 19, 2012

Spotlight: Quid.com

Yesterday I stumbled upon this company: quid.com
Like quantified.com it is a company initiated by a few physicists who like to analyze data and to find some structure in the data. When looking at their website I found this interesting talk: "Sean Gourley on the mathematics of war" from TED: At first it looked like a very exciting intuition. But in a second thought the power low structure is not that surprising. For example, recall all the times you ate in a restaurant in the last 5 years. Typically, you eat in a small group, maybe yourself or with your wife. Occasionally you take out a few friends. Quite rarely you go out with 15 people out for dinner, maybe for your grandfather birthday or something. So it is quite clear that most times you go in small groups and a few times in a large group. With this intuition will you be able to get funding from the pentagon? Probably not..

Monday, October 15, 2012


Yesterday I learned from my collaborator Joey Gonzalez about InfiniteGraph software from Objectivity. It is both a distributed graph database and an analytics software on top of graphs.

I quickly looked through their website and found this hilarious video:

It seems InifiniteGraph has a great popularity in the defense/security sectors.
While the basic barebones version is free, I hear that the full features are rather expensive.

Here is a video with a more comprehensive technical description of the architecture:

Machine Learning PostDoc Positions in Europe

I got this from my collaborator Erik Eurell. I love the Nordic countries
this could be a great experience for anyone who wants to relocate for a year or two:

Call for Applications: Several Postdoc / Research Associate positions in
the Finnish Center of Excellence in Computational Inference COIN

Call deadline 1 November 2012, 3:00 pm EET

Keywords: Machine learning, computational statistics, Bayesian statistics,
information-theoretic learning, satisfiability checking, proactive
interfaces, computational biology

The Finnish Center of Excellence in Computational Inference Research
(COIN, Aalto University, University of Helsinki, Helsinki Institute for
Information Technology HIIT) announces several postdoc or research
associate positions in Computational Inference. The objective of COIN,
started on 1 January 2012, is to push the boundaries of inference in the
data-rich world of today and tomorrow. The COIN consortium includes the
creators of world standards and software such as Independent Component

Six postdocs / research associates were hired at the start of COIN, and we
now seek several additional new recruitments to complete our team.
Successful candidates will work on fundamental questions of inference and
in applications in Intelligent Information Access, Computational Molecular
Biology and Medicine, Computational History, Computational Climate,
Computational Neuroscience and other directions yet to be determined.
Applicants with suitable background in machine learning, mathematics,
statistics, computational logic, combinatorial optimization or statistical
physics are encouraged to apply.

COIN is a joint research center of the Aalto University School of Science,
Department of Information and Computer Science, the University of
Helsinki, Department of Computer Science and Department of Mathematics and
Statistics, and Helsinki Institute for Information Technology HIIT. COIN
studies the following research topics:

C1: Learning models from massive data

C2: Learning from multiple data sources

C3: Statistical inference in highly structured stochastic models

C4: Extremely fast inference

F1: Intelligent information access

F2: Computational molecular biology and medicine

In the four core methodological challenges C1–C4 we develop methods for
transforming the data produced in the current data revolution into useful
information. In COIN F1 Intelligent Information Access flagship the
challenge is to make use of massive interrelated information sources,
whether in everyday life or in science, and select what information to
present to the user. The inference needs to be done on-line, learning
relevance from the user’s responses. In COIN F2 Computational Molecular
Biology and Medicine flagship we develop methods for maximally utilizing
the novel measurement databases and structured stochastic models in making
data-driven biology cumulative. We especially welcome applications from
researchers who have solid knowledge and research background in one or
more of the four core methodological challenges C1–C4 and experience and
vision on applying those methodologies on either one of the two flagship
applications F1 and F2 or the other areas referred to above or of interest
to COIN.

The duration of the fixed term contract period is three years at most,
starting on 1 December 2012 at the earliest. In addition to personal
research work, the postdoc or research associate is expected to
participate in specialist teaching and student supervision in the
indicated areas of research, in collaboration with other COIN members.
Good command of English is a necessary prerequisite. The salary level for
beginning postdoctoral researchers in COIN is typically between 3200 and
3600 €/month, depending on experience and qualifications. The contract
includes occupational health benefits and Finland has a comprehensive
social security system.

Application procedure

The positions will remain open until filled. Deadline for the call is 1
November 2012, 3 PM EET. Please send an email to registry@aalto.fi,
indicating that you are applying for a postdoc position at COIN,
containing in one pdf file the following information:

• Your contact information

• Names and contact information of two senior academics available for

• A research statement of at most five pages outlining planned work and
connections to the four core methodological challenges C1–C4 and the two
flagship applications F1–F2.

• Curriculum vitæ.

• List of publications, with pointers to openly available online versions
of at most three of the most relevant publications.

• Degree certificate of the PhD degree, including a transcript of the
doctoral studies. In case the doctoral degree is still pending, an
up-to-date study transcript and a plan for completion of the degree should
be provided.

In addition to the application sent to Aalto registry, please fill in a
separate application form at

Candidates should also arrange for reference letters from the two
indicated senior academics to be sent separately to HR Coordinator, Mr.
Stefan Ehrstedt, e-mail firstname.lastname@aalto.fi by the deadline of 15
November 2012. Shortlisted candidates may be invited for an interview. In
the review process, particular emphasis is put on the quality of the
candidate’s previous research and international experience, together with
the substance, innovativeness, and feasibility of the research plan, and
its relevance to COIN’s mission. The candidate must have completed his or
her PhD degree before the start of the contract period, and efficient and
successful completion of studies is considered an additional merit.

Further information

For further information, please visit http://research.ics.aalto.fi/coin
and contact:

• HR Coordinator, Mr. Stefan Ehrstedt, e-mail firstname.lastname@aalto.fi
(application process)

• Director of COIN, Prof. Erkki Oja, e-mail firstname.lastname@aalto.fi

• Deputy Director of COIN, Prof. Samuel Kaski, e-mail

• Prof. Erik Aurell, e-mail firstname.lastname@aalto.fi

• Prof. Jukka Corander, e-mail firstname.lastname@helsinki.fi

• Dr. Jorma Laaksonen, e-mail firstname.lastname@aalto.fi

• Prof. Petri Myllymäki, e-mail firstname.lastname@cs.helsinki.fi

• Prof. Ilkka Niemelä, e-mail firstname.lastname@aalto.fi

About Aalto University

Aalto University is a new university created in 2010 from the merger of
the Helsinki University of Technology TKK, the Helsinki School of
Economics, and the University of Art and Design Helsinki. The University’s
cornerstones are its strengths in education and research, with 20,000
basic degree and graduate students, and a staff of 4,500 of whom 350 are
professors. For further information, see http://www.aalto.fi/en.

About University of Helsinki

The University of Helsinki is the most versatile institution of science,
education, and intellectual renewal in Finland, a pioneering builder of
the future. It operates to actively promote the wellbeing of humanity and
a just society. The University of Helsinki is one of the best
multidisciplinary research universities in the world. The high-quality
research carried out by the university creates new knowledge for educating
diverse specialists in various fields, and for utilisation in social
decision-making and the business sector. For further information, see

Additional postdoc position I got from Florent Krzakala, my former collaborator in the Evergrow project:

I would like to invite applications for two postdoctoral positionsfunded by the European Research Council Starting Grant program, in thecontext of the project SPARCS (Statistical Physics Approach toReconstruction in Compressed Sensing) in my group in ESPCI in Paris.The appointments are intended to start in the fall of 2013 (or sooner)and will be for 2 years with a possibility of extension on 3 years.
The candidates can come from different areas (Statistical Physics,Signal Processing, Applied Mathematics, Error correcting codes,Information Theory, Inference and Machine learning) and are expectedto bring their expertise. Successful candidates will thus conduct a vigorous research program within the scope of the project, and areexpected to show independence and team working attitude at the sametime.
For more information, please visit the post-doc announcement page:http://www.pct.espci.fr/~florent/postdoc.html 

Friday, October 12, 2012

Spotlight: Trifacta

The all connected Prof. Joe Hellerstein from Berkeley does not rest for a minute. He has now a new startup company called Trifacta.  What they do is very secret. But as hint you can look at my related older blog post.

This week it was announced Trifacta raised 4.3M $:

Trifacta further has one for the most impressive management and advisory boards out there. I am looking forward to hearing more about the company soon. By the way, Trifacta is hiring. If you apply tell Joe I sent you.. :-)

Tuesday, October 9, 2012

Misc Updates

How big is Facebook data? I got this update from my collaborator Aapo Kyrola:

This morning, there are more than one billion people using Facebook actively each month....

Facebook has also shared a number of key metrics with users along with the announcement, including 1.13 trillion Likes since its 2009 launch (note that this is actually probably higher, since the official press document contained a note accidentally left in from an editor about rolling back the number because of info shared previously with Businessweek), 140.3 billion friend connections, 219 billion photos uploaded, 17 billion location-tagged posts and 62.6 million songs played some 22 billion times. http://techcrunch.com/2012/10/04/facebook-tops-1-billion-monthly-users-ceo-mark-zuckerberg-shares-a-personal-note/

I got the following 10 patterns for research in Machine learning from Tianqi Chen. The list is by John Langford in his blog:
  1. Separate code from data.
  2. Separate input data, working data and output data.
  3. Save everything to disk frequently.
  4. Separate options from parameters.
  5. Do not use global variables.
  6. Record the options used to generate each run of the algorithm.
  7. Make it easy to sweep options.
  8. Make it easy to execute only portions of the code.
  9. Use checkpointing.
  10. Write demos and tests.
 Following John's good practice, Tianqi used some of those ideas for competing in KDD CUP. And here is a summary of his experience. Specifically, Tianqi uses Makefiles for managing multiple and complex execution scripts.

Wednesday, October 3, 2012

The 10 recommender system metrics you should know about

I got the following interesting email from Denis Parra, a PhD student @ University of Pittsburgh:

Following the post on evaluation metrics in your blog, we would be glad to help you testing new evaluation metrics for GraphChi. Not long ago (this year, actually), with Sherry we wrote a book Chapter on recommender systems focusing on sources of knowledge and evaluation metrics. In section 7.4 we explain some of these evaluation metrics.

For instance, we describe a metric that has become to be popular for evaluating recommendations based on implicit feedback  called MPR (Mean Percentile Ranking) that some authors call Percentile Ranking. This is the method used by Hu et al. in "Collaborative filtering for implicit feedback datasets" (2008) and by Quercia et al. in "Recommending social events from mobile phone location data" (2010) 
PS: In case you want to cite the book chapter, you can use
  chapter = {7},
  title   = {Recommender Systems: Sources of Knowledge and Evaluation Metrics},
  editor  = { J.D. Vel{\' a}squez et al. (Eds.)},
  author  = {Parra, D. and Sahebi, S. },
  booktitle = {Advanced Techniques in Web Intelligence-2: Web User Browsing Behaviour and Preference Analysis},
  publisher = {Springer-Verlag},
  address   = {Berlin Heidelberg},
  pages   = {149–-175},
  year    = {2013}

I think this book chapter is going to become highly useful overview for anyone who is working on recommender system. As a "teaser" until the book comes out, I asked Denis to shortly summarize the different metrics by giving a reference to each one. The book itself will include much detailed explanation of each metrics and its usage.

Denis was very kind to provide me the following list to share in my blog:

Though many of these metrics are described in the seminal paper "Evaluating collaborative filtering recommender systems" by Herlocker et al.,
this is a subset of an updated list of metrics used to evaluate Recommender Systems in the latest years:

For rating

MAE (Mean Absolute Error)

Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In: 14th Conference on Uncertainty in Artificial Intelligence, pp. 43–52 (1998)

MSE (Mean Squared Error)

Shardanand, U., Maes, P.: Social information filtering: algorithms for automating word of mouth. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI 1995, pp. 210–217. ACM Press/Addison-Wesley Publishing Co., New York (1995)

RMSE (Root mean Squared Error)

Bennett, J., Lanning, S., Netflix, N.: The netflix prize. In: KDD Cup and Workshop in Conjunction with KDD (2007)

Evaluating lists of recommendation (based on relevancy levels)


Le, Q. V. & Smola, A. J. (2007), 'Direct Optimization of Ranking Measures', CoRR abs/0704.3359


Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems (RecSys '10)

MAP: Mean Average Precision

Manning, C.D., Raghavan, P., Schtze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)

nDCG: normalized Discounted Cummulative Gain

J¨arvelin, K., Kek¨al¨ainen, J.: Cumulated gain-based evaluation of ir techniques. ACM Trans. Inf. Syst. 20, 422–446 (2002)


Intra-list Similarity

Ziegler, C.-N., McNee, S.M., Konstan, J.A., Lausen, G.: Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, WWW 2005, pp. 22–32. ACM, New York (2005)

Lathia's Diversity

Lathia, N., Hailes, S., Capra, L., Amatriain, X.: Temporal diversity in recommender systems. In: Proceeding of the 33rd International ACMSIGIR Conference on Research and Development in Information Retrieval, SIGIR 2010, pp. 210–217. ACM, New York (2010)

Implicit Feedback

Mean Percentage Ranking

Hu, Y., Koren, Y., Volinsky, C.: Collaborative filtering for implicit feedback datasets. In: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pp. 263–272. IEEE Computer Society, Washington, DC (2008)

User-Centric Evaluation Frameworks

Knijnenburg, B.P., Willemsen, M.C., Kobsa, A.: A pragmatic procedure to support the user-centric evaluation of recommender systems. In: Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys 2011, pp. 321–324. ACM, New York (2011) Pu, P., Chen, L., Hu, R.: A user-centric evaluation framework for recommender systems. In: Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys 2011, pp. 157–164. ACM, New York (2011)

Interesting large scale dataset: D4D mobile data

I got the following from Prof. Scott Kirkpatrick.

Write a 250-words research project and get access within a week to the largest
ever released mobile phone datasets: datasets based on 2.5 billion records,
calls and text messages exchanged between 5 million anonymous users over 5
Participation rules: http://www.d4d.orange.com/ 
Description of the datasets: http://arxiv.org/abs/1210.0137 
The "Terms and Conditions" by Orange allows the publication of results
obtained from the datasets even if they do not directly relate to the
Cash prizes for winning participants and an invitation to present the results
at the NetMob conference be held in May 2-3, 2013 at the Medialab at MIT
Deadline: October 31, 2012

Some more information about this dataset:
The data collection took place in Cote d’Ivoire over a five-month period, from December 2011 to April 2012. The original dataset contains 2.5 billion records, calls and text messages exchanged between 5 million users. The customer identifier was anonymized by Orange Cote d’Ivoire. All subsequent data processing was completed by Orange Labs in Paris.
We will release four datasets in order to offer a large spectrum of possible analyses:
  • Aggregate communication between cell towers;
  • Mobility traces: fine resolution dataset;
  • Mobility traces: coarse resolution dataset;
  • Communication sub-graphs.

Tuesday, October 2, 2012

Item based similarity with GraphChi

Item based collaborative filtering is one of the most popular collaborative filtering methods used in more than 70% of the companies I am talking to.  Following my mega collaborator Justin Yan's advice, I have started to implement some item based similarity methods in GraphChi.

Item based methods compare all pairs of items together, for computing similarity metric between each pair of items. This task becomes very quickly computation intensive. For example, Netflix data has around 17K movies. If we want to compute all pairs of movies to find the most similar ones, we need to compare around 290M pairs!

If we use a symmetric similarity measure, the distance between movie A and B, is similar to the distance between movie B and A. Thus for the Netflix example we have around 145M pairs to compute. To reduce the work furthermore, we only compare movies which where watched together by at least X users, for example X=5. Otherwise, those movies are not considered similar.

When the dataset is big, it is not possible to load it fully into memory at a single machine. That is where GraphChi comes in. My preliminary implementation of the item similarity computes similarity between all pairs without fully reading the dataset into memory. The idea is to load a chunk of the items (called pivots) into memory, and then stream over the rest of the items by comparing the pivots to the rest of the items.

The simplest distance metric I have implemented is Jaccard distance. The distance of items i and j is computed as follows:
         wi = number of users who watched movie i
         wj = number of users who watched movie j
         wij = number of users who watched both movie i and movie j
         Dij =  wij / ( wi + wj - wij )

It is clear that Dij is a number between zero and one.
Additional distance functions are found here.

As always, I am looking for brave beta testers who want to try it out!
For full Netflix data, it takes around 1200 seconds to compute distances of around 130M item pairs. (Around 1000 item pairs in a second, using a 24 core machine, with only 800MB memory used).

Ping me if you are interested in other distance metric so I could add them as well.

How to try it out:

1) Follow GraphChi installation instructions steps 1-4.
2) Download smallnetflix_mm
3) Run itemcf on smallnetflix data using:

bickson@thrust:~/graphchi$ > ./toolkits/collaborative_filtering/itemcf --training=smallnetflix_mm --nshards=1 --clean_cache=1 --K=10 --quiet=1
WARNING:  common.hpp(print_copyright:183): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com
[training] => [smallnetflix_mm]
[nshards] => [1]
[clean_cache] => [1]
[K] => [10]
[quiet] => [1]

 === REPORT FOR sharder() ===
edata_flush: 0.005163s (count: 13, min: 0.000335s, max: 0.000534, avg: 0.000397154s)
execute_sharding: 0.273818 s
finish_shard.sort: 0.065508 s
preprocessing: 1.05835 s
shard_final: 0.15078 s
app: sharder
Total item pairs compared: 6325948 total written to file: 35551 pairs with zero distance: 124609
Sorting output file smallnetflix_mm.out0
Sorting output file smallnetflix_mm.out1
Sorting output file smallnetflix_mm.out2
Sorting output file smallnetflix_mm.out3
Sorting output file smallnetflix_mm.out4
Sorting output file smallnetflix_mm.out5
Sorting output file smallnetflix_mm.out6
Sorting output file smallnetflix_mm.out7
Merging sorted files:
File written: smallnetflix_mm-topk
Total lines: 35551

Now let's examine the output file

bickson@thrust:~/graphchi$ 3561 3076 0.23296110332
3561 1990 0.270156830549
3561 1794 0.218252897263
3561 1788 0.227363973856
3561 1598 0.211793571711
3561 1536 0.233875498176
3561 1507 0.224534928799
3561 557 0.237991556525
3561 411 0.22253626585
3561 338 0.237756416202

For each item we get exactly K=10 similar items. (The item are sorted in reverse order).
The format is rather simple. In each row we have:
<item A> <item B> <similarity>

Command line arguments
--training - input file name in sparse matrix market format
--min_allowed_intersection=XX - filter out item pairs with less than XX users who rated them  jointly.
--quiet=1 run with less verbose traces
--nshards=1 ## mandatory argument
FOR itemcf:  --distance=XX, 0 = Jaccard index, 1=AA, 2=RA, 3=Aiolli

Useful GraphChi command line arguments:
execthreads XX - set the number of execution threads
membudget_mb XX - fix the size of used memory to XX mb

Additional metrics:

Adamic/Adar (AA) (itemcf):

Just got an email from Timmy Wilson, our man in Ohio: 
"There are a lot of similarity functions out there -- Adamic / Adar is another good one "

I looked it up an indeed it is a very simple cost function: 
Distance_item_i_j = sum over users which rated both items i j (1 / log (user_rating) )
see equation (2) in the paper: http://arxiv.org/abs/0907.1728 "Role of Weak Ties in Link Prediction of Complex Networks". The equation gives larger weight to users which rated both items but have relatively small number of rated items.

I have also added the RA cost function, which is same like AA but without the log.

Pearson Correlation (itemcf)

I got a request from additional reader "silly snail" to implement Pearson Correlation as well. Let m be a vector of size M (users) which hold the user rating mean.
Let a be a sparse vector holding user rating for item a, and the same for item b.

Pearson_correlation_i_j = ((a - mean)' * (b-mean)) / (stddev(a) * stddev(b))

And stddev(a) = sum((a-mean).^2) / (n - 1) 
where n = length(a).

To run pearson correlation, use the program ./toolkits/collaborative_filtering/preason

A toy example:

Prepare the file "stam" with the content:
%%MatrixMarket matrix coordinate real general
3 4 9
1 1 3
1 2 4
2 3 1
2 4 1
2 1 2
3 1 1
3 2 2
3 3 3
3 4 4

Namely there are 3 users, 4 items, and 9 ratings. 
Now run pearson correlation:
./toolkits/collaborative_filtering/pearson --training=stam execthreads 1 

Lets examine the output file:
head stam.out0
2 1 0.337405

Namely, the pearson correlation between items 1 and 2 is 0.337405.
Now let's do the same computation in matlab:

>> a=full(mmread('stam')); % load the matrix into memory
a =

     3     4     0     0
     2     0     1     1
     1     2     3     4
>> means=mean(a') % compute the item vectors mean

means =

    1.7500    1.0000    2.5000

>> A=a(:,1)-means'  % compute item 1 vector minus mean

A =


>> B=a(:,2)-means'  % compute item 2 vector minus mean

B =

>> dist = A'*B  % multiply the result

ans =

>> stddeva=sum(A.^2)/2 % compute stddev item 1

stddeva =


>> stddevb=sum(B.^2/2)% compute stddev item 2

stddevb =

>> dist/(stddeva*stddevb) % compute pearson correlation

ans =


Luckily we got the same result! namely pearson correlation between the two first items is 0.3374.

Cosine Distance (itemcf)

Manhattan Distance (itemcf2)

See http://en.wikipedia.org/wiki/Taxicab_geometry

Log Similarity Distance (itemcf2)

Chebychev Distance (itemcf2)


Tanimoto Distance (itemcf2)

See http://en.wikipedia.org/wiki/Jaccard_index

Aiolli's method (itemcf)

See the paper: F. Aiolli, A Preliminary Study on a Recommender System for the Million Songs Dataset Challenge Preference Learning: Problems and Applications in AI (PL-12), ECAI-12 Workshop, Montpellier. pdf

Slope One Recommender (itemcf2)

Is described in A Programmer's Guide to Data Mining , page 18.

How to compute recommendations out of similarities?

For computing top K recommendations for each user, you can use the itemsim2ratings utility
as explained here. The input file for itemsim2rating is the similarity matrix computed by itemcf.

Additional reading:

Matrix factorization based collaborative filtering with GraphChi.
Case study: million songs dataset