Tuesday, November 29, 2011

Oracle Labs Report on GraphLab


Recently we had some interesting discussions with Eric Sedlar, Technical Director of Oracle Labs. Oracle examined GraphLab as a potential framework for implementing graph algorithms and assigned Sungpack Hong, a researcher at Oracle Labs, to write a tech report on the subject.

A few days ago I got a copy of this seven-page tech report, which is called "A brief report on Graphlab and Green-Marl". I asked Eric and Sungpack for permission to discuss their main findings here. They may publish the full tech report at a later date.

According to Sungpack, there are four characteristics of the algorithms that GraphLab supports:

  1. Algorithms can be described in a node-centric way; the same computation is repeatedly performed on every node.
  2. A significant amount of computation is performed at each node.
  3. The underlying graphs have large diameter.
  4. Determinism can be neglected.
Here is my reaction to those four points:
  1. Agreed.
  2. Agreed. 
  3. This is not necessarily correct - you can use GraphLab on any graph as long as it is sparse.
  4. Actually, we support round-robin scheduling, so it is quite easy to have full sweeps over all nodes. Many of the matrix factorization algorithms we implemented fall into this category - they are completely deterministic (aside from perhaps some random initialization of the starting state).
Next, some issues are discussed:
  1. Programmability: the user must restructure the algorithm in a node-centric way.
  2. There is runtime-system overhead when the amount of computation performed at each node is small.
  3. Small-world graphs: GraphLab's locking scheme may suffer from frequent conflicts on such graphs.
  4. Determinism: the result of an algorithm computed in GraphLab may be non-deterministic.
  5. Concerns about distributed execution.
Here are my thoughts on those points:
  1. This is a valid concern; not every algorithm is suitable for programming in GraphLab.
  2. Agreed.
  3. Agreed. In GraphLab v2, Joey is heading a significant improvement that will allow factoring the update function into several different computations to deal with this issue.
  4. Objection! As shown, for example, by our participation in the ACM KDD CUP contest this year, we obtained very high quality prediction results that were completely deterministic. It is possible to implement a wide variety of deterministic algorithms in GraphLab.
  5. Agreed. Yucheng is heading the effort to completely rewrite the distributed engine, and we believe significant performance improvements are possible in the distributed setting.
Finally, the following conclusion is given:
... I believe GraphLab still is a valuable back-end for Green-Marl (DB: their programming language) ... it is very natural to use GraphLab as one of Green-Marl's back-ends... GraphLab and Green-Marl can be helpful to each other... Finally it will be interesting to wait ... for GraphLab2. A fair comparison of distributed graph processing ... would be meaningful then.

Overall, I want to thank Eric and Sungpack for the serious effort they put into evaluating GraphLab! While I am of course biased, it is important to get unbiased evaluations from third parties.

Spotlight: Dmitriy Golovashkin - Oracle and the R project


A few days ago I got the following note from Dmitriy, a principal staff member at Oracle:

Hi Danny,

I found out about GraphLab just two days ago.

I was working on a MapReduce-based QR factorization and, whilst searching the web for references, found your blog & GraphLab. No question, I am planning to learn more about the project. Looks very exciting!

In general, our group is focusing on in-database and Hadoop based
data mining and statistical algorithms development.
R http://www.r-project.org/ is a big part of it.

Kind regards,
DG

As always, I am absolutely thrilled to get feedback from my readers! I asked Dmitriy if he could share some more insight about the R project, and here is what he wrote:

R is huge in data mining and statistical camps.
The number of contributed packages is staggering; it is amongst the most complete and feature-rich environments for statistical and data mining computing.
Another very important observation concerns the quality of some of the contributed packages: outstanding work & implementation.

The biggest problem with R has to do with its inherent data storage model: everything must be stored in memory and most algorithms are sequential.
For instance, the notion of a matrix in R is captured in the following C one-liner:
double* a = (double*) malloc(sizeof(double) * nElements);

It is possible to build R with vendor-supplied matrix packages (BLAS and LAPACK) and thus have multithreaded matrix computations in R (which helps a lot).
However if the input does not fit into memory, then it is somewhat problematic to run even the simplest algorithms.

We enable R folks to carry out computations directly on the database data (no need to move the data out). The in-memory limitation has been lifted for some algorithms (not all of course).

More is here
http://www.oracle.com/us/corporate/features/features-oracle-r-enterprise-498732.html

Kind regards,
DG

We definitely agree that R is a very useful statistical package. In fact, one of our users, Steve Lianoglou from Weill Cornell Medical College, ported our Shotgun solver package to R. Here is an excerpt from Steve's homepage which summarizes his ambivalent relationship with R:

I have a love/hate relationship with R. I appreciate its power, but at times I feel like I'm "all thumbs" trying to use it to bend the computer to my will (update: I actually don't feel like this anymore. In fact I'm quite comfortable in R now, but try not to get too frustrated if you're not yet ... it only took me about 6 months or so!).
If that feeling sounds familiar to you, these references might be useful.
  • The R Inferno [PDF]. "If you are using R and you think you're in hell, this is a map for you." I stumbled on this document after using R for about 8 months or so and I could still sympathize with that statement.
  • An R & Bioconductor Manual by Thomas Girke




Anyway, if you are a reader of this blog or a GraphLab user - send me a note!

Tuesday, November 22, 2011

Installing GraphLab on Fedora Core 16

FOR LAZY USERS: You can use the Amazon EC2 image ami-1510d87c, as instructed here.

0) Log in to your FC 16 machine.
On Amazon EC2, you can launch instance ami-0316d86a.

1) Installing libboost
yum install boost boost-devel

2) Installing itpp
Note: For FC 15 and earlier it is easy to set up itpp; see the packages here: http://koji.fedoraproject.org/koji/packageinfo?packageID=2142 and the explanation at http://sourceforge.net/apps/wordpress/itpp/download/
Using the command:
yum install itpp itpp-devel itpp-doc

Here I will explain how to install itpp manually for FC 16:

yum install lapack autoconf autogen libtool
ln -s /usr/lib64/liblapack.so.3 /usr/lib64/liblapack.so
ln -s /usr/lib64/libblas.so.3 /usr/lib64/libblas.so

yum groupinstall "Development Tools"
wget http://sourceforge.net/projects/itpp/files/itpp/4.2.0/itpp-4.2.tar.gz
tar xvzf itpp-4.2.tar.gz
cd itpp-4.2
./autogen.sh 
./configure --enable-debug --without-fft --with-blas=/usr/lib64/libblas.so.3  --with-lapack=/usr/lib64/liblapack.so.3 CFLAGS=-fPIC CXXFLAGS=-fPIC  CPPFLAGS=-fPIC
make && make install
Note: future versions of lapack may have a different library name; check your /usr/lib64
directory for the correct lapack/blas library names.

3) Install Mercurial
yum install mercurial

4) Install cmake
yum install cmake

5a) Install graphlab from mercurial
Go to the GraphLab download page and follow the download link to the mercurial repository.
Copy the command string "hg clone..." and execute it in your shell.


or 5b) Install graphlab from tgz file
Go to the GraphLab download page and download the latest release.
Extract the tgz file using the command: "tar xvzf graphlabapi_v1_XXX.tar.gz"
where XXX is the version number you downloaded.

6) configure and compile
cd graphlabapi
./configure --bootstrap --itpp_include_dir=/usr/local/include/  --itpp_dynamic_link_dir=/usr/local/lib/libitpp.so  --itpp_lapack_dir=/usr/lib64/
cd release/
make -j4

7) Test Graphlab
cd tests
export LD_LIBRARY_PATH=/usr/local/lib/
./runtests.sh

8) Optional (but recommended): install Octave.
yum install octave

Sunday, November 20, 2011

Installing GraphLab on RedHat Enterprise Server 6.1

0) Log in to your Red Hat Enterprise 6.1 machine (64 bit).
On amazon EC2 - you can launch instance ami-31d41658

1) Installing libboost
yum install boost boost-devel
yum groupinstall "Development Tools"

2) Install Mercurial
yum install mercurial

3) Install cmake
yum install cmake

4a) Install graphlab from mercurial
Go to the GraphLab download page and follow the download link to the mercurial repository.
Copy the command string "hg clone..." and execute it in your shell.


or 4b) Install graphlab from tgz file
Go to the GraphLab download page and download the latest release.
Extract the tgz file using the command: "tar xvzf graphlabapi_v1_XXX.tar.gz"
where XXX is the version number you downloaded.

5) configure and compile
cd graphlabapi
./configure 
cd release/
make -j4


Saturday, November 19, 2011

Installing GraphLab on Gentoo Linux



For Lazy Users: simply use the ami-d9e820b0 Amazon EC2 image. (More detailed instructions)


Detailed instructions:
Log in to your Gentoo Linux machine (I used the preinstalled Amazon EC2 image ami-e234c78b).

1) Install itpp using:
emerge lapack
USE="blas lapack debug doc" emerge itpp

2) Install boost using:
emerge boost

3) Install mercurial using:
emerge mercurial

4) Install cmake from sources:
Note: I was not able to execute "emerge cmake" because of some versioning problem in
the Gentoo repository, so I had to install it manually from sources as follows:
wget http://www.cmake.org/files/v2.8/cmake-2.8.6.tar.gz
tar xvzf cmake-2.8.6.tar.gz
cd cmake-2.8.6
./configure
make
make install

5a) Install graphlab from mercurial
Go to the GraphLab download page and follow the download link to the mercurial repository.

Copy the command string "hg clone..." and execute it in your Gentoo shell.

or 5b) Install graphlab from tgz file
Go to the GraphLab download page and download the latest release.
Extract the tgz file using the command: "tar xvzf graphlabapi_v1_XXX.tar.gz"
where XXX is the version number you downloaded.

6) configure and compile
cd graphlabapi
./configure --bootstrap --yes
cd release/
make -j4

7) Test Graphlab
cd tests
./runtests.sh

8) Optional: install Octave.
If you don't have access to Matlab, Octave is an open source replacement. Octave is useful for preparing input formats to GraphLab's collaborative filtering library and reading the output.
You can install Octave using the command:
emerge octave
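
To give a feel for what preparing the input means here, below is a minimal Octave sketch that writes a toy sparse rating matrix in Matrix Market format. It is an illustration only: it assumes the mmwrite.m helper script from the Matrix Market (NIST) website is on your Octave path (it is not bundled with Octave), and the exact file naming conventions expected by the collaborative filtering library are described in the GraphLab documentation.

 % Minimal Octave sketch (illustration only): write a toy user-by-item rating
 % matrix in Matrix Market format as input for the collaborative filtering library.
 % Assumes mmwrite.m from the Matrix Market (NIST) site is on the path.
 users  = [1 1 2 3 3];
 items  = [2 3 1 1 3];
 rating = [5 2 3 1 4];
 A = sparse(users, items, rating, 3, 3);   % 3 users x 3 items, 5 known ratings
 mmwrite('toy_ratings.mtx', A);            % training matrix in Matrix Market format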

Let me know how it went!

Thursday, November 17, 2011

SpotLight: Tianqi Chen - SVDFeature collaborative filtering project


Here is an interview with Tianqi Chen, who took 3rd place in ACM KDD CUP track 1 this year. Since he finished two places ahead of us (the Lebusishu team), I definitely appreciate his achievement! Tianqi is one of the main contributors to the SVDFeature project. In a nutshell, SVDFeature is an efficient C++ implementation of stochastic gradient descent that lets the user incorporate additional features into the model. It can scale to large models since it keeps only part of the problem in memory. However, it is not fully parallel yet.

1) First I wanted to congratulate you on your great achievement in the KDD CUP this year. In retrospect, can you explain what the crucial features of the SVDFeature project are that made it so successful?

One of the key factors in winning is the ensemble predictor, of course; our teammates at HKUST did a great job on that. On the other hand, SVDFeature makes it possible to build a powerful single predictor that takes a lot of information into account; we reported the best single method (RMSE=22.12 without blending) in track 1 using SVDFeature. The most interesting thing about SVDFeature is that the user can build variants of the matrix factorization model by defining features. In our final predictor for track 1, we incorporated user temporal information, implicit feedback, item neighborhood information, taxonomy information, etc. Many famous models such as time-SVD++ and SVD with neighborhood information can also be implemented by defining user temporal features and neighborhood features. Another important thing about SVDFeature is that we don't try to load all the training data into memory, which suits large-scale CF data such as the KDD Cup data (of course we still need memory for storing the model). For example, we can use less than 2GB of memory to get comparable performance on KDD Cup track 1 and track 2.

2) What is the project focus?

We really want to make the project a generic tool for developing informative collaborative filtering and ranking algorithms. By informative, we mean building matrix factorization algorithms that take the available information into account (taxonomy information, user temporal information, neighborhood, social network, users' previous click or browsing history, etc.). A similar setting has long existed in traditional classification tools such as LIBSVM, SVMLight, and SVMRank. These tools take in feature vectors and make predictions using the information given in the feature vector. We want SVDFeature to be the counterpart of such feature-based predictors in collaborative filtering and ranking. Users can define features to describe the user (e.g. the user's click and browsing history), the item (e.g. taxonomy information), and global effects (e.g. neighborhood effects) to build genuinely new variants of matrix factorization. In that sense, SVDFeature is suitable for those who want to develop new recommendation algorithms by defining features. It is also suitable for building context-aware recommendation algorithms.

3) Who are the contributors?

There are four authors on the project page; I am the main author of the project. We do welcome comments and feedback on the project.

4) What is the open source model?

The project is released under the Apache License, Version 2.0. The toolkit doesn't depend on extra libraries and compiles under both Linux (g++) and Windows (MSVC).

5) What are the advantages of using the project?

(1) Implement existing variants of matrix factorization (time-SVD++, neighborhood SVD) and composites of them.

(2) Build new CF algorithms (for ranking or for rating prediction) that make use of extra information by defining new features.

(3) Work as a team, building feature components independently and merging them to see the result. Since the input only involves different kinds of features, one person can work on user temporal features using Python while another generates item neighborhood features using C++. The generated features can be used to build independent models or merged to build a hybrid predictor.

(4) Large-scale data handling: the toolkit works on existing large-scale datasets such as KDD Cup and Netflix with frugal memory usage and good speed. If we build features for basic matrix factorization on KDD Cup track 1, here are some statistics for one pass over the dataset on an Intel(R) Core(TM)2 Quad CPU Q8400 @ 2.66GHz: 274 sec (64 factors), 357 sec (128 factors), with memory cost within 2GB.

6) Are there some algorithms which will be hard to express using your framework?

Yes. To be honest, SVDFeature currently supports the feature-based matrix factorization model, which involves user, item, and global features. This does cover a large range of existing algorithms. It currently doesn't cover models that involve multiple independent factorization parts, for example a tensor time bias that has time vs. user/item factors in addition to the user vs. item factors. It's not hard to add tweaks to the code to support specific features, and we do welcome requests for specific features. On the other hand, we want to think about a more abstract model that allows configuring multiple factorization parts; that is a future direction for the toolkit.

7) Can you point to some tutorials for people who want to try it out? Conceptually, what does the user need to define?

We have a detailed user reference manual and a technical report about the model that come with the project archive. The user needs to define a feature file, similar to the common SVM format, and provide a configuration file that specifies parameters such as the number of latent factors, the learning rate, etc. We have a demo folder in the toolkit archive with simple example scripts for the MovieLens data. We also have a non-trivial example that shows how to run experiments on the KDD Cup track 2 dataset. We also welcome questions from users.

8) Is the implementation parallel?

The training algorithm is not parallel, but we do create an independent thread to fetch the input data to keep up the speed, and we also use SSE2 to optimize some vector operations.

9) Can you give some performance numbers for kdd cup data?

For KDD Cup track 1, SVDFeature can build a model with RMSE 22.1x. We also did an after-the-fact experiment on KDD Cup track 2 (see all the scripts and code) that reached error=3.1x. The performance of SVDFeature really depends on the user's skill in defining features; KDD Cup track 2 is a good example of the performance of different feature compositions.

10) What is the difference between SVDFeature and libFM?

(1) Yes, it's very related to FM; we do respect the work on libFM. We pointed this out in the related work section of our tech report.
(2) The difference in the model: feature-based matrix factorization distinguishes between different types of features; it has factorization interactions between user features and item features, and linear-only effects for global features. FM is more like a tensor-factorization style model, which enforces factorization interactions between every pair of features. Distinguishing feature types allows linear-only features such as neighborhood information, and ensures that user temporal features and user implicit feedback only have factorization interactions with the item features. (DB: a rough sketch of the model is given right after this answer.)
(3) Differences in the toolkit: SVDFeature supports both rate prediction and feature-based collaborative ranking; it also has a fast algorithm for training with implicit feedback, large-scale data handling (very important if you define many features), and other minor extras, e.g. detailed options for parameter tuning and various loss functions (hinge loss, logistic loss, square loss).
(4) Some examples that may show these differences: try to implement the SVD++ model or SVD with neighborhood information, or see our non-trivial experiment on KDDCup track2 (http://apex.sjtu.edu.cn/apex_wiki/kddtrack2).
We call our model Feature-based Matrix Factorization because the idea descends more naturally from matrix factorization using features. We do not claim to have proposed a brand-new feature-based model; the credit shall go to FM. But we do try to make a model and a toolkit that are useful and can help in doing informative collaborative filtering and ranking, and SVDFeature at least works better in some scenarios :)
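
DB: to make the model in point (2) a bit more concrete, here is a rough sketch of the feature-based matrix factorization prediction rule as I understand it from the SVDFeature tech report (the notation is mine; see the report for the exact form):

$$ \hat{y} = \mu + \sum_j \gamma_j b^{(g)}_j + \sum_j \alpha_j b^{(u)}_j + \sum_j \beta_j b^{(i)}_j + \Big(\sum_j \alpha_j p_j\Big)^T \Big(\sum_j \beta_j q_j\Big) $$

Here $\gamma$, $\alpha$ and $\beta$ are the global, user and item feature vectors, the $b$'s are bias weights, and $p_j$, $q_j$ are latent factor vectors. Only the user and item features enter the factorized interaction term, which is exactly the distinction from FM described in point (2).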

News from 29 Nov, 2011:
Tianqi published a ready-to-run experiment on KDDCup track 1 at http://apex.sjtu.edu.cn/apex_wiki/kddtrack1

Wednesday, November 16, 2011

GAMP: Generalized Approximate Message Passing in GraphLab

About a week ago I was invited to visit Ohio State University by Phil Schniter. I heard about a lot of interesting research activities that I will probably report on in this blog in the future. Specifically, Phil is working on inference in linear models closely related to my own work.

I heard from Phil about the GAMP method, an extension of Montanari's AMP (approximate message passing) method. GAMP is supposed to overcome some of the difficulties in AMP, namely the Gaussian matrix assumption. Here is an excerpt from the GAMP wiki:
Generalized Approximate Message Passing (GAMP) is an approximate, but computationally efficient method for estimation problems with linear mixing. In the linear mixing problem an unknown vector, x, with independent components, is first passed through linear transform z=Ax and then observed through a general probabilistic, componentwise measurement channel to yield a measurement vector y. The problem is to estimate x and z from y and A. This problem arises in a range of applications including compressed sensing. Optimal solutions to linear mixing estimation problems are, in general, computationally intractable as the complexity of most brute force algorithms grows exponentially in the dimension of the vector. GAMP approximately performs the estimation through a Gaussian approximation of loopy belief propagation that reduces the vector-valued estimation problem to a sequence of scalar estimation problems on the components of the vectors x and z.
This project is intended to develop the theory and applications of GAMP. We also provide open source MATLAB software for others who would like to experiment with the algorithm.
Below is a simplified version of the GAMP Matlab code I got from Phil. It is a highly simplified instance of the GAMPmatlab code, which includes a library of signal priors and matrix constructions, as well as stepsize control (to improve stability when the matrix has highly correlated columns). Automatic tuning of GAMPmatlab parameters is provided by the EM-BG-GAMP code.

% this m-file uses GAMP to estimate the matrix X (of dimension NxT) from the noisy matrix observation Y=A*X+W of dimension MxT.  here, A has independent Gaussian entries with matrix mean and variance (A_mean,A_var), W has iid Gaussian entries with scalar mean and variance (0,psi), and X has iid Bernoulli-Gaussian entries with scalar activity rate "lam" and with non-zero elements having scalar mean and variance (theta,phi).

 % declare signal parameters
 N = 1000;      % # weights     [dflt=1000]
 M = 400;   % # observations    [dflt=400]
 T = 10;   % # dimensions per observation   [dflt=10]
 lam = 0.01*ones(N,T);  % BG: prior probability of a non-zero weight [dflt=0.01]
 theta = 0*ones(N,T);  % BG: prior mean of non-zero weights  [dflt=0]
 phi = 1*ones(N,T);   % BG: prior variance of non-zero weights [dflt=1]
 SNRdB = 15;   % SNR in dB      [dflt=15]

 % declare algorithmic parameters
 iterMax = 50;   % maximum number of AMP iterations  [dflt=20]

 % generate true Bernoulli-Gaussian signal (i.e., "weights")
 B_true = (rand(N,T)>1-lam);   % sparsity pattern
 X_true = B_true.*(theta+sqrt(phi).*randn(N,T));% sparse weights

 % generate measurement (i.e., "feature") matrix
 A_mean = randn(M,N)/sqrt(M);       % element-wise mean 
 A_mean2 = A_mean.^2;
 A_var = abs(A_mean).^2;   % element-wise variance
 A_true = A_mean + sqrt(A_var).*randn(M,N); % realization

 % generate noisy observations
 psi = norm(A_true*X_true,'fro')^2/(M*T)*10^(-SNRdB/10);   % additive noise variance 
 Y = A_true*X_true + sqrt(psi).*randn(M,T);  % AWGN-corrupted observation

 % initialize GAMP
 X_mean = lam.*theta;  % initialize posterior means
 X_var = lam.*phi;  % initialize posterior variances
 S_mean = zeros(M,T);  % initialized scaled filtered gradient
 NMSEdB_ = NaN*ones(1,iterMax); % for debugging: initialize NMSE-versus-iteration

 % iterate GAMP
 for iter=1:iterMax,

   % computations at observation nodes
   Z_var = A_mean2*X_var;
   Z_mean = A_mean*X_mean;
   P_var = Z_var + A_var*(X_var + X_mean.^2);
   P_mean = Z_mean - Z_var.*S_mean;
   S_var = 1./(P_var+psi);
   S_mean = S_var.*(Y-P_mean);
   % computations at variable nodes
   R_var = 1./(A_mean2'*S_var + eps);
   R_mean = X_mean + R_var.*(A_mean'*S_mean);
   gam = (R_var.*theta + phi.*R_mean)./(R_var+phi);
   nu = phi.*R_var./(R_var+phi);
   pi = 1./(1+(1-lam)./lam .*sqrt((R_var+phi)./R_var) .* exp( ...
    -(R_mean.^2)./(2*R_var) + ((R_mean-theta).^2)./(2*(phi+R_var)) ));
   X_mean = pi.*gam;
   X_var = pi.*(nu+gam.^2) - (pi.*gam).^2;

   % for debugging: record normalized mean-squared error 
   NMSEdB_(iter) = 20*log10(norm(X_true-X_mean,'fro')/norm(X_true,'fro'));
 end;% iter

 % for debugging: plot posterior means, variances, and NMSE-versus-iteration
 subplot(311)
   errorbar(X_mean(:,1),sqrt(X_var(:,1)),'.');
   hold on; plot(X_true(:,1),'k.'); hold off;
   axis([0,size(X_true,1),-1.2*max(abs(X_true(:,1))),1.2*max(abs(X_true(:,1)))])
   ylabel('mean')
   grid on;
   title('GAMP estimates')
 subplot(312)
   plot(10*log10(X_var(:,1)),'.');
   axe=axis; axis([0,N,min(axe(3),-1),0]);
   ylabel('variance [dB]');
   grid on;
 subplot(313)
   plot(NMSEdB_,'.-')
   ylabel('NMSE [dB]')
   grid on;
   title(['NMSE=',num2str(NMSEdB_(iter),3),'dB '])
 drawnow;

Today I implemented GAMP in GraphLab. The parallel implementation is pretty straightforward. The main overhead is the matrix product with A (marked in red) and with A' (marked in green). Each of the variable and observation nodes is a GraphLab node. There are two update functions, one for the variable nodes and one for the observation nodes. The code can be found in the GraphLab linear solvers library.

Here is a snapshot from the implementation of the variable node update function:
void gamp__update_function(gl_types_gamp::iscope &scope,
         gl_types_gamp::icallback &scheduler) {

  /* GET current vertex data */
  vertex_data_gamp& vdata = scope.vertex_data();
  unsigned int id = scope.vertex();
  gamp_struct var(vdata, id);

  
/*
 *  computations at variable nodes
*/
   memset(var.R_var, 0, config.D*sizeof(sdouble));
   memset(var.R_mean, 0, config.D*sizeof(sdouble));
   for (unsigned int k=ps.m; k< ps.m+ps.n; k++){
      vertex_data_gamp  & other= g_gamp->vertex_data(k);
      gamp_struct obs(other,k);

      for (int i=0; i< config.D; i++){
        var.R_var[i] += pow(var.AT_i[k-ps.m],2) * obs.S_var[i];
        var.R_mean[i] += var.AT_i[k-ps.m] * obs.S_mean[i];
      }
  }

  for (int i=0; i< config.D; i++){
     var.R_var[i] = 1.0 / (var.R_var[i] + ps.gamp_eps);
     var.R_mean[i] = var.X_mean[i] + var.R_var[i] * var.R_mean[i];
     var.gam[i] = (var.R_var[i]*ps.gamp_theta + ps.gamp_phi*var.R_mean[i]) / (var.R_var[i] + ps.gamp_phi);
     var.nu[i] = ps.gamp_phi*var.R_var[i] / (var.R_var[i] + ps.gamp_phi);
     var.pi[i] = 1.0 / (1.0 + (1.0 - ps.gamp_lam) / ps.gamp_lam * sqrt((var.R_var[i]+ps.gamp_phi)/var.R_var[i]) * exp( \
      -(powf(var.R_mean[i],2)) / (2*var.R_var[i]) + (powf(var.R_mean[i] - ps.gamp_theta,2)) / (2*(ps.gamp_phi+var.R_var[i])) ));
     var.X_mean[i] = var.pi[i] * var.gam[i];
     var.X_var[i] = var.pi[i] * (var.nu[i]+ pow(var.gam[i],2)) - pow(var.pi[i] * var.gam[i],2);
  }
}
The code is slightly longer than the Matlab version, and looks somewhat uglier since the vectorized loop operations have to be explicitly expanded. I am not yet happy with the performance, since Matlab is highly optimized for dense matrix products. I still need to think about how to improve it - maybe by blocking and using LAPACK...

Monday, November 14, 2011

SpotLight: Zeno Gantner - MyMediaLite


Here are some more details I got from Zeno Gantner about his collaborative filtering project, MyMediaLite.
> 1) What is the focus of your library?
The focus of MyMediaLite is to provide
  1. a collection of useful recommender system algorithms,
  2. a toolkit for experimentation with such methods, and
  3. practical tools for using and deploying the methods in the library.

There is no single specific target audience - we would like to cater to both researchers and practitioners, as well as people who teach or learn about recommender systems.

Our long-term vision is to be a kind of Weka for recommender systems. We are not there yet. For example, we only have command-line tools that expose most of the library's functionality, but no fancy GUI program.

While scalability and performance are important to us, we do not sacrifice usability for them. Thus, the library offers many useful additions beyond the raw algorithm implementations:
  • incremental updates for many models
  • storing and reloading of learned models
  • support for different text-based file formats, database support

What MyMediaLite is not: an off-the-shelf package that you can plug into your online shop. But using MyMediaLite for your online shop relieves you of having to implement a recommendation algorithm. You can use an efficient, well-tested implementation from the library instead, which can save you a lot of work.

So if you are a .NET developer and are looking for a recommender system/collaborative filtering solution, MyMediaLite may be worth a look ...

> 2) Who are the contributors?
Originally, there were three authors in our lab who worked on MyMediaLite. Since the project "MyMedia", which was responsible for MyMediaLite's birth, ended, I have been the main author. A guy from BBC R&D (hello, Chris!) has ported parts of the library to Java (available on our download page).

I occasionally get bug reports and patches from users, but there has not been a major outside contribution, e.g. a new recommender algorithm, yet. But I would be very happy to accept such contributions. If you (the readers of Danny's blog) want to contribute, there is a list of interesting methods that could be implemented for MyMediaLite in our issue tracker.

The development process is really open. I keep the source code on github and gitorious. There is also a Google Group, a public issue tracker, and plenty of documentation, so it should be rather easy to start working on MyMediaLite.

My current goal is to turn MyMediaLite into a community project, that's why I go to conferences to present MyMediaLite, write about it on Twitter, do interviews, etc.

> 3) What are the main implemented algorithms?
The library addresses two main tasks: rating prediction (very popular in the public eye because of the Netflix Prize) and item recommendation from implicit/positive-only feedback. The latter task is particularly important in practice, because you always have that kind of data (clicks, purchase actions), while you have to ask your users for ratings, and they will not always give them to you.

So for both tasks we have simple baseline algorithms (average ratings, most popular item, etc.) that you can use to check whether you screwed up: a rating algorithm should always have more accurate predictions than the item average, and an item recommendation algorithm should always come up with better suggestions than the globally most popular items.

Beyond that, we have several variants of k-nearest-neighborhood (kNN) algorithms for both tasks, both based on interaction data (collaborative filtering) and on item attribute data (content-based filtering).

Most importantly, we have matrix factorization techniques for both tasks. If you have enough data, those are the models to use. For rating prediction, we have a straightforward matrix factorization with SGD training that also models user and item biases. For item recommendation, we have weighted regularized matrix factorization (WR-MF), which is called weighted-alternating least squares in GraphLab, and BPR-MF, which optimizes for a ranking loss (AUC).
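
DB: for readers unfamiliar with these two objectives, here is a rough sketch in my own (simplified) notation; see the WR-MF and BPR papers for the exact formulations. WR-MF minimizes a confidence-weighted squared loss

$$ \sum_{u,i} c_{ui}\,(r_{ui} - p_u^T q_i)^2 + \lambda\,(\|P\|_F^2 + \|Q\|_F^2), $$

where the confidence weight $c_{ui}$ is larger for observed (positive) entries, while BPR-MF maximizes a smooth ranking objective over triples where user $u$ interacted with item $i$ but not with item $j$:

$$ \sum_{(u,i,j)} \ln \sigma\,(p_u^T q_i - p_u^T q_j) - \lambda\,\|\Theta\|^2 . $$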

> 4) Out of the implemented algorithms, which 2-3 perform best (on datasets like Netflix or KDD Cup)?
I would go for the matrix factorization techniques mentioned above.

> 5) Do you have some performance results (for example on netflix data) in terms of speed and accuracy?
We have some results on our website:
http://www.ismll.uni-hildesheim.de/mymedialite/examples/datasets.html
http://www.ismll.uni-hildesheim.de/mymedialite/examples/recsys2011.html

For single-core matrix factorization, on an Intel(R) Xeon(R) CPU E5410 at 2.33 GHz, one iteration over the Netflix data takes between 2 minutes (10 factors) and 10 minutes (120 factors), depending on the number of factors.

> 6) Which parts of the library are serial and which are parallel?
Most parts of the library are serial. We have parallelized some parts that are really easy to parallelize, e.g. cross-validation, where the experiment on each fold is entirely independent of the other folds.

We have one parallel SGD method for rating prediction MF. It is based on work presented at this year's KDD by Rainer Gemulla, which basically separates the training examples into independent chunks and then performs parallel training on those chunks.

I implemented the method because I saw the talk and liked the paper, and I wanted to try how parallelization in C# works - it is very, very simple.

I will try to add more parallelized algorithms, but it is not the main development focus. Most feature requests from users are concerned about evaluation protocols and measures, and features of the surrounding framework.

> 7) What do I need to install if I want to use the library?
The byte code binaries are the same on all platforms. You need a .NET 4.0 runtime. On Windows, Microsoft .NET is already installed, so all you need to do is to download the library and tools and run them. On Mac OS X and Linux (and other Unix variants), you need Mono 2.8 or later, which you may have to install. The latest Ubuntu comes with Mono 2.10.5 installed by default, so there you can also just run it.

We have a tutorial on our website on how to run the command-line programs.

If you want to build the library, you also need MonoDevelop, which runs on all major platforms. You can download it for free. On Windows, you can use Visual Studio 2010 instead.

> 8) Some companies are limited by the GPL license. Did you consider moving to a more industry-friendly license, or do you target mainly academic usage?
While I do not rule out a license change in the future, I am quite happy with the current license. MyMediaLite was derived from a codebase with a much more restrictive license (educational/research use only).

The GPL is actually quite industry-friendly, although not as permissive as the Apache or
BSD licenses. The Linux kernel is also licensed under the GNU GPL. Is there a business that does not run Linux because of its license?

I am not a lawyer, but the GPL allows companies and all other users to do everything they need with MyMediaLite:
  • run, modify, and distribute the code
  • create derived works
  • sell it
  • use it in their online shop to make more money
  • set it up as a web service, and sell recommendations as a software
    service

If you modify the library, add new features, etc., you are not required to release its source code as long as you do not distribute the software itself.

Basically there are only two things that you cannot do with MyMediaLite under the GPL:
  • redistributing (e.g. selling) the code or derived works and
    denying your users the rights that we gave to you
  • sue us over some dubious software patents you are holding

time-SVD++ is now implemented in GraphLab

One of the algorithms that performed best in the KDD CUP contest this year is Koren's time-SVD++ algorithm. You can find more details in a previous post.

To summarize, according to our testing, we got the following validation RMSE on the KDD CUP data
(the lower the prediction error, the better):

 1  Neighborhood model, adjusted cosine (AC) similarity    23.34
 2  ALS                                                     22.01
 3  Weighted ALS                                            18.87**
 4  BPTF                                                    21.84
 5  SGD                                                     21.88
 6  SVD++                                                   21.59
 7  Time-aware neighborhood                                 22.7
 8  time-SVD                                                21.41
 9  time-SVD++                                              20.90
10  MFITR                                                   21.30
11  time-MFITR                                              21.10
12  Random forest                                           26.0


(** Weighted ALS did not perform well on the test data.)
As can be seen from the above table, time-SVD++ was the best performing single algorithm
on the KDD CUP data. I also got the same impression when talking to Yehuda Koren.
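
As a quick reminder, the time-SVD++ prediction rule has roughly the following form (see Koren's temporal dynamics paper for the exact parameterization of the time-dependent terms):

$$ \hat{r}_{ui}(t) = \mu + b_u(t) + b_i(t) + q_i^T\Big(p_u(t) + |N(u)|^{-1/2}\sum_{j\in N(u)} y_j\Big), $$

where $b_u(t)$, $b_i(t)$ and $p_u(t)$ are time-dependent user/item biases and user factors, and $N(u)$ is the set of items for which user $u$ provided implicit feedback.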

time-SVD++ is now implemented as part of GraphLab's collaborative filtering library.
You are all welcome to try it out!

Saturday, November 12, 2011

GraphLab feedback this week

I got this email from Zeno Gantner:

"Hi Danny,

how are things? Any new enhancements for GraphLab planned? Its CF
library really looks impressive. Are there any plans for .NET/Mono bindings?

We had a poster about MyMediaLite at ACM RecSys in Chicago, and
mention GraphLab in the "related work" section:
http://ismll.de/mymedialite/download/mymedialite-recsys2011.pdf

Best regards from Germany,
Zeno"


I looked at the poster above, and the system actually looks pretty interesting. One thing they have covered well is different cost functions and evaluation metrics like AUC and precision/recall. I will probably follow up in this blog once I learn more.

Last week I visited Ohio State University and had some very interesting meetings. One of the recent GraphLab users there is Josh Ash, a research programmer who is working on hyperspectral imaging. Here is the feedback I got from him:

"Hi Danny,

It was great meeting you as well, and an especially unusual coincidence that you visited the very week after I discovered and started using GraphLab! For the user's page, you can note that I am using GraphLab for "Bayesian analysis of hyperspectral imagery." I am still building out the model (mathematically and in code), but so far I am very happy with the usability and results from GraphLab. Currently I'm getting a 160x speedup using Java-GraphLab on an 8-core machine versus Matlab single-core. In addition to 8 cores, part of the speedup is due to better implementation (2nd time around is always better) and part due to Java being more efficient on looping over discrete random variables.

I will send more detailed notes about my GL experiences (installation, usage, suggested features, etc.) at a later date. Shortly, I'll also likely create a small web presence for this project so I can pass along that address as well. In the meantime, if you want a link for the 'users' page, you can use my home page: http://www2.ece.ohio-state.edu/~ashj/

Cheers!
Josh"


Justin Gardin from Stony Brook University, NY, sent me the following description of his project. He finds GraphLab "amazingly fast."
"I am working on bioinformatic sequence motif analysis. The concept is pretty simple actually. Break up a bunch of sequences which are known to contain a binding site(binding site sequence unknown) in to a matrix of word counts. Each rows equals to a single permutation of ACGT in K slots. Each column equals distance along the sequence. A entry equals how many times the specific "word" occurs in that window of the sequences. Apply non-negative matrix factorization with dimension reduction. Set the reduction equal to the number of motifs you expect to come out of the data (usually trial and error). Then the U and V matrices represent the contribution of each kmer, at each position (respectively), to the motifs in the dataset. If there are big peaks in the U and V matrices at positions and similar words, then there exists a conserved motif in the dataset.

I follow up this analysis with a special case of Gibbs sampling, which is initialized with the expected sequence motif from the U matrix and weighted by the V matrix. Works really well.

Justin Gardin"
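
To make the factorization step concrete, here is a toy Octave sketch (my illustration, not Justin's actual pipeline) using the classic Lee-Seung multiplicative updates for non-negative matrix factorization of a kmer-by-position count matrix:

 % Toy Octave sketch (illustration only): NMF of a kmer-by-position count matrix
 % via Lee-Seung multiplicative updates. W plays the role of the "U" matrix
 % (kmer contributions) and H the role of "V" (position contributions).
 C = abs(round(5 * rand(64, 200)));          % toy count matrix: 64 kmers x 200 positions
 k = 3;                                      % number of motifs expected in the data
 W = rand(size(C, 1), k);
 H = rand(k, size(C, 2));
 for it = 1:200
   H = H .* (W' * C) ./ (W' * W * H + eps);  % update position loadings
   W = W .* (C * H') ./ (W * H * H' + eps);  % update kmer loadings
 end
 % Large entries in a column of W and the matching row of H hint at a conserved motif.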


Mike Davies from Cambridge University is documenting his GraphLab experience in his blog.

Tuesday, November 8, 2011

GraphLab plays poker?

This week I am visiting CMU. I always enjoy hearing about new projects and seeing if I can help them utilize GraphLab to speed up their solutions and to allow solving larger problems than were previously possible.

One of the interesting projects I heard about is from Xi Chen of the machine learning department and Qihang Lin of the Tepper School of Business, who are working on finding the best strategies for winning poker games. Here is how Qihang describes the problem:

"Strictly speaking, in a poker game problem, we try to solve the Nash equilibrium which is the best strategy for each player. This means we are guiding people to win the poker game. According to the definition of a Nash equilibrium, if any one deviate from the equilibrium stategy, he will definitely lose money. However, poker game is a zero-sum game, which means the total money of all players are fixed. Therefore, if one player loss money by not playing equilibirum strategy, the other player will win the money from him. So, in order to beat the opponent in a poker game, we just need to compute the Nash equilibrium and play according to it.

There are many poker games with different rules. We are only interested in the family of poker games that consist of rounds of new card dealing and betting, for example Texas Hold'em poker. This kind of game is also called a sequential game with ordered signals in game theory. Because of the size of the problem, no one can compute the exact Nash equilibrium for Texas Hold'em. The best known strategy was found by Andrew Gilpin, who was a PhD student at CMU advised by Prof. Tuomas Sandholm.
Here are the papers: http://www.cs.cmu.edu/~sandholm/gs3.aaai07.pdf
http://www.cs.cmu.edu/~gilpin/papers/iterated.aaai08.pdf

The first paper talks about how to reduce the size of the game by shrinking the states. The second paper talks about a gradient-based algorithm for solving the Nash equilibrium.

Gilpin and Sandholm also implemented Texas Hold'em player software, called GS3.

You can also play against their software here.

The main idea of their papers is to first formulate the problem as a saddle point problem, min max x^T A y, and apply the algorithm by Nesterov (2005), "Excessive gap technique in nonsmooth convex minimization". The advantage of this algorithm is that it only uses matrix-vector multiplication, therefore it can handle problems of huge size. The disadvantage is that the algorithm needs O(1/e) iterations to find a solution that is e-optimal. This is a slow convergence rate, so the error e of the solution found by Gilpin is usually quite large. Therefore, the solution they implemented in GS3 is not the perfect strategy. There is hope that we can win their money by using the perfect Nash equilibrium strategy.

We are trying to use an interior point method to solve the saddle point problem, min max x^T A y, because to find an e-optimal solution an interior point method needs only O(log(1/e)) iterations. This is a much faster convergence rate, which can guarantee us a more accurate solution than GS3. However, the challenge is how to make the interior point method scale to a poker game of this size. We hope your linear system solver can save us.

The size I showed you is the size of a poker game called Rhode Island Poker. Nobody plays it in practice, but academics invented this game because it is similar to Texas Hold'em, only smaller. They can test and compare their strategies on this smaller game.

Let me explain the structure and the size of the matrix A. A is called the payoff matrix. The game happens as follows. A private card is dealt to each player, and they start betting money. The betting is a smaller game which has 9 possible outcomes. For example: player one bets 5 dollars, player two raises the bet to 10 dollars, and then player one calls the raise by adding 5 more dollars to his own bet; then the betting round finishes. This is just one of the 9 possible courses of a betting round. After the betting round, a public card is dealt in the middle of the table and it is shared by the two players. Then the same betting round repeats, again with 9 possible outcomes. At the end of a Rhode Island Poker game, each player has one private card, there are two public cards on the table, and the players have bet three times in total.


Each column of A corresponds to a scenario path observed by player one. Each row of A corresponds to a scenario path observed by player two. So we just label the columns and rows by the scenario path. Obviously, the scenarios observed by the two players differ because of the private card. For example, player one gets an A and player two gets a K as private cards. Then they play the betting round and finish it with outcome 1, for instance. Then the public card is dealt, which is a Q. Then they bet and finish betting with outcome 2. Then the last public card is dealt, which is a J, and they bet and finish the betting round with outcome 3. In this example, the scenario path for the first player is A1Q2J3 and the one for player two is K1Q2J3. The game is finished and the first player gets money equal to the number in the A1Q2J3 column and K1Q2J3 row of matrix A. Therefore, you can see the size of the problem is exactly the number of possible scenarios of the game observed by each player. A will be a matrix with 52*9*51*9*50*9 rows and columns.

Now, let's look at the sparsity structure of matrix A. The scenario paths observed by the two players can only differ in the first label, which is the private card; see the example A1Q2J3 and K1Q2J3. Therefore, the matrix A must be block diagonal. For example, the element of A in column A1Q2J3 and row K1K2J3 must be zero, because the public cards must be the same for both players; A1Q2J3 and K1K2J3 cannot be the scenarios observed by the two players at the same time. Therefore, there should be 9*52*9*51*9 blocks. Inside each block, the column is labeled by &***** and the row is labeled by $*****, where ***** is the shared segment of the label but & and $ are different. & and $ each have 52 possible choices, so each block would have a size of 52 by 52. In fact, each block has a size of 52*9 by 52*9 because I re-sort the order of the columns and rows of A in order to make it compatible with the block structure of the constraint matrices E and F. After the re-sorting, we group some of the blocks, so that we have 52*9*51*9 blocks but the size of each block is increased to 52*9.
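
To summarize the optimization problem in one formula: in the sequence-form representation, the equilibrium computation is (roughly, in my notation) the saddle-point problem

$$ \min_{x \ge 0,\; Ex = e} \;\max_{y \ge 0,\; Fy = f}\; x^T A y , $$

where x and y are the two players' realization-plan (strategy) vectors, A is the payoff matrix described above, and E, F (with right-hand sides e, f) are the constraint matrices encoding each player's decision tree.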


Overall, it is going to be very exciting to try and solve this problem in GraphLab.
Using a back-of-the-envelope calculation, it seems that each Newton step involves a matrix with 25,000,000,000 non-zeros.
I will post more once we make more progress.

More on GraphLab SVD

As I am getting more questions about using GraphLab SVD, I thought I would elaborate some more.

This is an email I got from Ehtsham Elahi, from a startup called change.org (http://change.org):

Hi Danny, thank you for your great help; it is working now on Mac OS :) and performance is much, much better than Mahout. I'll send the detailed benchmark...
I have talked to my manager and he agrees that it would be cool to have our Mahout vs. GraphLab results published. The organization is change.org, and the matrix on which we have been testing both Mahout and GraphLab is a 5 million by 376 matrix. Approximately 70 million of the total possible values are non-zero.

Best,
Ehtsham


Here is more nice feedback, from Nihar Sharma at Caltech:
I am using GraphLab for a project course in ML here at Caltech and I am excited to see what kind of results it has on our dataset. Separately, I am also working on classification problems with the Astronomy department here and will definitely test GraphLab and let you know how it does on those datasets.

-Nihar



Here are some common questions I am getting:
1) How do I read the GraphLab SVD output?
If you are using the matrix market input format, there will be four output files:
- datasetname.V
- datasetname.U
- datasetname.EigenValues_AAT
- datasetname.EigenValues_ATA

The matrix A ~= U * diag(EigenValues_AAT) * V'.
Alternatively, the matrix A ~= U * diag(EigenValues_ATA) * V'.

In theory the eigenvalues of AA^T and A^TA should be identical, but because of numerical error there may be some differences, especially for the smaller eigenvalues.
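
For example, here is a minimal Octave sketch of reading the matrix market output files and checking the low-rank reconstruction. It assumes the mmread.m helper from the Matrix Market (NIST) site is on your path, and 'datasetname' stands for whatever your input file was called:

 % Minimal Octave sketch (illustration only): load the Matrix Market output
 % files and form the low-rank approximation A ~= U * diag(d) * V'.
 % Assumes mmread.m from the Matrix Market (NIST) site is on the path.
 U = mmread('datasetname.U');
 V = mmread('datasetname.V');
 d = full(mmread('datasetname.EigenValues_AAT'));  % vector of eigenvalues
 A_approx = U * diag(d(:)) * V';                   % low-rank approximation of A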

If you are using the itpp output format, then you should load your output file using itload('outputfilename') in Matlab/Octave, and then you will see the variables U, V, EigenValues_AAT, and EigenValues_ATA. For example, after running 5 Netflix iterations you will see:
>> itload('netflix-20-1.out');
>> whos
  Name                    Size          Bytes   Class     Attributes

  EigenValues_AAT          6x1              48   double
  EigenValues_ATA          6x1              48   double
  V                     3561x6          170928   double
  U                    95526x6         4585248   double


2) How do I compute predictions using GraphLab SVD?
The way to compute a prediction using SVD is to take the matching row of U and the matching row of V, and weight their product by the diagonal (the singular values). Here is a Matlab example:
>> A=rand(3,4);
>> [u,d,v]=svd(A*A');
>> A*A'

ans =

1.0738 0.7018 0.8807
0.7018 1.1564 0.8954
0.8807 0.8954 1.0851

>> u*d*v'

ans =

1.0738 0.7018 0.8807
0.7018 1.1564 0.8954
0.8807 0.8954 1.0851


And now you compute predictions as follows. Assume you want the entry in the first row and first column of A*A'. In this case you do:
>> u(1,:)*(diag(d).*v(1,:)')

ans =

1.0738

3) What is your recommended matrix factorization method for collaborative filtering?
The best algorithm we found to perform on the KDD CUP data was time-SVD++ - I am going to implement it soon in GraphLab so you can try it out as well. See our workshop paper, linked from here: http://bickson.blogspot.com/2011/08/efficient-multicore-collaborative.html