Monday, May 6, 2013

Speeding up parallel ALS

My collaborator Joey Gonzalez sent me the following paper:
H.-F. Yu, C.-J. Hsieh, S. Si, I. S. Dhillon, Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems. IEEE International Conference on Data Mining (ICDM), December 2012.

The paper won the best paper award at ICDM 2012. It is a well written paper proposing a technique for speeding up ALS (alternating least squares) when executed in parallel, by using coordinate descent. It also has a good review of the two most common building blocks in matrix factorization algorithms, alternating least squares and stochastic gradient descent, and a discussion of how to parallelize them.
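For readers who want to see the ALS building block spelled out, here is a minimal sketch of a single ALS half-step in Python/NumPy. The function name and the variable layout (A as the rating matrix, observed as a boolean mask of known entries, W and H as user and item factors, lam as the regularization weight) are my own illustration and not code from either paper:

```python
import numpy as np

def als_update_users(A, observed, W, H, lam):
    """One ALS half-step: refit every user factor while the item factors H stay fixed."""
    num_users, k = W.shape
    for i in range(num_users):
        idx = np.where(observed[i])[0]      # items rated by user i
        if idx.size == 0:
            continue
        Hi = H[idx]                          # (n_i x k) item factors for those items
        # Regularized least squares for user i: (Hi^T Hi + lam * I) w_i = Hi^T a_i
        W[i] = np.linalg.solve(Hi.T @ Hi + lam * np.eye(k), Hi.T @ A[i, idx])
    return W
```

The symmetric half-step refits the item factors H while W stays fixed, and the two half-steps alternate until convergence.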

When I read the paper, it reminded me of earlier work by Steffen Rendle, which I shared with my blog readers here:
Steffen Rendle, Zeno Gantner, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2011. Fast context-aware recommendations with factorization machines. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '11). ACM, New York, NY, USA, 635-644.

Basically, it is the same construction. (Unfortunately, it seems the ICDM authors were not aware of the earlier SIGIR paper.)

Here is the algorithm update rule from Steffen's paper:
And here is the update rule from the ICDM paper:
Aside from slightly different notation, it is basically the same algorithm. And the conclusions of both papers are identical: this construction improves the performance of parallel ALS and compares favorably with SGD.
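In my own notation (close to the ICDM paper's, with A the rating matrix, Ω_i the set of items rated by user i, w_i and h_j the user and item factor vectors, and λ the regularization weight), the shared one-variable update both papers arrive at is, roughly:

w_{it} \leftarrow \frac{\sum_{j \in \Omega_i} \left( A_{ij} - \mathbf{w}_i^{\top}\mathbf{h}_j + w_{it} h_{jt} \right) h_{jt}}{\lambda + \sum_{j \in \Omega_i} h_{jt}^{2}}

The analogous update for h_{jt} follows by symmetry.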

I asked Prof. Inderjit Dhillon from the University of Texas at Austin about the relation between the papers and got the following reply:
Solving the one-variable problem is simple, and was not intended to count as a "contribution" (we had similar 1-variable solutions in an earlier paper on coordinate descent for NMF, which appeared in KDD 2011). The main contribution of our ICDM paper is to present a parallel coordinate descent algorithm (on multi-core and distributed memory machines) for matrix factorization for missing value estimation. The results show it outperforms ALS and SGD, which is what most people seem to use in industry (and which is what prompted me to investigate co-ordinate descent for the problem).
To anyone interested in reading more, my recommendation is to read the ICDM paper first for a general overview, and then the SIGIR paper for the specifics.
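To make the parallelism point from Prof. Dhillon's reply concrete, here is a minimal Python/NumPy sketch of the one-variable coordinate descent loop as I understand it (the names and layout are mine, not the authors' code). Because each row of W touches only its own residuals, different users can be refit independently, which is what makes the method easy to run on multi-core and distributed machines:

```python
import numpy as np

def cd_update_users(A, observed, W, H, lam):
    """Refit each entry W[i, t] in closed form while everything else stays fixed."""
    num_users, k = W.shape
    for i in range(num_users):
        idx = np.where(observed[i])[0]      # items rated by user i
        if idx.size == 0:
            continue
        r = A[i, idx] - H[idx] @ W[i]        # residuals on the observed entries
        for t in range(k):
            h = H[idx, t]
            old = W[i, t]
            new = ((r + old * h) @ h) / (lam + h @ h)
            r -= (new - old) * h             # keep residuals consistent
            W[i, t] = new
    return W
```

The outer loop over users is embarrassingly parallel; the symmetric pass over items can be parallelized the same way.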

Those of you who are interested in an additional parallel coordinate descent method, for the L1 loss function, are welcome to take a look at our ICML paper described here.
