Large Scale Machine Learning and Other Animals: March 2013

Wednesday, March 27, 2013

The 2nd GraphLab workshop is coming up!

An update: I just got a limited number of discount codes for this blog's readers. The first few to email me will get a 30% discount, in addition to the early bird rate!

Following the great success of the first GraphLab workshop, we have started to organize this year's event, to be held in July in the Bay Area. As a reminder, last year we planned a 15-20 person event, which ended up drawing 300+ researchers from 100+ companies.

The main aim of this year's workshop is to bring together top researchers from academia and top data scientists from industry, with a special focus on large scale machine learning on sparse graphs.

The event will take place Monday July 1st, 2013 in San Francisco. Early bird registration is now open!

Preliminary agenda

A (preliminary) list of our program committee:

Deepak Agarwal, LinkedIn
John Mark Agosta, Toyota InfoTechnology Center USA
Alex Averbuch, Neo4j
Eric Bieschke, Pandora Internet Radio
Jim Blomo, Yelp!
Mauricio Breternitz, AMD
Matthias Broecheler, Aurelius
Igor Carron, Nuit Blanche & Space Engineering Research Center
Avery Ching, Facebook
Jike Chong, CMU SV Campus
Brad Cox, Technica Corporation
Yogesh Dalal, Ebay
Ranjit Desai, Adobe
Ted Dunning, MapR
Michael Draugelis, Lockheed Martin Corporation
Frank Elliot, Opera Solutions
Baldo Faieta, Adobe
Hulya Emir-Farinas, Greenplum
Carlos Guestrin, University of Washington
Andy Harbick, Rosetta Stone
Tamir Hazan, Toyota Technological Institute at Chicago
Steven Hillion, Alpine Data Labs
Nilesh Jain, Intel Labs
Lee Jones, Cisco
Nick Kolegraff, Rackspace
Edo Liberty, Yahoo! Labs
Ben Lorica, O'Reilly
Michael Mahoney, Stanford
Norbert Martinez, Sparsity Technologies
Charles Martin, Gerson Lehrman Group
Vahab Mirrokni, Google Research
Ash Munshi, Knobout Inc.
Jan Neumann, Comcast
Andrew Nystrom, Thomson Reuters
Josep Lluís Larriba Pey, Universitat Politècnica de Catalunya
Nikolaos Vasiloglou II, Ismion
Udi Weinsberg, Technicolor Labs
Joel Welling, Pittsburgh Supercomputing Center
Ted Willke, Intel Labs
Josh Wills, Cloudera
Joshua Vogelstein, Duke
Lei Tang, Walmart Labs
Bryan Thompson, Systap
Tao Ye, Pandora Internet Radio
Nezih Yigitbasi, Intel Labs

A preliminary list of our sponsors:

The GraphLab workshop is co-sponsored by the Linked Data Benchmark Council (LDBC), a new EU FP7 project that aims to establish industry cooperation on graph database benchmarks, benchmark practices and benchmark results. A recommended event is the SIGMOD GRADES workshop, June 23rd in NY.

Wednesday, March 20, 2013

Twitter WTF (Who to Follow) Paper

I got the following interesting paper from my collaborator Aapo Kyrola:
Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, and Reza Zadeh. WTF: The Who to Follow Service at Twitter. Proceedings of the 22nd International World Wide Web Conference (WWW 2013), May 2013, Rio de Janeiro, Brazil.

It details Twitter's "Who to Follow" recommendation service. In a nutshell, Twitter uses two algorithms: an egocentric random walk (personalized PageRank) and SALSA, which is a bipartite random walk (similar to the HITS algorithm).

In terms of infrastructure, they use Twitter's Cassovary graph processing system on top of a single multicore machine, which is rather surprising considering the size of Twitter's graph. Anyway, this shows that a proper, efficient implementation on a single multicore machine can scale to very large models.
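To make the egocentric random walk concrete, here is a minimal sketch of personalized PageRank by power iteration. It is my own toy illustration, not the paper's implementation: the function name, the toy graph, the restart probability of 0.15 and the handling of dangling nodes are all assumptions.

```python
def personalized_pagerank(graph, source, alpha=0.15, iterations=50):
    """Power iteration for a random walk that restarts at `source`.

    graph: dict mapping every node to the list of nodes it follows
           (assumption: every followed node is also a key of the dict).
    alpha: restart probability (hypothetical default, not from the paper).
    """
    rank = {node: 0.0 for node in graph}
    rank[source] = 1.0
    for _ in range(iterations):
        next_rank = {node: 0.0 for node in graph}
        for node, followed in graph.items():
            walk_mass = (1.0 - alpha) * rank[node]
            if followed:
                # Spread the walking mass evenly over outgoing edges.
                for target in followed:
                    next_rank[target] += walk_mass / len(followed)
            else:
                # Dangling node: send the walker back to the source.
                next_rank[source] += walk_mass
        # With probability alpha the walker restarts at the source.
        # The total mass is 1, so the restart mass is exactly alpha.
        next_rank[source] += alpha
        rank = next_rank
    return rank


# Toy follow graph: "a" follows "b" and "c", and so on.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
scores = personalized_pagerank(toy_graph, "a")
# Rank the other users by how often the walk from "a" visits them.
candidates = sorted((n for n in scores if n != "a"), key=scores.get, reverse=True)
```

The scores are specific to the source user, which is what makes the walk "egocentric": users close to the source in the follow graph get most of the probability mass.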

Sunday, March 17, 2013

Spotlight: Large Scale Distributed Deep Networks

Liu from Tencent sent me the following paper from Google:
Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large Scale Distributed Deep Networks, NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada, United States, December, 2012.

It uses a distributed implementation of SGD/L-BFGS for training deep networks. It is one of the largest ML deployments I have seen so far: up to 10K cores, 5K machines. In a nutshell, they factorize the problem into regions, run SGD in each region separately, and then use a central server to merge the models from the different regions. They also support asynchronous computation across the different nodes.
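Here is a toy sketch of the "run SGD separately, then merge on a central server" idea as I summarized it above. It is not the paper's algorithm: the one-parameter model y = w * x, the synchronous averaging step, the function names and the learning rate are all assumptions made for illustration, and the asynchronous part is left out entirely.

```python
def local_sgd(weight, shard, learning_rate, epochs):
    """Run plain SGD on one data shard for the model y = weight * x."""
    for _ in range(epochs):
        for x, y in shard:
            # Gradient of the squared error (weight * x - y) ** 2.
            gradient = 2.0 * (weight * x - y) * x
            weight -= learning_rate * gradient
    return weight


def train_with_central_merge(shards, rounds=20, learning_rate=0.01, epochs=1):
    """Each round: every worker starts from the central weight, runs SGD on
    its own shard, and the central server averages the resulting weights."""
    central_weight = 0.0
    for _ in range(rounds):
        local_weights = [
            local_sgd(central_weight, shard, learning_rate, epochs)
            for shard in shards
        ]
        central_weight = sum(local_weights) / len(local_weights)
    return central_weight


# Two hypothetical "workers", each holding part of the data for y = 3 * x.
toy_shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
learned_weight = train_with_central_merge(toy_shards)
```

In this sketch the workers run one after another in a single process; in a real deployment each call to local_sgd would run on a different machine and only the weights would travel to the server.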

And they did not fail to mention GraphLab :-)

We considered a number of existing large-scale computational tools for application to our problem, MapReduce [24] and GraphLab [25] being notable examples. We concluded that MapReduce, designed for parallel data processing, was ill-suited for the iterative computations inherent in deep network training; whereas GraphLab, designed for general (unstructured) graph computations, would not exploit computing efficiencies available in the structured graphs typically found in deep networks.

I am not sure I got their meaning - if anyone knows, let me know.

Tuesday, March 5, 2013

A nice collaborative filtering tutorial "for dummies"

I got from M. Burhan, one of our GraphChi users from Germany, the following link to an online book called: A Programmer's Guide to Data Mining.

There are two relevant chapters that may help beginners understand the basic concepts.
These are Chapter 2: Collaborative Filtering, and Chapter 3: Implicit Ratings and Item Based Filtering.

Intel Labs report on GraphLab vs. Mahout

I have some very interesting news to report. I got from Nezih Yigitbasi, Intel Labs, the following graph:

It compares Mahout vs. Distributed GraphLab on the popular task of matrix factorization using the ALS (alternating least squares) algorithm on Netflix data. The bottom line is that GraphLab is about 20x faster than Mahout.

And here is the exact experiment setup I got from Nezih:

N is the number of ALS iterations, D is the number of latent factors. The experiments have been conducted on a 16 node cluster.
We start GL as: mpirun -hostfile ~/hostfile -x CLASSPATH ./als --ncpus=16 --matrix hdfs://host001:19000/user/netflix --D=$LATENT_FACTOR_COUNT --max_iter=$ITER_COUNT --lambda=0.065 --minval=0 --maxval=5
To run Mahout ALS, we use the factorize-netflix.sh script under the examples directory. It should be run as: ./factorize-netflix.sh /path/to/training_set/ /path/to/qualifying.txt /path/to/judging.txt
In our test cluster we have 16 machines each with 64GB of memory, 2 CPUs (Intel(R) Xeon(R) CPU E5-2670 @ 2.60GHz [8 cores each]) and 4 x 1 TB HDDs. The machines communicate over a 10Gb Ethernet interconnect.
The Netflix dataset was split into 32 equally sized chunks and then put into HDFS.
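For readers who have not seen ALS before, here is a minimal sketch of the alternating updates. It is a toy of my own, not the GraphLab or Mahout code: I assume a single latent factor (D=1) so each least squares solve is a scalar division, and the function names and the tiny rating list are made up. The regularization default reuses the lambda=0.065 from the command above.

```python
def update_factors(ratings, fixed, count, own_index, other_index, reg):
    """Solve the regularized least squares problem for one side (users or
    items) while the other side's factors are held fixed. With one latent
    factor the solution is sum(r * f) / (reg + sum(f * f))."""
    numerators = [0.0] * count
    denominators = [reg] * count
    for entry in ratings:
        own, other, value = entry[own_index], entry[other_index], entry[2]
        numerators[own] += value * fixed[other]
        denominators[own] += fixed[other] ** 2
    return [n / d for n, d in zip(numerators, denominators)]


def als_rank_one(ratings, num_users, num_items, reg=0.065, iterations=20):
    """ratings: list of (user, item, value) for the observed entries only."""
    user_factor = [1.0] * num_users
    item_factor = [1.0] * num_items
    for _ in range(iterations):
        # Fix the items and solve for the users, then the other way around.
        user_factor = update_factors(ratings, item_factor, num_users, 0, 1, reg)
        item_factor = update_factors(ratings, user_factor, num_items, 1, 0, reg)
    return user_factor, item_factor


# Hypothetical ratings; the rating of user 1 for item 2 is unobserved.
toy_ratings = [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0), (1, 0, 2.0), (1, 1, 4.0)]
users, items = als_rank_one(toy_ratings, num_users=2, num_items=3)
predicted = users[1] * items[2]  # prediction for the missing entry
```

With D latent factors, each scalar division becomes a small D x D linear solve per user or item. Since all user updates are independent given the item factors (and vice versa), the algorithm parallelizes naturally.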

Monday, March 4, 2013

The world's coolest machine learning internships - part 2

Following the great success of last year's blog post on machine learning summer internships (more than 15,000 page views, and it was posted on Hacker News and reddit.com), I will again start collecting some of the interesting opportunities I hear about for this summer.

This is what I got from Mauricio Breternitz from AMD: AMD Research has exciting opportunities for interns who will be conducting research in one of the key areas of Systems Research (processor, software stack, architecture), supporting the development of new systems and architectures.
Required skills and interests: - Programming in Java or C++, a scripting language such as Perl or other, Linux, cloud computing - Hadoop, system-level performance analysis - Microarchitecture and system-level performance analysis: CPU utilization, disk utilization, IPC, cache behavior
Desired skills: - Data mining and machine learning (ML) concepts - Text analytics (unstructured data), parsing XML or HTML and extracting Big data - Experience with Hadoop - RESTful APIs and related coding experience. Client device protocols and virtualization experience would be highly desirable. Multiple internship projects are available. Interested candidates, please contact Mauricio Breternitz (Mauricio.breternitz@amd.com).

Would you like to use data to solve one of the world's most important problems? Udemy is on a mission to democratize education. For us, that means two things: 1) Enabling the top experts in the world to teach any student, anywhere, and 2) Radically lowering the price point on a top quality education. With over 6,000 courses published and 600,000 students taking courses on Udemy, we're on our way, but we need your help! We are a technology company at our core, so we track every single bit of learning data. We are looking for a machine learning contractor / intern to use this data to work on problems like: Which courses are the best fit for a particular student? What makes a student complete a course vs. drop out? Who is most likely to enroll in a 2nd course? If you're interested, please shoot us an email at jobs@udemy.com with subject "Machine Learning Contractor". We'd love to hear from you.

I got the following open positions from Srinivassan Soundar:

Health Informatics

Online Learning

Learning with Noisy Labels

Active Learning

They also have a few full-time positions:

Active Learning, High Performance Data Mining, and Model Verification and Validation.

Want to work on the largest scale music recommender and playlisting engine in the US? Pandora's playlist team is building the next innovations in recommendation algorithms that help hundreds of millions of listeners discover music they love. We're looking for 1-2 interns (preferably 2nd or 3rd year Ph.D. students) who have a research interest in ML and are passionate about music. We are also open to a longer term research collaboration with universities. Potential topics include personalization, scalable algorithms, real time and effective recommendation measurement, etc. This position is already filled.

My collaborator Ted Willke sent me the following: We’d like to perform a comparative study of various graph-based programming models for machine learning algorithms this summer, looking at GraphLab, D4M, Galois, and possibly others. We’d love to have a smart and passionate graduate student join us for 3 months (minimum). Contact Ted: theodore.l.willke@intel.com

My friend Udi Weinsberg from Technicolor brought to my attention that Technicolor is also looking for interns. Technicolor's Palo Alto research lab studies personalized computing, data privacy and recommendation systems. You can apply here.

I got the following from Jan Neumann from Comcast, who is looking for interns: Industry-leading research in audio/video information retrieval and content discovery technologies to help millions of households discover video and music content on their TV, PC, phone, and mobile devices. Comcast's Washington DC research lab is looking to fill 2-3 graduate student intern positions for this summer (minimum of 12 weeks, May through September). Projects can focus on using Social Networks for Recommendations and Click-through Prediction, NLP for Voice-based Interfaces, and/or Video Search/Segmentation of premium video. Read more.

This is what I got from Grant Ingersoll, a well known Mahout contributor: LucidWorks, the leading commercial company for Apache Lucene and Solr, is looking for interns to work on building next generation search, analytics and machine learning technologies based on Apache Solr, Mahout, Hadoop and other cutting edge capabilities. This internship will be practically focused on working on real problems in search and machine learning as they relate to Lucid products and technologies as well as open source. Interested students (see eligibility below) should send their resume/profile, course work and evidence of open source activity (github account, ASF patches or other, etc.) to careers@lucidworks.com.

Walt Disney Animation Studios has a summer internship for improving the quality of animation data. Apply here.

Please note this position was already filled. I will post more positions as they come through.

I'm planning on hosting an intern for Summer 2013. The project will be related to online learning and large data, motivated by click-through-rate prediction for ads targeting. I'm also interested in the interaction between machine learning and auction mechanisms. The plan is to do something publishable with the goal of getting a paper out. The details of the project are flexible based on the skills and interests of the intern. Ideally I'm looking for someone who already has a strong background in online learning, optimization, or auction theory. Contact Brendan McMahan.

The Systems department at the IBM T.J. Watson Research Center (Yorktown Heights, NY) has an opening for a Summer Research Intern. The candidate will be conducting research in one of the key areas of Systems Research that will lead the development of new systems. The candidate will have the opportunity to work on real systems while pursuing innovative research of both industrial and academic interest.
Required skills: - Programming in Java or C++, a scripting language such as Perl or other, SQL - Performance tuning under Linux and at least one other Unix environment
Desired skills: - DBMS query performance tuning, implementing UDFs (user defined functions) - Data mining and machine learning (ML) concepts; hands-on experience with R, SPSS, SAS - Fraud detection (models, algorithms, architectures for real-time fraud detection) - Text analytics (unstructured data), parsing XML or HTML and extracting data - Experience with Hadoop - RESTful APIs and related coding experience.
Contact: yefim@us.ibm.com
