Tuesday, March 31, 2015

MLConf Seattle

MLConf Seattle is coming up on May 1st. Readers of my blog are welcome to use a 20% discount code. Among the interesting speakers are Carlos Guestrin, Dato's founder & CEO and a professor at the University of Washington; Xavier Amatriain, previously at Netflix and now at Quora; Josh Wills, Director of Data Science at Cloudera; and Misha Bilenko from Microsoft Azure ML.

Tuesday, March 24, 2015

Data Science Summit - why should you care?


The Data Science Summit is a non-profit event organized by Intel, Comcast, Pandora, Dato, Cloudera and O’Reilly Media. The Summit brings together researchers and data scientists from academia and industry to discuss state-of-the-art data science, applied machine learning and predictive applications. The conference agenda was co-created with Dr. Ben Lorica, Chief Scientist of O’Reilly Media, who serves as the content manager of the O’Reilly Strata conferences.

We are expecting 1,000 data scientists to attend on Monday, July 20 in SF, as this year we were able to bring together an amazing group of data science leaders. We have speakers from four major data science domains:
  • Infrastructure
  • Data engineering
  • Machine learning and predictive applications 
  • Visualization of big data

From the infrastructure viewpoint, Prof. Mike Franklin (UC Berkeley) is the Director of the Berkeley AMPLab and a co-founder of Databricks, the cloud service hosting Spark. Dr. Misha Bilenko is a senior researcher at Microsoft, working on Microsoft Azure ML, a machine learning cloud service. Ron Kasabian, VP Big Data at Intel, will cover Intel's efforts in the data science domain. Prof. Alex Smola is the creator of the Parameter Server, an efficient distributed infrastructure for ML applications deployed at Google and other companies.

We call data engineering the data cleaning and transformation that needs to happen before we can apply machine learning methods. Wes McKinney is the creator of the popular pandas Python data science package, and he recently sold his startup to Cloudera. Pandas offers a lot of slicing and dicing operations which help with quick data science (a small sketch follows below). Prof. Jeff Heer (UW) is the creator of d3.js, the popular visualization software, and also a co-founder of Trifacta, a data engineering startup. Trifacta lets you visually specify complex data transformations that are later executed on a cluster. Dr. John Mount is the author of the popular book "Practical Data Science with R".
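For readers less familiar with pandas, here is a minimal sketch of the kind of slicing and dicing it makes easy (the columns and data below are made up for illustration):

    import pandas as pd

    # A toy event log (hypothetical columns, for illustration only)
    df = pd.DataFrame({
        'user':   ['a', 'b', 'a', 'c', 'b'],
        'action': ['click', 'view', 'click', 'view', 'click'],
        'ms':     [120, 340, 95, 410, 230],
    })

    clicks = df[df['action'] == 'click']            # boolean row filter
    per_user = clicks.groupby('user')['ms'].mean()  # group and aggregate
    print(per_user.sort_values(ascending=False))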

D3.js visualization software
From the machine learning aspect, Prof. Carlos Guestrin (UW) is the founder and CEO of Dato, our popular big data analytics framework. Prof. Mike Jordan (Berkeley) is famous for his work on neural networks, graphical models (specifically variational inference) and Bayesian non-parametric statistics. In recent years he has been working on statistical methods in big data. His recent Reddit AMA appearance (in which he bashed deep learning) generated a lot of chatter. Prof. Christopher Re (Stanford) has many applied works in this domain; one of the recent ones is DeepDive, a system which utilizes domain-specific knowledge and user feedback to improve modeling and predictions. Prof. Robert (Rob) Tibshirani (Stanford) is famous for his sparse L1 regression work (Lasso); a small illustration follows below. Prof. Lisa Getoor (Univ. of Maryland) is the author of the popular book Introduction to Statistical Relational Learning. Prof. Jure Leskovec (Stanford) is known for his social network research. Recently he sold his startup Kosei to Pinterest.
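As a quick refresher on the Lasso, here is a minimal scikit-learn sketch on synthetic data (my own illustration): the L1 penalty drives most coefficients to exactly zero, recovering a sparse model.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.RandomState(0)
    X = rng.randn(100, 20)
    w = np.zeros(20)
    w[:3] = [2.0, -1.5, 0.5]             # only 3 informative features
    y = X.dot(w) + 0.1 * rng.randn(100)

    model = Lasso(alpha=0.1).fit(X, y)   # L1-regularized least squares
    print(np.count_nonzero(model.coef_)) # only a few nonzero weights survive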


In terms of predictive applications, Dr. Tao Ye is a senior scientist at Pandora Internet Radio working on their recommendation engine. Dr. Jan Neumann is manager of recommendations at Comcast. Joe Reisinger is CTO and co-founder of Premise Data, a mobile data collection platform. Dr. Soundar Srinivasan is a senior researcher at Bosch Research working on industrial sensor data.


Large scale social network visualization by Uncharted

We also plan to give the stage to a few younger startups that are working on groundbreaking research. Dr. Leo Meyerovich from Graphistry will discuss GPU-aided visualization of graphs that were too big to visualize before.
Rob Harper, lead architect at Uncharted, will also discuss hierarchical graph visualization.

Dr. Kira Radinsky is the founder of SalesPredict, a sales lead ranking startup. A related and bigger company is C9 Inc., whose CTO, Andy Twigg, will also give a talk about their data science. Prof. Eyal Amir (UIUC) will present ParkNav, a startup that helps drivers find parking. Alec Radford, head of research at indico.io, will present their cloud-hosted deep learning company. Paul Dix, CEO of InfluxDB, will present their time series database.

There is still an opportunity to get involved! Send me a note if you would like to speak at or sponsor the event.

A special hotel discount is available for our guests. 

Saturday, March 21, 2015

Graphistry: large scale data visualization

I connected with Leo Meyerovich for a quick overview of Graphistry.

Who is behind Graphistry?
Graphistry spun out of UC Berkeley's Parallel Computing Lab last year. It stems from my Ph.D. on the first parallel web browser (Mozilla and others are building new browsers around those ideas) and from Matt Torok (my RA), who built Superconductor, a GPU scripting language for big interactive data visualizations.

What does Graphistry do?
Graphistry scales and streamlines visual analysis of big graphs. Think answering questions about people (intelligence, sales, marketing), about things (data centers, sensors), and combinations of them (e.g., financial transactions). For example, we used it to crack a 70K+ node botnet a couple of days ago. Our tool immediately revealed the accounts involved, their different roles, especially key accounts, and, after 30 minutes of interactive analysis & googling, the credit card & passport theft operation it funneled to. Most tools can only sensibly show hundreds of nodes, and a couple of open source ones handle tens of thousands, but we're already pushing 100X more than that.

How do you use GPU?
We're taking the last 20 years of infoviz research out of papers and into accessible tools by (a) powering them with big yet economical clusters of GPUs and (b) prioritizing interaction design. The GPU side is cool. For example, our unusual backend has JavaScript orchestrating our GPU cluster via node-opencl. Likewise, we take advantage of recent breakthroughs, including our own, in optimizing irregular graph algorithms on GPUs for multiple magnitudes more data & speed. With all this power, we're deploying atypically smart visualizations that take advantage of computationally intensive machine learning and physics algorithms. On top of that, we're adding interactive analysis tools that, till now, were impossible. I can write so much here!
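For readers curious what host-side GPU orchestration looks like, here is a minimal PyOpenCL sketch. This is not Graphistry's code (they drive OpenCL from JavaScript via node-opencl); the toy kernel below just adds two vectors on the GPU.

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    a = np.random.rand(1000000).astype(np.float32)
    b = np.random.rand(1000000).astype(np.float32)

    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    # The kernel runs on the GPU; the host language only orchestrates it.
    prg = cl.Program(ctx, """
    __kernel void add(__global const float *a,
                      __global const float *b,
                      __global float *out) {
        int i = get_global_id(0);
        out[i] = a[i] + b[i];
    }
    """).build()

    prg.add(queue, a.shape, None, a_buf, b_buf, out_buf)
    out = np.empty_like(a)
    cl.enqueue_copy(queue, out, out_buf)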

What is your business model?
We currently work closely with customers on big problems (contact me if this sounds relevant). We’re actively working towards self-serve analyst tools for a couple industries, and want to share our APIs with internal dev teams and analytics providers to build tools for their more unique problems.

What is your target audience?
We currently like problems in IT (e.g., making sense of activity in big networks or many endpoints) and various security problems. We're starting to expand into problems in finance (e.g., risk, fraud) and sales/marketing (social & business networks).

Can you share some demo links?
I can’t yet share the interactive versions, but here’s a screenshot:


Are you looking for funding?
As you can probably attest, startup life is intense. We’re more interested in collaborating on good problems right now. 

Are you hiring?  
Graphistry is currently 5 Berkeley engineers — a mix of language designers, compiler builders, GPU hackers, and web devs — and that’s it! We'd especially love to talk to any frontend and data viz engineers about designing big interactive visualizations & tools that were previously impossible. Consider yourself invited to our new Oakland office for an amazing show-and-tell.

Anyone interested in watching some live Graphistry demos is welcome to join our Data Science Summit, July 20 in SF.

Wednesday, March 18, 2015

Text by the Bay

I got a note from Alexy Khrabrov about an interesting text analytics conference he is organizing: Text by the Bay, April 24-25 in the Bay Area.

My readers are welcome to use the discount code TEXTDENNY, which gives $250 off until 3/31.

Lecture videos will be made available online.



Friday, March 13, 2015

Data Science Summit - Registration is Open!



The Data Science Summit is a non-profit event organized by Intel, Comcast, Pandora, Dato, Cloudera and O’Reilly Media. The Summit brings together researchers and data scientists from academia and industry to discuss state-of-the-art data science, applied machine learning and predictive applications. The conference agenda was co-created with Dr. Ben Lorica, Chief Scientist of O’Reilly Media, who serves as the content manager of the O’Reilly Strata conferences.

Confirmed speakers (preliminary list!)

Prof. Alex Smola - Google & Carnegie Mellon University
Prof. Jeff Heer - D3.js, Trifacta & University of Washington
Prof. Carlos Guestrin - Dato & University of Washington
Prof. Chris Re - Stanford University
Prof. Jure Leskovec - Pinterest & Stanford University
Prof. Mike Franklin - AMPLab UC Berkeley
Prof. Mike Jordan - UC Berkeley
Dr. Andreas Müller - Scikit-learn & NYU
Wes McKinney - pandas & Cloudera
Dr. Misha Bilenko - Microsoft Azure ML
Dr. Tao Ye - Pandora
Dr. Jan Neumann - Comcast


$150 early bird discount until April 1st.

Additionally, use the discount code DannysBlog to get an additional $50 off!




Thursday, March 5, 2015

FlashGraph: graph engine on a multicore machine with SSD array

I got this interesting paper about FlashGraph from my friend Yaron Weinsberg @ IBM. FlashGraph is a high-performance graph engine which utilizes an SSD array in a smart way to scale to very large datasets. It shows some impressive performance results on a graph with 130 billion edges.

indico.io: machine learning as a service



With the growing excitement around machine learning technologies, I connected with Gideon Wulfsohn, who works at indico, to better understand what they do.

0) A couple of sentences about indico


  • indico helps individuals, small to medium-sized teams, and businesses translate their community’s pictures, documents and conversations into insightful feedback in minutes. Built with real-life data and tailored to what you need, our pre-trained models balance accuracy and speed, allowing you to use powerful machine learning in realistic settings.


1) What is the indico business model and user license?


  • The model is to build API endpoints that allow developers to rapidly prototype and deploy solutions within a predictive application. We love chatting with our users and are always looking to improve what we offer to better fit their needs.
  • Staying up to date with the latest research papers is a huge part of our development process, thus ensuring all models are tuned to industry standards.
  • We also have a private cloud offering for enterprise.


2) Which underlying deep learning toolkits do you use?


  • As Python continues to gain steam within the data science community, Theano has stepped up as our de facto numerical computing library. Theano is designed to evaluate mathematical expressions quickly, using dynamic C code generation (a minimal sketch follows below).
  • We have developed abstraction layers for interfacing with Theano to build models using convnets and RNNs. We have open sourced the RNN abstraction layer under the name Passage.
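To illustrate the Theano workflow (my own minimal sketch, not indico's code): you declare a symbolic expression, and theano.function compiles it into fast native code behind the scenes.

    import theano
    import theano.tensor as T

    x = T.dmatrix('x')
    y = 1 / (1 + T.exp(-x))              # symbolic logistic function
    logistic = theano.function([x], y)   # compiled via generated C code

    print(logistic([[0.0, 1.0], [-1.0, -2.0]]))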


3) Do you support GPU computation?


  • Thanks to Theano, all of our models can run on either a CPU or a GPU (see the illustration below).
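For context, Theano selects the device through its configuration flags, so the same model code runs on CPU or GPU unchanged; a minimal illustration:

    # Run the same script on CPU or GPU by setting Theano's flags, e.g.
    #   THEANO_FLAGS=device=gpu,floatX=float32 python train.py
    import theano
    print(theano.config.device)   # 'cpu' or 'gpu', depending on the flags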


4) What tricks do you use when building models using deep learning?




5) Who is your target user? Do I have to be a deep learning expert? A programmer? A business analyst?


  • The target user is anyone with a bit of programming/problem-solving chops looking to answer a burning question. This often comes in the form of developers, entrepreneurs, hackathon goers, and small to medium-sized businesses.


6) Which programming languages do you support?


  • Java, JavaScript, Objective-C, PHP, Python, R, Mashape, and Ruby


7) What is the typical dataset size at which you find deep learning to be effective? How many images?

  • As a rule of thumb, 100,000 examples is a good starting point for training a model from scratch.

Wednesday, March 4, 2015

BIDMach: faster machine learning on a single machine

I got this from my colleague Yucheng Low. BIDMach is a new open source project (BSD-3 license) which speeds up basic machine learning algorithms using GPUs.

Some impressive performance numbers on Github. For example:

Criteo Dataset

Criteo released a medium-sized (12 GB) dataset with a single binary target (click / no click) and a very sparse set of features. This is representative of many click prediction tasks in industry.
System    Nodes/Cores   npasses   AUC    Time   Cost    Energy (KJ)
Spark     8/32          10        0.62   964s   $0.64   1500
Spark     32/128        10        0.62   400s   $1.00   2500
BIDMach   1             1         0.66   81s    $0.01   6
BIDMach   1             10        0.72   805s   $0.13   60
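To make the task concrete, here is a generic scikit-learn sketch of sparse click prediction evaluated by AUC, on synthetic data (this is not BIDMach's API, which is Scala-based):

    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    # Synthetic stand-in for sparse click features (1% of entries nonzero)
    rng = np.random.RandomState(0)
    X = sparse_random(10000, 1000, density=0.01, format='csr', random_state=rng)
    w = rng.randn(1000)
    y = (X.dot(w) > 0).astype(int)             # synthetic click labels

    model = LogisticRegression().fit(X, y)     # accepts sparse input
    print(roc_auc_score(y, model.predict_proba(X)[:, 1]))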

IBM buys AlchemyAPI

I got this from my colleague Piotr Teterwak: IBM acquired AlchemyAPI, a deep learning startup focused on identifying objects in images and on text classification. Previously, I wrote about AlchemyAPI here.

Monday, March 2, 2015

SkyMind: A new machine learning startup to support deeplearning4j

Yesterday I connected with Adam Gibson and Chris Nicholson from SkyMind, a new startup built around support and maintenance of deeplearning4j, one of the popular deep learning packages. To all the VCs who are reading this blog, please note that SkyMind is looking for funding.

What is SkyMind?

Skymind is the commercial support arm of Deeplearning4j, the distributed open-source deep-learning framework for the JVM. Adam Gibson created Deeplearning4j, and has spoken at Hadoop Summit, OSCon, Tech Planet and elsewhere. He's the author of the forthcoming O'Reilly book "Deep Learning: A practitioner's guide."

What is deeplearning4j license? what is SkyMind business model?

Deeplearning4j, ND4J (our scientific computing library) and Canova (vectorization) are Apache 2.0 licensed, which gives users IP protection on derivative works they create with our software.

Skymind builds "Google brains" for industry. Our software works cross-platform from server to desktop to mobile, and handles every major data type: image, sound, time series, text and video. What Red Hat is to Linux, we are to Deeplearning4j. 

Which distributed systems does deeplearning4j support? (Hadoop, Spark, YARN?)

YARN and Spark. We also allow users to create standalone distributed systems using Akka and AWS.

Can GPU mode run distributed? Can you support multiple GPUs? 

No InfiniBand yet, but it can do internode coordination and leverage GPUs via ND4J.

From your experience, what is the typical speedup of GPU vs. CPU?

We're finishing benchmarks now. We just implemented matrix caching and raw CUDA, and will know more numbers soon (we plan to benchmark GPU matrix caching with Spark).

What are the most powerful deep learning methods implemented in deeplearning4j? What are their typical use cases?

Sentiment analysis for text (which has applications for CRM and reputation management); image and facial recognition, which has wide consumer and security applications; sound/voice analysis, which is useful for speech-to-text and voice search; time series analysis, which is useful for predictive analytics and anomaly detection in finance, manufacturing and hardware.
                  
Who is your target user? Do I have to be a deep learning expert?

The entry-level data scientist who needs to productionize an algorithm focusing on unstructured data, where traditional feature engineering methods have fallen over. Familiarity with machine learning ideas will help, but it's not necessary to get started. We introduce most of the crucial ideas on our website.
                
Which programming language interfaces do you support?

Java/Scala right now. We'll have a Bash command-line interface that loads models via JSON.

There are a few other deep learning libraries like Theano and Caffe. Can you outline the benefits of deeplearning4j (either in terms of accuracy or speed or distribution?)

Caffe was created by a PhD candidate at Berkeley. It specializes in machine vision and is C/C++ based. Deeplearning4j is commercially supported, handles all data types (not just images), and is based on the JVM, which means it works easily cross-platform.

Theano/PyLearn2 is written in Python and likewise serves the research community. It is useful for prototyping and widely used, but most people who create a working net in Python need to rewrite it for production. Deeplearning4j is production-grade from the get-go.

Theano allows you to build your own nets, but the generated gradients can be slow, and Theano is also harder to get up and running cross-platform. As for Caffe, we integrate better.

Theano and Caffe are released under a BSD license that does not include a patent claim and retaliation clause, which means they do not offer the same protections as Apache 2.0. 

What is the typical dataset size at which you find deep learning to be effective? How many images?

You don't need very much data for deep learning as long as you tune it right (dropout, rectified linear units, etc.). It also depends on the problem you're solving. If you're training a joint distribution over images and text, for example, you may want more. For simple classification, you can get away with a more tuned algorithm (i.e., one more robust to overfitting).

How do you deal with classification of imbalanced classes?
    
We sample with replacement, and apply random DropOut and DropConnect between layers to learn different features.
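For readers unfamiliar with the first trick, here is a minimal numpy sketch of rebalancing a skewed dataset by sampling the minority class with replacement (my own illustration, not SkyMind's code):

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.randn(1000, 5)
    y = (rng.rand(1000) < 0.05).astype(int)    # ~5% minority class

    minority = np.where(y == 1)[0]
    majority = np.where(y == 0)[0]

    # Oversample the minority class with replacement to match the majority
    resampled = rng.choice(minority, size=len(majority), replace=True)
    idx = np.concatenate([majority, resampled])
    X_bal, y_bal = X[idx], y[idx]
    print(y_bal.mean())                        # now ~0.5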

Besides classifying images into labels, can you identify object locations in images? Can you find similar images?

With enough data, yes.