Friday, April 24, 2015

Making Visual Data a First-Class Citizen

"Above all, don't lie to yourself. The man who lies to himself and listens to his own lie comes to a point that he cannot distinguish the truth within him, or around him, and so loses all respect for himself and for others. And having no respect he ceases to love." ― Fyodor Dostoyevsky, The Brothers Karamazov


City Forensics: Using Visual Elements to Predict Non-Visual City Attributes

To appreciate the power and beauty of machine learning algorithms, especially when they are applied to the visual world, let's take a look at three recent applications of learning-based "computer vision" to computer graphics. Researchers in computer graphics are known for producing truly captivating illustrations of their results, so this post is going to be very visual. Now is your chance to sit back and let the pictures do the talking.

Can you predict things simply by looking at street-view images?

Let's say you're going to visit an old friend in a foreign country for the first time. You've never visited this country before and have no idea what kind of city/neighborhood your friend lives in. So you decide to get a sneak peek -- you enter your friend's address into Google Street View.

Most people can look at Google Street View images in a given location and estimate attributes such as "sketchy," "rural," "slum-like," "noisy" for the given neighborhood. TLDR; A person is a pretty good visual recommendation engine.

Can you predict if this looks like a safe location? 
(Screenshot of Street view for Manizales, Colombia on Google Earth)

Can a computer program predict things by looking at images? If so, then these kinds of computer programs could be used to automatically generate semantic map overlays (see the crime prediction overlay from the first figure), help organize fast-growing cities (computer vision meets urban planning?), and ultimately bring about a new generation of match-making "visual recommendation engines" (a whole suite of new startups).

Before I discuss the research paper behind this idea, here are two cool things you could do (in theory) with a non-visual data prediction algorithm. There are plenty of great product ideas in this space -- just be creative.

Startup Idea #1: Avoiding sketchy areas when traveling abroad 
A personalized location recommendation engine could be used to find locations in a city that I might find interesting (techie coffee shop for entrepreneurs, a park good for frisbee) subject to my constraints (near my current location, in a low-danger area, low traffic).  Below is the kind of place you want to avoid if you're looking for a coffee and a place to open up your laptop to do some work.

Google Street Maps, Morumbi São Paulo: slum housing (image from geographyfieldwork.com)

Startup Idea #2: Apartment Pricing and Marketing from Images
Visual recommendation engines could be used to predict the best images to represent an apartment for an Airbnb listing.  It would be great if Airbnb had a filter that let you upload a video of your apartment and predicted the set of static images that best depicts your apartment and maximizes its earning potential. I'm sure Airbnb users would pay a small extra charge for this feature. The same computer vision prediction idea can be applied to home pricing on Zillow, Craigslist, and anywhere else that pictures of for-sale items are shared.

Google image search result for "Good looking apartment". Can computer vision be used to automatically select pictures that will make your apartment listing successful on Airbnb?

Part I. City Forensics: Using Visual Elements to Predict Non-Visual City Attributes


The Berkeley Computer Graphics Group has been working on predicting non-visual attributes from images, so before I describe their approach, let me discuss how Berkeley's Visual Elements relate to Deep Learning.

Predicting Chicago Thefts from San Francisco data. Predicting Philadelphia Housing Prices from Boston data. From City Forensics paper.



Deep Learning vs Mid-level Patch Discovery (Technical Discussion)
You might think that non-visual data prediction from images (if even possible) will require a deep understanding of the image and thus these approaches must be based on a recent ConvNet deep learning method. Obviously, knowing the locations and categories associated with each object in a scene could benefit any computer vision algorithm.  The problem is that such general purpose CNN recognition systems aren't powerful enough to parse Google Street View images, at least not yet.

Another extreme is to train classifiers on entire images.  This was initially done when researchers were using GIST, but there are just too many nuisance pixels inside a typical image, so it is better to focus your machine learning on a subset of the image.  But how do you choose the subset of the image to focus on?

There exist computer vision algorithms that can mine a large dataset of images and automatically extract meaningful, repeatable, and detectable mid-level visual patterns. These methods are not label-based and work really well when there is an underlying theme tying together a collection of images. The set of all Google Street View Images from Paris satisfies this criterion.  Large collections of random images from the internet must be labeled before they can be used to produce the kind of stellar results we all expect out of deep learning. The Berkeley Group uses visual elements automatically mined from images as the core representation.  Mid-level visual patterns are simply chunks of the image which correspond to repeatable configurations -- they sometimes contain entire objects, parts of objects, and popular multiple object configurations. (See Figure below)  The mid-level visual patterns form a visual dictionary which can be used to represent the set of images. Different sets of images (e.g., images from two different US cities) will have different mid-level dictionaries. These dictionaries are similar to "Visual Words" but their creation uses more SVM-like machinery.

The patch mining algorithm is known as mid-level patch discovery. You can think of mid-level patch discovery as a visually intelligent K-means clustering algorithm, but for really really large datasets. Here's a figure from the ECCV 2012 paper which introduced mid-level discriminative patches.

Unsupervised Discovery of Mid-Level Discriminative Patches

Unsupervised Discovery of Mid-Level Discriminative Patches. Saurabh Singh, Abhinav Gupta and Alexei A. Efros. In European Conference on Computer Vision (2012).
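To make the patch-discovery idea above more concrete, here is a minimal Python sketch of its K-means-with-SVMs spirit: sample HOG-described patches from a target image collection, cluster them, and keep only the clusters that a linear SVM can cleanly separate from a generic "rest of the world" patch set. This is a toy illustration under those assumptions, not the iterative cross-validation procedure from the Singh et al. paper.

# A toy sketch of the spirit of mid-level patch discovery: cluster HOG patches
# from a target image set, then keep the clusters whose members a linear SVM
# can cleanly separate from a generic "world" of patches. The real algorithm
# iterates SVM training with cross-validation; this is only an illustration.
import numpy as np
from skimage.feature import hog
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def random_patch_descriptors(images, patch=64, per_image=50, seed=0):
    # Sample square patches from grayscale images, describe each with HOG.
    rng = np.random.default_rng(seed)
    descs = []
    for img in images:
        H, W = img.shape
        for _ in range(per_image):
            y, x = rng.integers(0, H - patch), rng.integers(0, W - patch)
            descs.append(hog(img[y:y + patch, x:x + patch],
                             pixels_per_cell=(8, 8), cells_per_block=(2, 2)))
    return np.array(descs)

def discover_visual_elements(city_images, world_images, n_clusters=30, keep=10):
    pos = random_patch_descriptors(city_images)    # e.g. Paris street views
    neg = random_patch_descriptors(world_images)   # everything else
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(pos)
    scores = []
    for k in range(n_clusters):
        members = pos[km.labels_ == k]
        if len(members) < 5:
            scores.append(-np.inf)
            continue
        X = np.vstack([members, neg])
        y = np.hstack([np.ones(len(members)), np.zeros(len(neg))])
        clf = LinearSVC(C=0.1).fit(X, y)
        scores.append(clf.score(X, y))   # how cleanly the cluster separates
    best = np.argsort(scores)[::-1][:keep]
    return km, best                      # indices of "discriminative" clusters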

I should also point out that non-final layers in a pre-trained CNN could also be used for representing images, without the need to use a descriptor such as HOG. I would expect the performance to improve, so the question is perhaps: how long until somebody publishes an awesome unsupervised CNN-based patch discovery algorithm? I'm sure a handful of researchers are already working on it. :-)

Related Blog Post: From feature descriptors to deep learning: 20 years of computer vision
The City Forensics paper from Berkeley tries to map the visual appearance of cities (as obtained from Google Street View Images) to non-visual data like crime statistics, housing prices and population density.  The basic idea is to 1.) mine discriminative patches from images and 2.) train a predictor which can map these visual primitives to non-visual data. While the underlying technique is that of mid-level patch discovery combined with Support Vector Regression (SVR), the result is an attribute-specific distribution over GPS coordinates.  Such a distribution should be appreciated for its own aesthetic value. I personally love custom data overlays.

City Forensics: Using Visual Elements to Predict Non-Visual City Attributes. Sean Arietta, Alexei A. Efros, Ravi Ramamoorthi, Maneesh Agrawala. In IEEE Transactions on Visualization and Computer Graphics (TVCG), 2014.
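For a feel of what the second stage might look like in code, here is a hedged sketch (not the authors' implementation): assume each street-view image has already been reduced to a vector of detection scores for the mined visual elements, and fit a Support Vector Regressor to a non-visual attribute such as a per-location theft rate. The file names and variables are placeholders.

# Hedged sketch of mapping mined visual elements to a non-visual attribute.
# element_scores.npy and theft_rate.npy are hypothetical placeholders: one row
# of detector responses per street-view image, and the attribute value at that
# image's GPS location.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

element_scores = np.load("element_scores.npy")   # (n_images, n_visual_elements)
theft_rate = np.load("theft_rate.npy")           # (n_images,)

X_tr, X_te, y_tr, y_te = train_test_split(element_scores, theft_rate,
                                          test_size=0.25, random_state=0)
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X_tr, y_tr)
print("held-out R^2:", reg.score(X_te, y_te))

# To paint an attribute map, predict at a dense grid of GPS points using the
# street-view imagery nearest each point, then overlay the predictions.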


Part II. The Selfie 2.0: Computer Vision as a Sidekick


Sometimes you just want the algorithm to be your sidekick. Let's talk about a new and improved method for using vision algorithms and the wisdom of the crowds to select better pictures of your face. While you might think of an improved selfie as a silly application, you do want to look "professional" in your professional photos, sexy in your "selfies" and "friendly" in your family pictures. An algorithm that helps you get the desired picture is an algorithm the whole world can get behind.



Attractiveness versus Time. From MirrorMirror Paper.

The basic idea is to collect a large video of a single person which spans different emotions, times of day, different days, or whatever condition you would like to vary.  Given this video, you can use crowdsourcing to label frames based on a property like attractiveness or seriousness.  Given these labeled frames, you can then train a standard HOG detector and predict one of these attributes on new data. Below is a figure which shows the 10 best shots of the child (lots of smiling and eye contact) and the worst 10 shots (bad lighting, blur, red-eye, no eye contact).


10 good shots, 10 worst shots. From MirrorMirror Paper.

You can also collect a video of yourself as you go through a sequence of different emotions, get people to label frames, and build a system which can predict an attribute such as "seriousness".

Faces ranked from Most serious to least serious. From MirrorMirror Paper.
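If you want to try the recipe yourself, here is a minimal sketch under a few assumptions: you already have face crops extracted from your video, each with a crowdsourced attribute score, and you are content with HOG features plus ridge regression as a stand-in for the paper's actual detector pipeline. File names are hypothetical.

# Minimal attribute-ranking sketch: HOG features + ridge regression trained on
# crowdsourced frame scores, then used to rank new frames. A stand-in for the
# Mirror Mirror pipeline, not a reimplementation of it.
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.io import imread
from skimage.transform import resize
from sklearn.linear_model import Ridge

def describe(path, size=(128, 128)):
    face = resize(rgb2gray(imread(path)), size)
    return hog(face, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

frame_paths = ["frame_%04d.png" % i for i in range(500)]   # placeholder frames
crowd_scores = np.load("seriousness_scores.npy")           # placeholder labels

X = np.array([describe(p) for p in frame_paths])
model = Ridge(alpha=1.0).fit(X, crowd_scores)

# Rank fresh frames by predicted "seriousness" and keep the best shots.
new_frames = ["new_%04d.png" % i for i in range(100)]
ranked = sorted(new_frames,
                key=lambda p: model.predict([describe(p)])[0], reverse=True)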


In this work, labeling was necessary for taking better selfies.  But if half of the world is taking pictures, while the other half is voting pictures up and down (or Tinder-style swiping left and right), then I think the data collection and data labeling effort won't be a big issue in years to come. Nevertheless, this is a cool way of scoring your photos. Regarding consumer applications, this is something that Google, Snapchat, and Facebook will probably integrate into their products very soon.

Mirror Mirror: Crowdsourcing Better Portraits. Jun-Yan Zhu, Aseem Agarwala, Alexei A. Efros, Eli Shechtman and Jue Wang. In ACM Transactions on Graphics (SIGGRAPH Asia), 2014.

Part III. What does it all mean? I'm ready for the cat pictures.


This final section revisits an old, simple, and powerful trick in computer vision and graphics. If you know how to compute the average of a sequence of numbers, then you'll have no problem understanding what an average image (or "mean image") is all about. And if you've read this far, don't worry, the cat picture is coming soon.

Computing average images (or "mean" images) is one of those tricks that I was introduced to very soon after I started working at CMU.  Antonio Torralba, who has always had "a few more visualization tricks" up his sleeve, started computing average images (in the early 2000s) to analyze scenes as well as datasets collected as part of the LabelMe project at MIT. There's really nothing more to the basic idea beyond simply averaging a bunch of pictures.
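If you want to try it, the whole trick fits in a few lines (a rough sketch; file names are placeholders, and an AverageExplorer-style result would add an alignment step before averaging):

# Mean image in a nutshell: resize everything to a common shape and average
# pixel-wise. Without alignment you get a soft blur; with even rough alignment
# (e.g. cropping around a detected face) the average gets surprisingly crisp.
import numpy as np
from skimage.io import imread
from skimage.transform import resize

def mean_image(paths, shape=(256, 256)):
    acc = np.zeros(shape + (3,), dtype=np.float64)
    for p in paths:
        img = resize(imread(p), shape, anti_aliasing=True)  # floats in [0, 1]
        if img.ndim == 2:                                   # grayscale -> RGB
            img = np.stack([img] * 3, axis=-1)
        acc += img[..., :3]
    return acc / len(paths)

avg_cat = mean_image(["cat_%03d.jpg" % i for i in range(100)])  # hypothetical files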

Teaser Image from AverageExplorer paper.

Usually this kind of averaging is done informally in research, to make some throwaway graphic, or make cool web-ready renderings.  It's great seeing an entire paper dedicated to a system which explores the concept of averaging even further. It took about 15 years of use until somebody was bold enough to write a paper about it. When you perform a little bit of alignment, the mean pictures look really awesome. Check out these cats!



Aligned cat images from the AverageExplorer paper. 
I want one! (Both the algorithm and a Platonic cat)

The AverageExplorer paper extends simple image averaging with some new tricks which make the operations much more effective. I won't say much about the paper (the link is below), just take a peek at some of the coolest mean cats I've ever seen (visualized above) or a jaw-dropping way to look at community collected landmark photos (Oxford bridge mean image visualized below).

Aligned bridges from AverageExplorer paper. 
I wish Google would make all of Street View look like this.


Averaging images is a really powerful idea.  Want to know what your magical classifier is tuned to detect?  Compute the top detections and average them.  Soon enough you'll have a good idea of what's going on behind the scenes.

Conclusion


Allow me to mention the mastermind that helped bring most of these vision+graphics+learning applications to life.  There's an inimitable charm present in all of the works of Prof. Alyosha Efros -- a certain aesthetic that is missing from 2015's overly empirical zeitgeist.  He used to be at CMU, but recently moved back to Berkeley.

Being able to summarize several years' worth of research into a single computer-generated graphic can go a long way to making your work memorable and inspirational. And maybe our lives don't need that much automation.  Maybe general purpose object recognition is too much? Maybe all we need is a little art? I want to leave you with a YouTube video from a recent 2015 lecture by Professor A.A. Efros titled "Making Visual Data a First-Class Citizen." If you want to hear the story in the master's own words, grab a drink and enjoy the lecture.

"Visual data is the biggest Big Data there is (Cisco projects that it will soon account for over 90% of internet traffic), but currently, the main way we can access it is via associated keywords. I will talk about some efforts towards indexing, retrieving, and mining visual data directly, without the use of keywords." ― A.A. Efros, Making Visual Data a First-Class Citizen



Wednesday, April 08, 2015

Deep Learning vs Probabilistic Graphical Models vs Logic

Today, let's take a look at three paradigms that have shaped the field of Artificial Intelligence in the last 50 years: Logic, Probabilistic Methods, and Deep Learning. The empirical, "data-driven", or big-data / deep-learning ideology triumphs today, but that wasn't always the case. Some of the earliest approaches to AI were based on Logic, and the transition from logic to data-driven methods has been heavily influenced by probabilistic thinking, something we will be investigating in this blog post.

Let's take a look back at Logic and Probabilistic Graphical Models and make some predictions on where the field of AI and Machine Learning is likely to go in the near future. We will proceed in chronological order.

Image from Coursera's Probabilistic Graphical Models course

1. Logic and Algorithms (Common-sense "Thinking" Machines)


A lot of early work on Artificial Intelligence was concerned with Logic, Automated Theorem Proving, and manipulating symbols. It should not be a surprise that John McCarthy's seminal 1959 paper on AI had the title "Programs with common sense."

If we peek inside one of the most popular AI textbooks, namely "Artificial Intelligence: A Modern Approach," we immediately notice that the beginning of the book is devoted to search, constraint satisfaction problems, first order logic, and planning. The third edition's cover (pictured below) looks like a big chess board (because being good at chess used to be a sign of human intelligence), features a picture of Alan Turing (the father of computing theory) as well as a picture of Aristotle (one of the greatest classical philosophers, who had quite a lot to say about intelligence).

The cover of AIMA, the canonical AI text for undergraduate CS students

Unfortunately, logic-based AI brushes the perception problem under the rug, and I've argued quite some time ago that understanding how perception works is really the key to unlocking the secrets of intelligence. Perception is one of those things which is easy for humans and immensely difficult for machines. (To read more see my 2011 blog post, Computer Vision is Artificial Intelligence). Logic is pure and traditional chess-playing bots are very algorithmic and search-y, but the real world is ugly, dirty, and ridden with uncertainty.

I think most contemporary AI researchers agree that Logic-based AI is dead. The kind of world where everything can be perfectly observed, a world with no measurement error, is not the world of robotics and big-data.  We live in the era of machine learning, and numerical techniques triumph over first-order logic.  As of 2015, I pity the fool who prefers Modus Ponens over Gradient Descent.

Logic is great for the classroom, and I suspect that once enough perception problems become "essentially solved" we will see a resurgence in Logic.  And while there will be plenty of open perception problems in the future, there will be scenarios where the community can stop worrying about perception and start revisiting these classical ideas. Perhaps in 2020.

Further reading: Logic and Artificial Intelligence from the Stanford Encyclopedia of Philosophy

2. Probability, Statistics, and Graphical Models ("Measuring" Machines)


Probabilistic methods in Artificial Intelligence came out of the need to deal with uncertainty. The middle part of the "Artificial Intelligence: A Modern Approach" textbook is called "Uncertain Knowledge and Reasoning" and is a great introduction to these methods.  If you're picking up AIMA for the first time, I recommend you start with this section. And if you're a student starting out with AI, do yourself a favor and don't skimp on the math.

Intro to PDFs from Penn State's Probability Theory and Mathematical Statistics course

When most people think about probabilistic methods they think of counting.  In layman's terms it's fair to think of probabilistic methods as fancy counting methods.  Let's briefly take a look at what used to be the two competing methods for thinking probabilistically.

Frequentist methods are very empirical -- these methods are data-driven and make inferences purely from data.  Bayesian methods are more sophisticated and combine data-driven likelihoods with magical priors.  These priors often come from first principles or "intuitions" and the Bayesian approach is great for combining heuristics with data to make cleverer algorithms -- a nice mix of the rationalist and empiricist world views.
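A five-line coin-flip example makes the contrast concrete (toy numbers, nothing more): the frequentist estimate is the raw ratio of heads, while the Bayesian estimate blends the same counts with a Beta prior.

# Frequentist vs. Bayesian on the same data: estimate P(heads) from 5 flips.
import numpy as np

flips = np.array([1, 1, 1, 0, 1])           # 4 heads, 1 tail

p_mle = flips.mean()                        # frequentist / maximum likelihood: 0.8

a, b = 5.0, 5.0                             # Beta prior: "coins are roughly fair"
p_bayes = (a + flips.sum()) / (a + b + len(flips))   # posterior mean: 0.6

print(p_mle, p_bayes)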

What is perhaps more exciting than the Frequentist vs. Bayesian flamewar is something known as Probabilistic Graphical Models.  This class of techniques comes from computer science, and even though Machine Learning is now a strong component of a CS and a Statistics degree, the true power of statistics only comes when it is married with computation.

Probabilistic Graphical Models are a marriage of Graph Theory with Probabilistic Methods and they were all the rage among Machine Learning researchers in the mid 2000s. Variational methods, Gibbs Sampling, and Belief Propagation were being pounded into the brains of CMU graduate students when I was in graduate school (2005-2011) and provided us with a superb mental framework for thinking about machine learning problems. I learned most of what I know about Graphical Models from Carlos Guestrin and Jonathan Huang. Carlos Guestrin is now the CEO of GraphLab, Inc (now known as Dato) which is a company that builds large scale products for machine learning on graphs and Jonathan Huang is a senior research scientist at Google.
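To give a tiny taste of the message-passing machinery mentioned above, here is sum-product belief propagation on a three-node chain with made-up potentials; on a chain (or any tree) this gives exact marginals.

# Sum-product belief propagation on a binary chain x1 - x2 - x3.
# The unary and pairwise potentials below are invented for illustration.
import numpy as np

unary = [np.array([0.7, 0.3]),        # phi_i(x_i) for each node
         np.array([0.5, 0.5]),
         np.array([0.2, 0.8])]
pair = np.array([[0.9, 0.1],          # psi(x_i, x_{i+1}): neighbors tend to agree
                 [0.1, 0.9]])

fwd = [np.ones(2)]                    # messages flowing left-to-right
for i in range(2):
    m = pair.T @ (unary[i] * fwd[i])
    fwd.append(m / m.sum())

bwd = [np.ones(2)]                    # messages flowing right-to-left
for i in range(2, 0, -1):
    m = pair @ (unary[i] * bwd[0])
    bwd.insert(0, m / m.sum())

marginals = [unary[i] * fwd[i] * bwd[i] for i in range(3)]
print([m / m.sum() for m in marginals])   # exact node marginals on this chain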

The video below is a high level overview of GraphLab, but it serves as a very nice overview of "graphical thinking" and how it fits into the modern data scientist's tool-belt. Carlos is an excellent lecturer and his presentation is less about the company's product and more about ways for thinking about next generation machine learning systems.

A Computational Introduction to Probabilistic Graphical Models
by GraphLab, Inc CEO Prof. Carlos Guestrin

If you think that deep learning is going to solve all of your machine learning problems, you should really take a look at the above video.  If you're building recommender systems, an analytics platform for healthcare data, designing a new trading algorithm, or building the next generation search engine, Graphical Models are a perfect place to start.

Further reading:
Belief Propagation Algorithm Wikipedia Page
An Introduction to Variational Methods for Graphical Models by Michael Jordan et al.
Michael Jordan's webpage (one of the titans of inference and graphical models)

3. Deep Learning and Machine Learning (Data-Driven Machines)

Machine Learning is about learning from examples and today's state-of-the-art recognition techniques require a lot of training data, a deep neural network, and patience. Deep Learning emphasizes the network architecture of today's most successful machine learning approaches.  These methods are based on "deep" multi-layer neural networks with many hidden layers. NOTE: I'd like to emphasize that using deep architectures (as of 2015) is not new.  Just check out the following "deep" architecture from 1998.

LeNet-5 Figure From Yann LeCun's seminal "Gradient-based learning
applied to document recognition" paper.

When you take a look at a modern guide to LeNet, it comes with the following disclaimer:

"To run this example on a GPU, you need a good GPU. It needs at least 1GB of GPU RAM. More may be required if your monitor is connected to the GPU.

When the GPU is connected to the monitor, there is a limit of a few seconds for each GPU function call. This is needed as current GPUs can’t be used for the monitor while doing computation. Without this limit, the screen would freeze for too long and make it look as if the computer froze. This example hits this limit with medium-quality GPUs. When the GPU isn’t connected to a monitor, there is no time limit. You can lower the batch size to fix the time out problem."

It really makes me wonder how Yann was able to get anything out of his deep model back in 1998. Perhaps it's not surprising that it took another decade for the rest of us to get the memo.

UPDATE: Yann pointed out (via a Facebook comment) that the ConvNet work dates back to 1989. "It had about 400K connections and took about 3 weeks to train on the USPS dataset (8000 training examples) on a SUN4 machine." -- LeCun



NOTE: At roughly the same time (~1998) two crazy guys in California were trying to cache the entire internet inside the computers in their garage (they started some funny-sounding company which starts with a G). I don't know how they did it, but I guess sometimes to win big you have to do things that don't scale. Eventually the world will catch up.

Further reading:
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, November 1998.

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel: Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1(4):541-551, Winter 1989

Deep Learning code: Modern LeNet implementation in Theano and docs.


Conclusion

I don't see traditional first-order logic making a comeback anytime soon. And while there is a lot of hype behind deep learning, distributed systems and "graphical thinking" are likely to make a much more profound impact on data science than heavily optimized CNNs. There is no reason why deep learning can't be combined with a GraphLab-style architecture, and some of the exciting new machine learning work in the next decade is likely to be a marriage of these two philosophies.


You can also check out a relevant post from last month:
Deep Learning vs Machine Learning vs Pattern Recognition

Discuss on Hacker News

Saturday, April 04, 2015

Three Fundamental Dimensions for Thinking About Machine Learning Systems

Today, let's set cutting-edge machine learning and computer vision techniques aside. You probably already know that computer vision (or "machine vision") is the branch of computer science / artificial intelligence concerned with recognizing objects like cars, faces, and hand gestures in images. And you also probably know that Machine Learning algorithms are used to drive state-of-the-art computer vision systems. But what's missing is a bird's-eye view of how to think about designing new learning-based systems. So instead of focusing on today's trendiest machine learning techniques, let's go all the way back to day 1 and build ourselves a strong foundation for thinking about machine learning and computer vision systems.





Allow me to introduce three fundamental dimensions which you can follow to obtain computer vision masterdom. The first dimension is mathematical, the second is verbal, and the third is intuitive.

On a personal level, most of my daily computer vision activities directly map onto these dimensions. When I'm at a coffee shop, I prefer the mathematical - pen and paper are my weapons of choice. When it's time to get ideas out of my head, there's nothing like a solid founder-founder face-to-face meeting, an occasional MIT visit to brainstorm with my scientist colleagues, or simply rubberducking (rubber duck debugging) with developers. And when it comes to engineering, interacting with a live learning system can help develop the intuition necessary to make a system more powerful, more efficient, and ultimately much more robust.

Mathematical: Learn to love the linear classifier

At the core of machine learning is mathematics, so you shouldn't be surprised that I include mathematical as one of the three fundamental dimensions of thinking about computer vision.

The single most important concept in all of machine learning which you should master is the idea of the classifier. For some of you, classification is a well-understood problem; however, too many students prematurely jump into more complex algorithms like randomized decision forests and multi-layer neural networks, without first grokking the power of the linear classifier. Plenty of data scientists will agree that the linear classifier is the most fundamental machine learning algorithm. In fact, when Peter Norvig, Director of Research at Google, was asked "Which AI field has surpassed your expectations and surprised you the most?" in his 2010 interview, he answered with "machine learning by linear separators."

The illustration below depicts a linear classifier. In two dimensions, a linear classifier is a line which separates the positive examples from the negative examples.  You should first master the 2D linear classifier, even though in most applications you'll need to explore a higher-dimensional feature space. My personal favorite learning algorithm is the linear support vector machine, or linear SVM. In an SVM, overly confident data points do not influence the decision boundary. Or put another way, learning behaves as if those confident points aren't even there! This is a very useful property for large-scale learning problems where you can't fit all data into memory. You're going to want to master the linear SVM (and how it relates to Linear Discriminant Analysis, Linear Regression, and Logistic Regression) if you're going to pass one of my whiteboard data-science interviews.


Linear Support Vector Machine from the SVM Wikipedia page


An intimate understanding of the linear classifier is necessary to understand how deep learning systems work.  The neurons inside a multi-layer neural network are little linear classifiers, and while the final decision boundary is non-linear, you should understand the underlying primitives very well. Loosely speaking, you can think of the linear classifier as a simple spring system and a more complex classifier as a higher-order assembly of springs.


Also, there are going to be scenarios in your life as a data-scientist where a linear classifier should be the first machine learning algorithm you try. So don't be afraid to use some pen and paper, get into that hinge loss, and master the fundamentals.
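If you want the pen-and-paper version in code, here is a small sub-gradient descent sketch of the hinge loss (a Pegasos-style toy, not a production SVM); notice that points with margin greater than one contribute nothing to the update, which is exactly the "confident points don't matter" property described above.

# Toy linear SVM trained by sub-gradient descent on the regularized hinge loss:
#   L(w, b) = (lam/2)||w||^2 + mean_i max(0, 1 - y_i (w . x_i + b))
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200, seed=0):
    # X: (n, d) features; y: labels in {-1, +1}
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            if y[i] * (X[i] @ w + b) < 1:          # inside the margin: push
                w = (1 - lr * lam) * w + lr * y[i] * X[i]
                b += lr * y[i]
            else:                                  # confident point: only shrink w
                w = (1 - lr * lam) * w
    return w, b

rng = np.random.default_rng(1)                     # two toy Gaussian blobs
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
w, b = train_linear_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))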

Further reading: Google's Research Director talks about Machine Learning. Peter Norvig's Reddit AMA on YouTube from 2010.
Further reading: A demo for playing with linear classifiers in the browser. Linear classifier Javascript demo from Stanford's CS231n: Convolutional Neural Networks for Visual Recognition.
Further reading: My blog post: Deep Learning vs Machine Learning vs Pattern Recognition

Verbal: Talk about your vision (and join a community)

As you start acquiring knowledge of machine learning concepts, the best way forward is to speak up. Learn something, then teach a friend. As counterintuitive as it sounds, when it comes down to machine learning mastery, human-human interaction is key. This is why getting a ML-heavy Masters or PhD degree is ultimately the best bet for those adamant about becoming pioneers in the field. Daily conversations are necessary to strengthen your ideas.  See Raphael's "The School of Athens" for a depiction of what I think of as the ideal learning environment.  I'm sure half of those guys were thinking about computer vision.


An ideal ecosystem for collaboration and learning about computer vision


If you're not ready for a full-time graduate-level commitment to the field, consider a.) taking an advanced undergraduate course in vision/learning from your university, b.) taking a machine learning MOOC, or c.) taking part in a practical and application-focused online community/course focusing on computer vision.

During my 12-year academic stint, I made the observation that talking to your peers about computer vision and machine learning is more important than listening to teachers/supervisors/mentors.  Of course, there's much value in having a great teacher, but don't be surprised if you get 100x more face-to-face time with your friends compared to student-teacher interactions.  So if you take an online course like Coursera's Machine Learning MOOC, make sure to take it with friends.  Pause the video and discuss. Go to dinner and discuss. Write some code and discuss. Rinse, lather, repeat.

Coursera's Machine Learning MOOC taught by Andrew Ng


Another great opportunity is to follow Adrian Rosebrock's pyimagesearch.com blog, where he focuses on python and computer vision applications.  

Further reading: Old blog post: Why your vision lab needs a reading group

Homework assignment: Find somebody on the street and teach them about machine learning.

Intuitive: Play with a real-time machine learning system

The third and final dimension is centered around intuition. Intuition is the ability to understand something immediately, without the need for conscious reasoning. The following guidelines are directed towards real-time object detection systems, but can also transfer over to other applications like learning-based attribution models for advertisements, high-frequency trading, as well as numerous tasks in robotics.

To gain some true insights about object detection, you should experience a real-time object detection system.  There's something unique about seeing a machine learning system run in real-time, right in front of you.  And when you get to control the input to the system, such as when using a webcam, you can learn a lot about how the algorithms work.  For example, seeing the classification score go down as you occlude the object of interest, and seeing the detection box go away when the object goes out of view is fundamental to building intuition about what works and what elements of a system need to improve.

I see countless students tweaking an algorithm, applying it to a static large-scale dataset, and then waiting for the precision-recall curve to be generated. I understand that this is the hard and scientific way of doing things, but unless you've already spent a few years making friends with every pixel, you're unlikely to make a lasting contribution this way. And it's not very exciting -- you'll probably fall asleep at your desk.

Using a real-time feedback loop (see illustration below), you can learn about the patterns which are intrinsically difficult to classify, as well what environmental variations (lights, clutter, motion) affect your system the most.  This is something which really cannot be done with a static dataset.  So go ahead, mine some intuition and play.
Visual Debugging: Designing the vision.ai real-time gesture-based controller in Fall 2013
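Here is the skeleton of such a feedback loop using OpenCV's webcam capture; score_frame is a deliberately dumb placeholder you would swap out for whatever detector or classifier you are trying to build intuition about.

# A bare-bones real-time feedback loop: grab webcam frames, run *your* scoring
# function on each one, and watch the number react as you move, occlude the
# object, and change the lighting.
import cv2

def score_frame(frame):
    # placeholder: plug in your detector here (e.g. an SVM decision value)
    return float(frame.mean()) / 255.0

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    s = score_frame(frame)
    cv2.putText(frame, "score: %.2f" % s, (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    cv2.imshow("live feedback", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):   # press q to quit
        break
cap.release()
cv2.destroyAllWindows()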

Visual feedback is where our work at vision.ai truly stands out. Take a look at the following video, where we show a live example of training and playing with a detector based on vision.ai's VMX object recognition system.


NOTE: There are a handful of other image recognition systems out there which you can turn into real-time vision systems, but be warned that optimization for real-time applications requires some non-trivial software engineering experience.  We've put a lot of care into our system so that the detection scores are analogous to a linear SVM scoring strategy. Making the output of a non-trivial learning algorithm backwards-compatible with a linear SVM isn't always easy, but in my opinion, well worth the effort.

Extra Credit: See comments below for some free VMX by vision.ai beta software licenses so you can train some detectors using our visual feedback interface and gain your own machine vision intuition.

Conclusion

The three dimensions, namely mathematical, verbal, and intuitive provide different ways for advancing your knowledge of machine learning and computer vision systems.  So remember to love the linear classifier, talk to your friends, and use a real-time feedback loop when designing your machine learning system.




Thursday, March 26, 2015

Venture Pitch Contest at CVPR 2015 in Boston, MA

This year's CVPR will be in Boston, and as always, I expect it to be the single best venue to meet computer vision experts and see cutting edge research. I expect Google and Facebook to show off their best Deep Learning systems, NVIDIA to demo their newest GPUs, and dozens of computer vision startups to be looking for talent to grow their teams.


I expect the entrepreneur/academic ratio to be much higher than in previous years, as it is getting easier for PhD students and postdocs to start their own companies.  This year's CVPR will even feature a Venture Pitch Contest as part of the Fourth Annual Vision Industry and Entrepreneur (VIEW) Workshop at CVPR. From the VIEW workshop webpage:

Computer vision as a technology is penetrating the industry at an extraordinary pace with many computer vision applications directly becoming consumer commodities. Both startups and big companies have contributed to this trend. At the fourth annual Vision Industry and Entrepreneur Workshop, we are organizing a first of its kind Startup Pitch Contest. As a computer vision innovator, this is your chance to present the next great computer vision product idea to a distinguished panel of judges which will include Venture Capitalists, Investors and leading Researchers in the field.
Applications should employ novel computer vision technologies towards an innovative product. The best submissions would be selected for an Elevator Pitch presentation in front of the judges. Prizes would be awarded to the winners who would be announced at the end of the workshop. The details about the judging criteria will be posted on the website.
The submission is broken into two phases – Preliminary submission consisting of a title and an abstract, and, Final submission consisting of a one page summary with technology overview, feasibility, outreach (customers and market size) and monetization (business model). The summary should be tailored at soliciting funding from sources such as venture capital to invest in the idea. The applicants should indicate whether they are academic researchers or industry professionals. Only non-confidential material may be submitted.

Even if you're not ready to pitch, you can submit a poster or demo to the Industry Session part of the VIEW 2015 Workshop. Great place to show off your new computer vision-powered app.  One of the organizers, Samson Timoner, told me the deadlines for submission have been extended. Here are the new dates:

Submission: April 3, 2015 (extended)
Notification: April 8, 2015 (extended)
Workshop: June 11, 2015

This year's CVPR is going to be a great place to network with startups, share ideas, see cutting-edge research and (NEW in 2015) meet folks from the venture capital world. Who knows, if I'm there, I might be wearing a vision.ai T-shirt.

Mobileye's quest to put Deep Learning inside every new car

In Amnon Shashua's vision of the future, every car can see.  He's convinced that the key technology behind the imminent driving revolution is going to be computer vision, and to experience this technology, we won't have to wait for fully autonomous cars to become mainstream.  I had the chance to hear Shashua's vision of the future this past Monday, and from what I'm about to tell you, it looks like there's going to be a whole lot of Deep Learning inside tomorrow's car. Cars equipped with Deep Learning-based pedestrian avoidance systems (see Figure 1) can sense people and dangerous situations while you're behind the wheel. From winning large-scale object recognition competitions like ImageNet, to heavy internal use by Google, Deep Learning is now at the foundation of many hi-tech startups and giants. And when it comes to cars, Deep Learning promises to give us both safer roads and the highly-anticipated hands-free driving experience.

Mobileye's Deep Learning-based Pedestrian Detector

Mobileye Co-founder Amnon Shashua shares his vision during an invited lecture at MIT
Amnon Shashua is the Co-founder & CTO of Mobileye and this past Monday (March 23, 2015) he gave a compelling talk at MIT’s Brains, Minds & Machines Seminar Series titled “Computer Vision that is Changing Our Lives”. Shashua discussed Mobileye’s Deep Learning chips, robots, autonomous driving, as well as introduced his most recent project, a wearable computer vision unit called OrCam.


Fig 2. Prof Amnon Shashua, CTO of Mobileye

Let's take a deeper look at the man behind Mobileye and his vision. Below is my summary of Shashua's talk as well as some personal insights regarding Mobileye's embedded computer vision technology and how it relates to cloud-based computer vision.

Mobileye's academic roots
You might have heard stories of bold entrepreneurs dropping out of college to form million dollar startups, but this isn't one of them.  This is the story of a professor who turned his ideas into a publicly traded company, Mobileye (NYSE:MBLY). Amnon Shashua is a Professor at Hebrew University, and his lifetime achievements suggest that for high-tech entrepreneurship, it is pretty cool to stay in school. And while Shashua and I never overlapped academically (he is 23 years older than me), both of us spent some time at MIT as postdoctoral researchers.

Deep Learning's impact on Mobileye
During his presentation at MIT, Amnon Shashua showcased a wide array of computer vision problems that are currently being solved by Mobileye's real-time computer vision systems. These systems are image-based and do not require expensive 3D sensors such as the ones commonly found on top of self-driving cars.  He showed videos of real-time lane detection, pedestrian detection, animal detection, and road surface detection. I have seen many similar visualizations during my academic career; however, Shashua emphasized that deep learning is now used to power most of Mobileye's computer vision systems.

Question: I genuinely wonder how much the shift to Deep methods improved Mobileye's algorithms, or if the move is a strategic technology upgrade to stay relevant in an era where Google and the rest of the competition are feverishly pouncing on the deep learning landscape. There's a lot of competition on the hardware front, and it seems like the chase for ASIC-like Deep Learning Miners/Trainers is on.


The AlexNet CNN diagram from the popular Krizhevsky/Sutskever/Hinton paper. Shashua explicitly mentioned the AlexNet model during his MIT talk, and it appears that Mobileye has done their Deep Learning homework.

The early Mobileye: Mobileye didn’t wait for the deep learning revolution to happen. They started shipping computer vision technology for vehicles using traditional techniques more than a decade ago. In fact, I attended a Mobileye presentation at CMU almost a full decade ago -- it was given by Andras Ferencz at the 2005 CMU VASC Seminar.  This week's talk by Shashua suggests that Mobileye was able to successfully modernize their algorithms to use deep learning.

Further reading: To learn about object recognition methods in computer vision which were popular before Deep Learning, see my January blog post, titled From feature descriptors to deep learning: 20 years of computer vision.


Fig 3. "Deep Learning at Mobileye" presentation at the 2015 Deutsche Bank Global 
Auto Industry Conference.

Mobileye's custom Computer Vision hardware
Mobileye is not a software computer vision company -- they bake their algorithms into custom computer vision chips. Shashua reported some impressive computation speeds on what appear to be tiny vision chips. Their custom hardware is more specialized than GPUs (which are quite common for deep learning, scientific computing, and computer graphics, and are actually affordable). But Mobileye chips do not need to perform the computationally expensive big-data training stage onboard, so their devices can be much leaner than GPUs. Mobileye has lots of hardware experience, and regarding machine learning, Shashua mentioned that Mobileye has more vehicle-related training data than they know what to do with.

Fig 4. The Mobileye Q2 lane detection chip.


Embedded vs. Cloud-based computer vision
While Mobileye makes a strong case for embedded computer vision, there are many scenarios today where the alternative cloud-based computer vision approach triumphs.  Cloud-based computer vision is about delivering powerful algorithms as a service, over the web.  In a cloud-based architecture, the algorithms live in a data center and applications talk to the vision backend via an API layer.  And while certain mission-critical applications cannot have a cloud component (e.g., a drone flying over the desert), cloud-based vision systems promise to turn laptops and smartphones into smart devices, without the need to bake algorithms into chips. In-home surveillance apps, home-automation apps, exploratory robotics projects, and even scientific research can benefit from cloud-based computer vision.  Most importantly, cloud-based deployment means that startups can innovate faster, and entire products can evolve much faster.
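The client side of that architecture can be as small as a single HTTP request. The endpoint, parameters, and response fields below are hypothetical placeholders (not a real vision.ai or other API), but they show the shape of the pattern: ship pixels up, get JSON detections back.

# Hypothetical cloud-vision client: the heavy model lives behind an HTTP API,
# the application just uploads an image and reads back JSON.
import requests

def detect_objects(image_path, api_url="https://vision.example.com/v1/detect",
                   api_key="YOUR_KEY"):
    with open(image_path, "rb") as f:
        resp = requests.post(api_url,
                             headers={"Authorization": "Bearer " + api_key},
                             files={"image": f})
    resp.raise_for_status()
    return resp.json()      # e.g. [{"label": "person", "score": 0.92, ...}]

# detections = detect_objects("frontdoor.jpg")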

Unlike Mobileye's decade-long journey, I suspect cloud-based computer vision platforms are going to make computer vision development much faster, giving developers a Heroku-like button for visual AI.  Choosing diverse compilation targets such as a custom chip or Javascript will be handled by the computer vision platform, allowing computer vision developers to work smarter and deploy to more devices.

Conclusion and Predictions
Even if you don't believe that today's computer vision-based safety features make cars smart enough to call them robots, driving tomorrow's car is sure going to feel different.  I will leave you with one final note: Mobileye's CTO hinted that if you are going to design a car in 2015 on top of computer vision tech, you might reconsider traditional safety features such as airbags, and create a leaner, less-expensive AI-enabled vehicle.


Fig 5. Mobileye technology illustration [safety.trw.com].


Watch the Mobileye presentation on YouTube: If you are interested in embedded deep learning, autonomous vehicles, or want to get a taste of how industry veterans compile their deep networks into chips, you can watch the full 38-minute video of Amnon's January 2015 Mobileye presentation.



I hope you learned a little bit about vehicle computer vision systems, embedded Deep Learning, and got a glimpse of the visual intelligence revolution that is happening today. Feel free to comment below, follow me on Twitter (@quantombone), or sign-up to the vision.ai mailing list if you are a developer interested in taking vision.ai's cloud-based computer vision platform for a spin.

Friday, March 20, 2015

Deep Learning vs Machine Learning vs Pattern Recognition

Let's take a close look at three related terms (Deep Learning vs Machine Learning vs Pattern Recognition), and see how they relate to some of the hottest tech-themes in 2015 (namely Robotics and Artificial Intelligence). In our short journey through jargon, you should acquire a better understanding of how computer vision fits in, as well as gain an intuitive feel for how the machine learning zeitgeist has slowly evolved over time.

Fig 1. Putting a human inside a computer is not Artificial Intelligence
(Photo from WorkFusion Blog)

If you look around, you'll see no shortage of jobs at high-tech startups looking for machine learning experts. While only a fraction of them are looking for Deep Learning experts, I bet most of these startups can benefit from even the most elementary kind of data scientist. So how do you spot a future data-scientist? You learn how they think. 

The three highly-related "learning" buzz words

“Pattern recognition,” “machine learning,” and “deep learning” represent three different schools of thought.  Pattern recognition is the oldest (and as a term is quite outdated). Machine Learning is the most fundamental (one of the hottest areas for startups and research labs as of today, early 2015). And Deep Learning is the new, the big, the bleeding-edge -- we’re not even close to thinking about the post-deep-learning era.  Just take a look at the following Google Trends graph.  You'll see that a) Machine Learning is rising like a true champion, b) Pattern Recognition started as synonymous with Machine Learning, c) Pattern Recognition is dying, and d) Deep Learning is new and rising fast.



1. Pattern Recognition: The birth of smart programs

Pattern recognition was a term popular in the 70s and 80s. The emphasis was on getting a computer program to do something “smart” like recognize the character "3". And it really took a lot of cleverness and intuition to build such a program. Just think of "3" vs "B" and "3" vs "8".  Back in the day, it didn’t really matter how you did it as long as there was no human-in-a-box pretending to be a machine. (See Figure 1)  So if your algorithm would apply some filters to an image, localize some edges, and apply morphological operators, it was definitely of interest to the pattern recognition community.  Optical Character Recognition grew out of this community and it is fair to call “Pattern Recognition” the “Smart" Signal Processing of the 70s, 80s, and early 90s. Decision trees, heuristics, quadratic discriminant analysis, etc all came out of this era. Pattern Recognition became something CS folks did, and not EE folks.  One of the most popular books from that time period is the infamous, invaluable Duda & Hart "Pattern Classification" book, which is still a great starting point for young researchers.  But don't get too caught up in the vocabulary, it's a bit dated.



The character "3" partitioned into 16 sub-matrices. Custom rules, custom decisions, and custom "smart" programs used to be all the rage. 


Quiz: The most popular Computer Vision conference is called CVPR and the PR stands for Pattern Recognition.  Can you guess the year of the first CVPR conference?

2. Machine Learning: Smart programs can learn from examples

Sometime in the early 90s people started realizing that a more powerful way to build pattern recognition algorithms is to replace an expert (who probably knows way too much about pixels) with data (which can be mined from cheap laborers).  So you collect a bunch of face images and non-face images, choose an algorithm, and wait for the computations to finish.  This is the spirit of machine learning.  "Machine Learning" emphasizes that the computer program (or machine) must do some work after it is given data.  The Learning step is made explicit.  And believe me, waiting 1 day for your computations to finish scales better than inviting your academic colleagues to your home institution to design some classification rules by hand.

"What is Machine Learning" from Dr Natalia Konstantinova's Blog. The most important part of this diagram are the "Gears" which suggests that crunching/working/computing is an important step in the ML pipeline.

As Machine Learning grew into a major research topic in the mid 2000s, computer scientists began applying these ideas to a wide array of problems.  No longer was it only character recognition, cat vs. dog recognition, and other “recognize a pattern inside an array of pixels” problems.  Researchers started applying Machine Learning to Robotics (reinforcement learning, manipulation, motion planning, grasping), to genome data, as well as to predict financial markets.  Machine Learning was married with Graph Theory under the brand “Graphical Models,” every robotics expert had no choice but to become a Machine Learning Expert, and Machine Learning quickly became one of the most desired and versatile computing skills.  However "Machine Learning" says nothing about the underlying algorithm.  We've seen convex optimization, Kernel-based methods, Support Vector Machines, as well as Boosting have their winning days.  Together with some custom manually engineered features, we had lots of recipes, lots of different schools of thought, and it wasn't entirely clear how a newcomer should select features and algorithms.  But that was all about to change...

Further reading: To learn more about the kinds of features that were used in Computer Vision research see my blog post: From feature descriptors to deep learning: 20 years of computer vision.

3. Deep Learning: one architecture to rule them all

Fast forward to today and what we’re seeing is a large interest in something called Deep Learning. The most popular kinds of Deep Learning models, as they are used in large-scale image recognition tasks, are known as Convolutional Neural Nets, or simply ConvNets.

ConvNet diagram from Torch Tutorial

Deep Learning emphasizes the kind of model you might want to use (e.g., a deep convolutional multi-layer neural network) and that you can use data to fill in the missing parameters.  But with deep learning comes great responsibility.  Because you are starting with a model of the world which has a high dimensionality, you really need a lot of data (big data) and a lot of crunching power (GPUs). Convolutions are used extensively in deep learning (especially computer vision applications), and the architectures are far from shallow.

If you're starting out with Deep Learning, simply brush up on some elementary Linear Algebra and start coding.  I highly recommend Andrej Karpathy's Hacker's guide to Neural Networks. Implementing your own CPU-based backpropagation algorithm on a non-convolution based problem is a good place to start.
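In that spirit, here is roughly what such a warm-up looks like: a two-layer network with hand-written backpropagation learning XOR, no convolutions and no framework (toy code, with sigmoid activations and a squared-error loss assumed for simplicity).

# Hand-written backpropagation on a non-convolutional toy problem (XOR):
# a 2-8-1 network, sigmoid units, squared-error loss, plain gradient descent.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass (chain rule, layer by layer)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(0)

print(out.round(3).ravel())   # should approach [0, 1, 1, 0]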

There are still lots of unknowns. The theory of why deep learning works is incomplete, and no single guide or book is better than true machine learning experience.  There are lots of reasons why Deep Learning is gaining popularity, but Deep Learning is not going to take over the world.  As long as you continue brushing up on your machine learning skills, your job is safe. But don't be afraid to chop these networks in half, slice 'n dice at will, and build software architectures that work in tandem with your learning algorithm.  The Linux Kernel of tomorrow might run on Caffe (one of the most popular deep learning frameworks), but great products will always need great vision, domain expertise, market development, and most importantly: human creativity.

Other related buzz-words

Big-data is the philosophy of measuring all sorts of things, saving that data, and looking through it for information.  For business, this big-data approach can give you actionable insights.  In the context of learning algorithms, we’ve only started seeing the marriage of big-data and machine learning within the past few years.  Cloud-computing, GPUs, DevOps, and PaaS providers have made large scale computing within reach of the researcher and ambitious "everyday" developer. 

Artificial Intelligence is perhaps the oldest term, the most vague, and the one that has gone through the most ups and downs over the past 50 years. When somebody says they work on Artificial Intelligence, you are either going to want to laugh at them or take out a piece of paper and write down everything they say.

Further reading: My 2011 Blog post Computer Vision is Artificial Intelligence.

Conclusion

Machine Learning is here to stay. Don't think about it as Pattern Recognition vs Machine Learning vs Deep Learning, just realize that each term emphasizes something a little bit different.  But the search continues.  Go ahead and explore. Break something. We will continue building smarter software and our algorithms will continue to learn, but we've only begun to explore the kinds of architectures that can truly rule-them-all.

If you're interested in real-time vision applications of deep learning, namely those suitable for robotic and home automation applications, then you should check out what we've been building at vision.ai. Hopefully in a few days, I'll be able to say a little bit more. :-)

Until next time.




Tuesday, January 20, 2015

From feature descriptors to deep learning: 20 years of computer vision

We all know that deep convolutional neural networks have produced some stellar results on object detection and recognition benchmarks in the past two years (2012-2014), so you might wonder: what did the earlier object recognition techniques look like? How do the designs of earlier recognition systems relate to the modern multi-layer convolution-based framework?

Let's take a look at some of the big ideas in Computer Vision from the last 20 years.

The rise of the local feature descriptors: ~1995 to ~2000
When SIFT (an acronym for Scale Invariant Feature Transform) was introduced by David Lowe in 1999, the world of computer vision research changed almost overnight. It was a robust solution to the problem of comparing image patches. Before SIFT entered the game, people were just using SSD (sum of squared distances) to compare patches and not giving it much thought.
The SIFT recipe: gradient orientations, normalization tricks

SIFT is something called a local feature descriptor -- it is one of those research findings which is the result of one ambitious man hackplaying with pixels for more than a decade.  Lowe and the University of British Columbia got a patent on SIFT and Lowe released a nice compiled binary of his very own SIFT implementation for researchers to use in their work.  SIFT allows a point inside an RGB image to be represented robustly by a low dimensional vector.  When you take multiple images of the same physical object while rotating the camera, the SIFT descriptors of corresponding points are very similar in their 128-D space.  At first glance it seems silly that you need to do something as complex as SIFT, but believe me: just because you, a human, can look at two image patches and quickly "understand" that they belong to the same physical point, doesn't mean the task is easy for machines.  SIFT had massive implications for the geometric side of computer vision (stereo, Structure from Motion, etc) and later became the basis for the popular Bag of Words model for object recognition.

Seeing a technique like SIFT dramatically outperform an alternative method like Sum-of-Squared-Distances (SSD) Image Patch Matching firsthand is an important step in every aspiring vision scientist's career. And SIFT isn't just a vector of filter bank responses -- the binning and normalization steps are very important. It is also worth noting that while SIFT was initially (in its published form) applied to the output of an interest point detector, later it was found that the interest point detection step was not important in categorization problems.  For categorization, researchers eventually moved towards vector quantized SIFT applied densely across an image.
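You can feel the SSD-versus-SIFT difference yourself with a few lines of OpenCV (a rough sketch: it assumes a fairly recent OpenCV build where SIFT lives in the main module, and the image file names are placeholders):

# Raw pixel SSD falls apart under rotation and scale changes, while SIFT
# descriptor matching with Lowe's ratio test survives them.
import cv2
import numpy as np

img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

# Naive SSD between two same-size patches (only meaningful if already aligned)
patch1, patch2 = img1[:64, :64].astype(float), img2[:64, :64].astype(float)
print("SSD:", np.sum((patch1 - patch2) ** 2))

# SIFT keypoints + 128-D descriptors, matched with the ratio test
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print("good SIFT matches:", len(good))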

I should also mention that other descriptors such as Spin Images (see my 2009 blog post on spin images) came out a little bit earlier than SIFT, but because Spin Images were solely applicable to 2.5D data, this feature's impact wasn't as great as that of SIFT. 

The modern dataset (aka the hardening of vision as science): ~2000 to ~2005
Homography estimation, ground-plane estimation, robotic vision, SfM, and all other geometric problems in vision greatly benefited from robust image features such as SIFT.  But towards the end of the 1990s, it was clear that the internet was the next big thing.  Images were going online. Datasets were being created.  And no longer was the current generation solely interested in structure recovery (aka geometric) problems.  This was the beginning of the large-scale dataset era with Caltech-101 slowly gaining popularity and categorization research on the rise. No longer were researchers evaluating their own algorithms on their own in-house datasets -- we now had a more objective and standard way to determine if yours is bigger than mine.  Even though Caltech-101 is considered outdated by 2015 standards, it is fair to think of this dataset as the Grandfather of the more modern ImageNet dataset. Thanks Fei-Fei Li.

Category-based datasets: the infamous Caltech-101 TorralbaArt image

Bins, Grids, and Visual Words (aka Machine Learning meets descriptors): ~2000 to ~2005
After the community shifted towards more ambitious object recognition problems and away from geometry recovery problems, we had a flurry of research in Bag of Words, Spatial Pyramids, Vector Quantization, as well as machine learning tools used in any and all stages of the computer vision pipeline.  Raw SIFT was great for wide-baseline stereo, but it wasn't powerful enough to provide matches between two distinct object instances from the same visual object category.  What was needed was a way to encode the following ideas: object parts can deform relative to each other and some image patches can be missing.  Overall, a much more statistical way to characterize objects was needed.

Visual Words were introduced by Josef Sivic and Andrew Zisserman in approximately 2003 and this was a clever way of taking algorithms from large-scale text matching and applying them to visual content.  A visual dictionary can be obtained by performing unsupervised learning (basically just K-means) on SIFT descriptors which maps these 128-D real-valued vectors into integers (which are cluster center assignments).  A histogram of these visual words is a fairly robust way to represent images.  Variants of the Bag of Words model are still heavily utilized in vision research.
Josef Sivic's "Video Google": Matching Graffiti inside the Run Lola Run video

Another idea which was gaining traction at the time was the idea of using some sort of binning structure for matching objects.  Caltech-101 images mostly contained objects, so these grids were initially placed around entire images, and later on they would be placed around object bounding boxes.  Here is a picture from Kristen Grauman's famous Pyramid Match Kernel paper which introduced a powerful and hierarchical way of integrating spatial information into the image matching process.

Grauman's Pyramid Match Kernel for Improved Image Matching
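The binning idea itself is easy to sketch. Below is a toy version (not Grauman's actual Pyramid Match Kernel, which additionally weights the levels and compares two pyramids via histogram intersection): histogram the visual words over a coarse-to-fine grid of cells and concatenate everything into one long vector.

# A toy version of spatial binning: histogram the visual words over 1x1, 2x2,
# and 4x4 grids of cells and concatenate the results.
import numpy as np

def spatial_pyramid(word_ids, xs, ys, width, height, k, levels=3):
    """word_ids: visual-word index per keypoint; xs, ys: keypoint coordinates."""
    feats = []
    for level in range(levels):
        cells = 2 ** level                                   # 1, 2, 4 cells per side
        cx = np.minimum((xs * cells // width).astype(int), cells - 1)
        cy = np.minimum((ys * cells // height).astype(int), cells - 1)
        for i in range(cells):
            for j in range(cells):
                in_cell = word_ids[(cx == i) & (cy == j)]
                hist, _ = np.histogram(in_cell, bins=np.arange(k + 1))
                feats.append(hist)
    return np.concatenate(feats).astype(np.float32)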


At some point it was not clear whether researchers should focus on better features, better comparison metrics, or better learning -- in the mid-2000s, it wasn't obvious whether young PhD students should spend more time concocting new descriptors or kernelizing their support vector machines to death.

Object Templates (aka the reign of HOG and DPM): ~2005 to ~2010
Around 2005, a young researcher named Navneet Dalal showed the world just what could be done with his own new badass feature descriptor, HOG.  (It is sometimes written as HoG, but because it is an acronym for “Histogram of Oriented Gradients” it should really be HOG. The confusion must have come from an earlier approach called DoG, which stood for Difference of Gaussians, in which case the “o” should definitely be lower case.)

Navneet Dalal's HOG Descriptor


HOG came at a time when everybody was applying spatial binning to bags of words, using multiple layers of learning, and making their systems overly complicated. Dalal’s ingenious descriptor was actually quite simple.  The seminal HOG paper was published in 2005 by Navneet and his PhD advisor, Bill Triggs. Triggs got his fame from earlier work on geometric vision, and Dr. Dalal got his fame from his newfound descriptor.  HOG was initially applied to the problem of pedestrian detection, and one of the reasons it became so popular is that the machine learning tool used on top of HOG was simple and well understood: the linear Support Vector Machine.
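The whole recipe fits in a few lines. Here is a hedged sketch of the HOG-plus-linear-SVM pipeline using scikit-image's hog() and scikit-learn's LinearSVC; the 128x64 window and the cell/block parameters follow the Dalal-Triggs setup, while the training crops (pos_windows, neg_windows) are assumed to be provided by you.

# A sketch of the HOG + linear SVM pipeline using scikit-image and scikit-learn.
# The 128x64 window and cell/block parameters follow the Dalal-Triggs setup;
# pos_windows / neg_windows (lists of grayscale crops) are assumed to be given.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_descriptor(window):
    # window: 128x64 grayscale array -> 3780-D HOG vector
    return hog(window,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2),
               block_norm="L2-Hys")

def train_detector(pos_windows, neg_windows):
    X = np.array([hog_descriptor(w) for w in pos_windows + neg_windows])
    y = np.array([1] * len(pos_windows) + [0] * len(neg_windows))
    clf = LinearSVC(C=0.01)        # a single linear decision boundary on HOG
    clf.fit(X, y)
    return clf

At test time, the same window is slid over an image pyramid and the classifier's decision value is thresholded at each location -- that is essentially the whole detector.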

I should point out that in 2008, a follow-up paper on object detection, which introduced a technique called the Deformable Parts-based Model (or DPM, as we vision guys call it), helped reinforce the popularity and strength of the HOG technique. I personally jumped on the HOG bandwagon in about 2008.  During my first few years as a grad student (2005-2008), I was hackplaying with my own vector-quantized filter bank responses, and I definitely developed some strong intuition regarding features.  In the end I realized that my own features were only "okay," and because I was applying them to the outputs of image segmentation algorithms, they were extremely slow.  Once I started using HOG, it didn’t take me long to realize there was no going back to custom, slow features.  Once I switched to a multiscale feature pyramid with a slightly improved version of HOG introduced by master hackers such as Ramanan and Felzenszwalb, I was processing images at 100x the speed of my earlier pipeline of multiple segmentations plus custom features.
The infamous Deformable Part-based Model (for a Person)

DPM was the reigning champ on the PASCAL VOC challenge, and one of the reasons it became so popular was the excellent MATLAB/C++ implementation by Ramanan and Felzenszwalb.  I still know many researchers who never fully acknowledged what releasing such great code really meant for the fresh generation of incoming PhD students, but at some point it seemed like everybody was modifying the DPM codebase for their own CVPR attempts.  Too many incoming students were lacking solid software engineering skills, and giving them the DPM code was a surefire way to get some experiments up and running.  Personally, I never jumped on the parts-based methodology, but I did take apart the DPM codebase several times.  However, when I put it back together, the Exemplar-SVM was the result.

Big data, Convolutional Neural Networks and the promise of Deep Learning: ~2010 to ~2015
Sometime around 2008, it was pretty clear that scientists were getting more and more comfortable with large datasets.  It wasn’t just the rise of “Cloud Computing” and “Big Data” -- it was the rise of the data scientist: hacking on equations by morning, developing a prototype during lunch, deploying large-scale computations in the evening, and integrating the findings into a production system by sunset.  During my two summers at Google Research, I saw lots of guys who had made their fame as vision hackers.  But they weren’t just writing “academic” papers at Google -- they were sharding datasets with one hand, compiling results for their managers, writing Borg scripts in their sleep, and piping results into gnuplot (because Jedis don’t need GUIs?). It was pretty clear that big data and a DevOps mentality were here to stay, and the vision researcher of tomorrow would be quite comfortable with large datasets.  No longer did you need one guy with a mathy PhD, one software engineer, one manager, and one tester -- there were plenty of guys who could do all of those jobs.

Deep Learning: 1980s - 2015
2014 was definitely a big year for Deep Learning.  What’s interesting about Deep Learning is that it is a very old technique.  What we're seeing now is essentially the Neural Network 2.0 revolution -- but this time around, we're 20 years ahead R&D-wise and our computers are orders of magnitude faster.  And what’s funny is that the guys who were championing such techniques in the early 90s were the same guys we were laughing at in the late 90s (because clearly convex methods were superior to the magical NN learning-rate knobs). I guess they really had the last laugh, because eventually these relentless neural network gurus became the guys we now all look up to.  Geoffrey Hinton, Yann LeCun, Andrew Ng, and Yoshua Bengio are the 4 Titans of Deep Learning.  By now, just about everybody has jumped ship to become a champion of Deep Learning.

But with Google, Facebook, Baidu, and a multitude of little startups riding the Deep Learning wave, who will rise to the top as the master of artificial intelligence?


How do today's deep learning systems resemble the recognition systems of yesteryear?
Multiscale convolutional neural networks aren't that much different than the feature-based systems of the past.  The first-level neurons in deep learning systems learn to utilize gradients in a way that is similar to hand-crafted features such as SIFT and HOG.  Objects used to be found in a sliding-window fashion, but now it is easier and sexier to think of this operation as convolving an image with a filter. Some of the best detection systems used to use multiple linear SVMs combined in some ad-hoc way, and now we are essentially using even more of such linear decision boundaries.  Deep learning systems can be thought of as multiple stages of applying linear operators and piping them through a non-linear activation function; in that sense, deep learning is more similar to a clever combination of linear SVMs than to a memory-heavy, kernel-based learning system.
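To see the "sliding window = convolution" point concretely, here is a toy numpy/scipy sketch (random filters stand in for learned weights): evaluating one linear template densely over an image is just cross-correlation, and stacking such linear maps with a ReLU in between gives you the skeleton of a convolutional network.

# Toy sketch of "sliding window = convolution": scoring one linear template
# densely over an image is cross-correlation; stacking such linear maps with a
# ReLU in between gives the skeleton of a convolutional network.
# Random filters stand in for learned weights.
import numpy as np
from scipy.signal import correlate2d

def relu(x):
    return np.maximum(x, 0.0)

def score_everywhere(feature_map, template):
    # one detector evaluated at every location == one convolutional filter
    return correlate2d(feature_map, template, mode="valid")

rng = np.random.default_rng(0)
image = rng.standard_normal((64, 64))
stage1 = [rng.standard_normal((5, 5)) for _ in range(4)]     # "first-layer" filters
maps = [relu(score_everywhere(image, f)) for f in stage1]
stage2 = rng.standard_normal((3, 3))                          # "second-layer" filter
scores = sum(score_everywhere(m, stage2) for m in maps)
print("dense detection scores:", scores.shape)                # (58, 58)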

Features these days aren't engineered by hand.  However, the architectures of deep systems are still being designed manually -- and it looks like the experts are the best at this task.  The operations inside both classic and modern recognition systems are still very much the same.  You still need to be clever to play in the game, but now you also need a big computer. There's still a lot of room for improvement, so I encourage all of you to be creative in your research.

Research-wise, it never hurts to know where we have been before so that we can better plan for our journey ahead.  I hope you enjoyed this brief history lesson, and the next time you look for insights in your research, don't be afraid to look back.

To learn more about computer vision techniques:

Some Computer Vision datasets:
Caltech-101 Dataset
ImageNet Dataset

To learn about the people mentioned in this article:
Kristen Grauman (creator of Pyramid Match Kernel, Prof at Univ of Texas)
Bill Triggs (co-creator of HOG, Researcher at INRIA)
Navneet Dalal (co-creator of HOG, now at Google)
Yann LeCun (one of the Titans of Deep Learning, at NYU and Facebook)
Geoffrey Hinton (one of the Titans of Deep Learning, at Univ of Toronto and Google)
Andrew Ng (leading the Deep Learning effort at Baidu, Prof at Stanford)
Yoshua Bengio (one of the Titans of Deep Learning, Prof at U Montreal)
Deva Ramanan (one of the creators of DPM, Prof at UC Irvine)
Pedro Felzenszwalb (one of the creators of DPM, Prof at Brown)
Fei-Fei Li (Caltech-101 and ImageNet, Prof at Stanford)
Josef Sivic (Video Google and Visual Words, Researcher at INRIA/ENS)
Andrew Zisserman (Geometry-based methods in vision, Prof at Oxford)
Andrew E. Johnson (Spin Images creator, Researcher at JPL)
Martial Hebert (Geometry-based methods in vision, Prof at CMU)