Saturday, November 07, 2009

A model of thought: The Associative Indexing of the Memex

The Memex "Memory Extender" is an organizational device, a conceptual device, and a framework for dealing with conceptual relationships in an associative way. Abandoning the Aristotelian tradition of rooting concepts in definitions, the Memex suggests an association-based, non-parametric, and data-driven representation of concepts.

Since the mind=software analogy is so deeply engraved in my thoughts, it is hard for me to see intelligent reasoning as anything but a computer program (albeit one which we might never discover/develop). It is worthwhile to see sketches of the memex from an era before computers. (See Figure below). However, with the modern Internet, a magnificent example of a Bush's ideology, with links denoting the associations between pages, we need no better analogy. Bush's critique of the artificiality of traditional schemes of indexing resonates in the world wide web.


A Mechanical Memex Sketch

By extrapolating Bush's anti-indexing argument to visual object recognition, I realize that the blunder is to assign concepts to rigid categories. The desire to break free from categorization was the chief motivation for my Visual Memex paper. If Bush's ideas were so successful in predicting the modern Internet, we should ask ourselves, "Why are categories so prevalent in computational models of perception?" Maybe it is machine learning, with its own tradition of classes in supervised learning approaches, that has scarred the way we computer scientists see reality.

“The human mind does not work that way. It operates by association. With one item in its grasp, it snaps instantly to the next that is suggested by the association of thoughts, in accordance with some intricate web of trails carried by the cells of the brain. It has other characteristics, of course; trails that are not frequently followed are prone to fade, items are not fully permanent, memory is transitory. Yet the speed of action, the intricacy of trails, the detail of mental pictures, is awe-inspiring beyond all else in nature.” -- Vannevar Bush

Is Google's grasp of the world of information anything more than a Memex? I'm not convinced that it is not. While the feat of searching billions of web pages in real time has already been demonstrated by Google (and reinforced every day), the best computer vision approaches as of today resemble nothing like Google's data-driven way of representing concepts. I'm quite interested in pushing this link-based data-driven mentality to the next level in the field of Computer Vision. Breaking free from the categorization assumptions that plague computational perception might the the key ingredient in the recipe for success.

Instead of summarizing, here is another link to a well-written article on the Memex by Finn Brunton. Quoting Brunton, "The deepest implications of the Memex would begin to become apparent here: not the speed of retrieval, or even the association as such, but the fact that the association is arbitrary and can be shared, which begins to suggest that, at some level, the data itself is also arbitrary within the context of the Memex; that it may not be “the shape of thought,” emphasis on the the, but that it is the shape of a new thought, a mediated and mechanized thought, one that is described by queries and above all by links."

Thursday, November 05, 2009

The Visual Memex: Visual Object Recognition Without Categories


Figure 1
I have discussed the limitations of using rigid object categories in computer vision, and my CVPR 2008 work on Recognition as Association was a move towards developing a category-free model of objects. I was primarily concerned with local object recognition where the recognition problem was driven by the appearance/shape/texture features derived from within a segment (a region extraction from an image using an image segmentation algorithm). Recognition of objects was done locally and independently per region, since I did not have good model of category-free context at that time. I've given the problem of contextual object reasoning much thought over the past several years, and equipped with the power of graphical models and learning algorithms I now present a model for category-free object relationship reasoning.

Now its 2009, and its no surprise that I have a paper on context. Context is the new beast and all the cool kids are using it for scene understanding; however, categories are used so often for this problem that their use is rarely questioned. In my NIPS 2009 paper, I present a category-free model of object relationships and address the problem of context-only recognition where the goal is to recognize an object solely based on contextual cues. Figure 1 shows an example of such a prediction task. Given K objects and their spatial configuration, is it possible to predict the appearance of a hidden object at some spatial location?

Figure 2


I present a model called the Visual Memex (visualized in Figure 2), which is a non-parametric graph-based model of visual concepts and their interactions. Unlike traditional approaches to object-object modeling which learn potentials between every pair of categories (the number of such pairs scales quadratically with the number of categories), I make no category assumptions for context.

The official paper is out, and can be found on my project page:

Tomasz Malisiewicz, Alexei A. Efros. Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships. In NIPS, December 2009. PDF

Abstract: The use of context is critical for scene understanding in computer vision, where the recognition of an object is driven by both local appearance and the object's relationship to other elements of the scene (context). Most current approaches rely on modeling the relationships between object categories as a source of context. In this paper we seek to move beyond categories to provide a richer appearance-based model of context. We present an exemplar-based model of objects and their relationships, the Visual Memex, that encodes both local appearance and 2D spatial context between object instances. We evaluate our model on Torralba's proposed Context Challenge against a baseline category-based system. Our experiments suggest that moving beyond categories for context modeling appears to be quite beneficial, and may be the critical missing ingredient in scene understanding systems.

I gave at talk about my work yesterday at CMU's Misc-read and received some good feedback. I'll be at NIPS this December representing this body of research.

Monday, October 26, 2009

Wittgenstein's Critique of Abstract Concepts

In his Philosophical Investigations, Wittgenstein argues against abstraction -- via several thought experiments he strives to annihilate the view that during their lives humans develop neat and consistent concepts in their minds (akin to building a dictionary). He criticizes the commonplace notions of meaning and concept formation (as were commonly used in philosophical circles at the time) and has contributed greatly to my own ideas regarding categorization in computer vision.

Wittgenstein asks the reader to come up with the definition of the concept "game." While we can look up the definition of "game" in a dictionary, we can't help but feel that any definition will be either too narrow or too broad. The number of exceptions we would need in a single definition scales as the number of unique games we've been exposed to. His point wasn't that game cannot be defined -- it was that the lack of a formal definition does not prevent us from using the word "game" correctly. Think of a child growing up and being exposed to multi-player games, single-player games, fun games, competitive games, games that are primarily characterized by their display of athleticism (aka sports or Olympic Games). Let's not forget activities such as courting and the Stock Market which are also referred to as "games." Wittgenstein criticizes the idea that during our lives we somehow determine what is common between all of those examples of games and form an abstract concept of game which determines how we categorize novel activities. For Wittgenstein, our concept of game is not much more than our exposure to activities labeled as games and our ability to re-apply the word game in future context.

Wittgenstein's ideas are an antithesis to Platonic Realism and Aristotle's Classical notion of Categories, where concepts/categories are pure, well-defined, and possess neatly defined boundaries. For Wittgenstein, experience is the anchor which allows us to measure the similarity between a novel activity and past activities referred to as games. Maybe the ineffability of experience isn't because internal concepts are inaccessible to introspection, maybe there is simply no internal library of concepts in the first place.

An experience-based view of concepts (or as my advisor would say, a data-driven theory of concepts) suggests that there is no surrogate for living a life rich with experience. While this has implications for how one should live their own life, it also has implications in the field of artificial intelligence. The modern enterprise of "internet vision" where images are labeled with categories and fed into a classifier has to be questioned. While I have criticized categories, there are also problems with a purely data-driven large-database-based approach. It seems that a good place to start is by pruning away redundant bits of information; however, judging what is redundant and how is still an open question.

Monday, October 19, 2009

Scene Prototype Models for Indoor Image Recognition

In today's post I want to briefly discuss a computer vision paper which has caught my attention.

In the paper Recognizing Indoor Scenes, Quattoni and Torralba build a scene recognition system for categorizing indoor images. Instead of performing learning directly in descriptor space (such as the GIST over the entire image), the authors use a "distance-space" representation. An image is described by a vector of distances to a large number of scene prototypes. A scene prototype consists of a root feature (the global GIST) as well as features belonging to a small number of regions associated with the prototype. One example of such a prototype might be an office scene with a monitor region in the center of the image and a keyboard region below it -- however the ROIs (which can be thought of as parts of the scene) are often more abstract and do not neatly correspond to a single object.


The learning problem (which is solved once per category) is then to find the internal parameters of each prototype as well as the per-class prototype distance weights which are used for classification. From a distance function learning point of view, it is rather interesting to see distances to many exemplars being used as opposed to the distance to a single focal exemplar.

Although the authors report results on the image categorization task it is worthwhile to ask if scene prototypes could be used for object localization. While it is easy to be the evil genius and devise an image that is unique enough such that it doesn't conform to any notion of a prototype, I wouldn't be surprised if 80% of the images we encounter on the internet conform to a few hundred scene prototypes. Of course the problem of learning such prototypes from data without prototype-labeling (which requires expert vision knowledge) is still open. Overall, I like the direction and ideas contained in this research paper and I'm looking forward to see how these ideas develop.

Tuesday, October 13, 2009

What is segmentation-driven object recognition?

In this post, I want to discuss what the term "segmentation-driven object recognition" means to me. While segmentation-only and object recognition-only research papers are ubiquitous in vision conferences (such as CVPR , ICCV, and ECCV), a new research direction which uses segmentation for recognition has emerged. Many researchers pushing in this direction are direct descendants of the great J. Malik such as Belongie, Efros, Mori, and many others. The best example of segmentation-driven recognition can be found in Rabinovich's Objects in Context paper. The basic idea in this paper is to compute multiple stable segmentations of an input image using Ncuts and use a dense probabilistic graphical model over segments (combining local terms and segment-segment context) to recognize objects inside those regions.

Segmentation-only research focuses on the actual image segmentation algorithms -- where the output of a segmentation algorithm is a partition of a 2D image into contiguous regions. Algorithms such as mean-shift, normalized cuts, as well as 100s of probabilistic graphical models can be used produce such segmentations. The Berkeley group (in an attempt to salvage "mid-level" vision) has been working diligently on boundary detection and image segmentation for over a decade.

Recognition-only research generally focuses on new learning techniques or building systems to perform well on detection/classification benchmarks. The sliding window approach coupled with bag-of-words models has dominated vision and is the unofficial method of choice.

It is easy to relax the bag-of-words model, so let's focus on rectangles for a second. If we do not use segmentation, the world of objects will have to conform to sliding rectangles and image parsing will inevitably look like this:

(Taken from Bryan Russell's Object Recognition by Scene Alignment paper).

It has been argued that segmentation is required to move beyond the world of rectangular windows if we are to successfully break up images into their constituent objects. While some objects can be neatly approximated by a rectangle in the 2D image plane, to explain away an arbitrary image free-form regions must be used. I have argued this point extensively in my BMVC 2007 paper, and the interesting result was that multiple segmentations must by used if we want to produce reasonable segments. Sadly, segmentation is generally not good enough by itself to produce object-corresponding regions.


(Here is an example of the Mean Shift algorithm where to get a single cow segment two adjacent regions had to be merged.)

The question of how to use segmentation algorithms for recognition is still open. If segmentation could tessellate an image into "good" regions in one-shot then the goal of recognition is to simply label these regions and life becomes simple. This is unfortunately far from reality. While blobs of homogeneous appearance often correspond to things like sky, grass, and road, many objects do not pop out as a single segment. I have proposed using a soup of such segments that come from different algorithms being ran with different parameters (and even merging pairs and triplets of such segments!) but this produces a large number of regions and thus making the recognition task harder.

Using a soup of segments, a small fraction of the regions might be of high quality; however, recognition now has to throw away 1000s of misleading segments. Abhinav Gupta, a new addition to CMU vision community, has pointed out that if we want to model context between segments (and for object-object relationships this means a quadratic dependence on the number of segments), using a large soup of segments in simply not tractable. Either the number of segments or the number of context interactions has to be reduced in this case, but non-quadratic object-object context models are an open question.

In conclusion, the representation used by segmentation (that of free-form regions) is superior to sliding window approaches which utilize rectangular windows. However, off-the-shelf segmentation algorithms are still lacking with respect to their ability to generate such regions. Why should an algorithm that doesn't know anything about objects be able to segment out objects? I suspect that in the upcoming years we will see a flurry of learning-based segmenters that provide a blend of recognition and bottom-up grouping, and I envision such algorithms to be used a strictly non-feedforward way.

Saturday, September 19, 2009

joint regularization slides

Trevor Darrell posted his slides from BAVM about joint regularization across classifier learning. I think this is a really cool and promising idea and I plan on applying it to my own research on local distance function learning when I get back to CMU in October.

The idea is there should be significant overlap between what a cat classifier learns and what a dog classifier learns. So why independently learn two classifiers?

My paper on the Visual Memex got accepted to NIPS 2009
so I will be there representing my work in December. Be sure to read future blog posts about this work which strives to break free from using categories in Computer Vision.

On another note, today was my last day interning at Google (a former Robograd was my mentor) and I will be driving back to Pittsburgh from Mountain View this Sunday. Yosemite is the first stop! I plan on doing some light hiking with my new Vibram Five Fingers! I've been using them for deadlifting and they've been great for both working out and just chilling/coding around the Googleplex.


Tuesday, August 18, 2009

exciting stuff at BAVM2009 #1: joint regularization

There were a couple of cool computer vision ideas that I was exposed to at BAVM2009. First, Trevor Darrell mentioned some cool work by Ariadna Quattoni on L1/L_inf regularization. The basic idea, which has also recently been used in other ICML 2009 works such as Han Liu and Mark Palatucci's Blockwise Coordinate Descent, is that you want to regularize across a bunch of problems. This is sometimes referred to as multi-task learning. Imagine solving two SVM optimization problems to find linear classifiers for detecting cars and bicycles in images. It is reasonable to expect that in high dimensional spaces these two classifiers will something in common. To provide more intuition, it might be the case that your feature set provides many irrelevant variables and when learning these classifiers independently much work is spent on removing these dumb variables. By doing some sort of joint regularization (or joint feature selection), you can share information across seemingly distinct classification problems.

In fact, when I was talking about my own CVPR08 work Daphne Koller suggested that this sort of regularization might work for my task of learning distance functions. However, I am currently exploiting the independence that I get from not doing any cross-problem regularization by solving the distance function learning problems independently. While regularization might be desirable, it couples problems and it might be difficult to solve hundreds of thousands of such problems jointly.

I will mention some other cool works in future posts.