Thursday, March 24, 2011

Computer Vision is Artificial Intelligence

Computer vision is a diverse field and its researchers have multifaceted interests and aspirations.  It should not be surprising that no two vision researchers think about the field in the same way.  Different academic backgrounds foster alternative and potentially incommensurable interpretations.  It is as if W.V.O. Quine's thesis that no observation can be "theory-independent" directly applies to vision: a researcher in computer vision cannot uphold a view of their own field that is objective and independent of their own predispositions, upbringing, and educational program.  While I cannot speak clearly about the long-term goals of the entire body of researchers in vision, today I would like to discuss my own take on computer vision.  I do not offer the world an objective account of why computer vision intrigues me, but by sharing with the world the reasons why I find vision exciting, perhaps together we can break the boundaries of machine intelligence.

Cognitive Science is a computational study of the mind: McGill Cognitive Science

One of the biggest accomplishments in the field of Artificial Intelligence was when Deep Blue, a chess-playing program developed at IBM, beat the world chess champion, Garry Kasparov.  But this was in the early days of artificial intelligence -- when computer scientists still weren't sure what it means for a machine to be intelligent.  Chess is a well-known thinking-man's game, and at first glance it seems that a machine can only be worthy of being dubbed intelligent if it performs competitively on intelligent-people activities such as chess.

Chess: Human vs. Machine: Slate article about Deep Blue

Given the plethora of tasks that humans can effortlessly perform in daily life, is engineering a machine to rival humans on just one such task bringing researchers any closer to building truly intelligent machines?

The problem with chess is that it has a "finite universe problem" -- there is a finite number of primitives (the chess pieces) which can be manipulated by choosing a move from a finite set of allowable actions.  But if we think of normal life (going to work, eating dinner, talking to a friend) as a game, then it is not hard to see that most everyday situations involving humans involve a sea of infinite objects (just look around and name all the different objects you can see around you!) and an equally capacious space of allowable actions (consider all the things you could do with all those objects around you!).  Intelligence is what allows us to cope with the complexities of the universe by focusing our attention on a limited set of relevant variables -- but the working set of objects/concepts we must consider at any single instant is chosen from a seemingly infinite set of alternatives.

I believe that everyday human-level visual intelligence is greatly undervalued by people -- and there is a very good reason for this!  The ability to make sense of what is going on in a single picture is such a trivial and automatic task for humans that we don't even bother quantifying just how good we are at it.  But let me reassure you that automated image understanding is no trivial feat.  The world is not composed of 20 visual object categories and the space of allowable and interpretable utterances we could associate with a static picture is seemingly infinite.  While the 20 category object detection task (as popularized by the PASCAL VOC) does have a finite universe problem, the grander version of the vision master problem (a combination of detection/recognition/categorization where you can interpret an input any way you like) is much more complex and mirrors the structure of the external world well.

Robotics Challenge: Build a Robot like Bender

Any application which calls for automated analysis of images requires vision.  A robot, if it is to be successful at interacting with the world and performing useful tasks, needs to perceive the external world and organize it.  While some see vision as just one small piece of the "Robotics Challenge" (build a robot and make it do cool stuff), it is totally unclear to me where to draw the boundary between low-level pixel analysis and high-level cognitive scene understanding.  Over the years, I have been thinking more and more about this problem, and I've convinced myself that the interesting part of vision is precisely at the boundary between what is commonly thought of as low-level representation of signal and what is considered high-level representation of visual concepts.  While some view computer vision as "applied mathematics" or "applied machine learning" or "image processing in disguise", I passionately believe the following:

Computer Vision is Artificial Intelligence

I am not promulgating the thesis that all aspects of machine intelligence are visual, but I want to assure you that there are enough high-level semantic capabilities which must be set in place for vision to work, that it is not worthwhile to think of vision as a smaller problem than general purpose intelligence.  I believe that once we have made progress on vision (not in the narrow-universe setting) to the point where generic visual scene understanding is effectively solved, there won't be much left that needs to go into the "ethereal" mind which cognitive scientists want to empower machines with!  The only way to make machines truly understand scenes, objects, and their interactions is to make machines know something about the fabric of human life, and it is important for machines to learn this for themselves from real-world experience.  This goes beyond representing object appearance because folk physics, folk psychology, causality, spatio-temporal continuity, etc. are all faculties which vision systems will need (at least the vision systems I want to ultimately build) for general purpose scene understanding.  I don't want to undermine the efforts of cognitive scientists (who work on many of the theories/ideas I've delineated before), but perhaps only to convince them that I have been a cognitive scientist all along.  I don't think placing a label on myself, by calling myself either a cognitive scientist, a computer vision researcher, or an AI researcher, is very conducive to good research.


  1. Feel free to comment if your own computer vision philosophy is at odds with anything I said.

  2. With the same argument, one can also say that Natural Language Processing is Artificial Intelligence too. If you think about it, to understand a sentence, it does not suffice to have a dictionary of sorts of the meaning of every word in the sentence. One needs context and knowledge about the real world for a deeper understanding. How to represent knowledge itself is a problem.

    1. In fact Natural Language Processing is Artificial Intelligence. Just like speaking the languages. Using language in the right context and understanding the context of the conversation is AI. Maybe that is only my opinion but I am sure about it.

  3. Anonymous 4:23 PM

    The subset of computer vision that you define here would be simply the identification of objects in visual space which I think is probably not high-level enough to be considered real AI.

    At the same time, from my limited knowledge of the AI domain, I don't think we have anything like that yet. Perhaps we should take baby steps on our march towards AI and knock these issues out one at a time?

  4. Perhaps start by learning from the human vision system, language system, etc.

  5. I think computer vision is an AI subfield.
    On the other hand a bat or other creature without vision surely could become intelligent.

  6. I think 'making sense of' any kind of sensor data is AI, which brings vision, sound, touch and any other ubiquitous sensing modality under its purview. In that sense Tomasz's statement is correct and being a computer vision student myself, I do not begrudge his loyalty to the field. Furthermore if you consider us humans, a large part of our intelligence (in > 90% of humans) has evolved from our visual systems. If you have read Hans Moravec's memo "Locomotion, Vision and Intelligence", you would know that there is in fact a correlation between evolution of vision and locomotion (which in itself was an evolutionary necessity for food gathering).

  7. If you think about it, to understand a sentence, it does not suffice to have a dictionary of sorts of the meaning of every word in the sentence. One needs context and knowledge about the real world for a deeper understanding.

  8. Stephen Tashiro 10:24 AM

    A common feature of the computer vision systems that appear in student projects and theses is that "the output" of the vision system is some sort of data structure that attempts to give complete information about the scene being analyzed. However, when I look at a scene, I don't have the sense of any comprehensive set of information that is simultaneously perceived. I can answer questions about a scene such as "is that printer in front of the stack of books or behind it?". If I form the intent to pick up a book from the stack of books, I can guide my hand to it. From the point of view of engineering a robot, it would seem convenient to have a data structure with "complete" information about the scene but it's unclear whether such a structure exists for human beings.

    I agree with the title "Computer Vision Is Artificial Intelligence". Implementing the whole process may include implementing a system that can formulate a question.

  9. Here is how I view intelligence... Complex creatures have goals. Humans are complex creatures. Humans have goals. What makes us unique compared to other creatures is how we adapt rapidly within a single lifetime to accomplish our goals. When I say 'goal' I'm speaking very generally about what I view as 'the main goal' of all creatures, survival. Most people reading this blog are of the intellectual type. We survive by convincing others that the thoughts that bubble out of our heads are worth money. We take this money and buy food. Yada, yada, division of labor, no big deal. However, the big deal is this: if civilization were to end tomorrow (i.e. there was no more use for our brains) every last one of us could start foraging for food, or start a garden, etc. To say it another way, most creatures are hard-wired to perform a small array of tasks to accomplish their ultimate goal of survival. If something unexpected comes up that interrupts whatever they do to survive, they perish. To put it in yet another way: fleas have a very narrow spectrum of potential behavior operations; humans, compared to fleas and every other creature we are aware of, have a very very broad spectrum of what we are capable of doing. And not only is our behavior spectrum immense but it isn't static.

    Human beings don't experience 'reality;' we do however experience a slice of reality. Vision is a slice of the slice. Any slice of reality that we do experience is dictated by the hardware that we are equipped with, or that we have built. Our eyes are hardware in this sense, but so is a spectrometer. We call ourselves intelligent because we navigate through this narrow band of reality successfully. We navigate successfully because we have sensors, but what propels us is the nagging, chemically ingrained thought that we must accomplish our goals. So to give another piece of equipment intelligence is to give it sensors and appendages that mesh with its goals, but to also give it the ability to analyze what it is sensing and place it within the context of its survival.

  10. Anonymous 2:28 PM

    It's interesting to take the thing from the opposite side. We don't see and know -- we know because we see -- so intelligence can be created from the visual learning experience. Epistemology aside, I think it's brilliant to turn the thing on its head and craft an approach. Bravo!

  11. Anonymous 3:02 PM

    I can say one thing (from experience): general computer vision is not possible without AI. Some simpler tasks are, but general computer vision with recognition and object matching is not possible without AI. I have experienced this myself (I am a software engineer) when I was working on a computer vision project which I thought would take 2 months, and it took 3 years. It worked out in the end, but even now it is not 100% (too many statistics, assumptions and thresholds which are in my opinion too inflexible for general use).

  12. Plenty of animals can perform sophisticated vision tasks, yet we don't consider them to have human level intelligence. Lower mammals, birds, and reptiles can recognize objects, perceive spatial relationships, and generally navigate the world using vision. The vast majority of visual perception does not require sophisticated cognitive capabilities at the human level. Therefore, there will be quite a bit of work left in the field of AI beyond the point at which computer vision is considered "solved."

  13. Computer vision, in its early days, was nothing more than signal processing. But today's systems are able to perform reasonably well on object recognition benchmarks. And if you look closely at the research community, you'll see plenty of attempts at tasks like "action recognition" and "emotion recognition" under the computer vision category. Just like any other hard CS problem, we keep raising the bar.

    We'll keep wanting more out of our vision systems, and once we get to 90% on a task, we'll strive for the next human-like ability. It used to be edge detection, then segmentation, then classification, then emotions, then actions, and the push won't stop.

    I'm not entirely sure where to draw the boundary between AI and Computer Vision. There's a lot to the perception problem, and the kind of "world knowledge" required to fully understand a picture goes beyond a mere 100,000-way classification problem. It's almost as if the software has to first live in the world, learn from the world, and then be applied to an image recognition task. Maybe embodiment is necessary for learning. Maybe not.

    Learning architectures of today seem to be converging, but we've been feeding object recognition algorithms the same kind of data for the past 20 years. There's a lot we know, and much more that we don't know.