But like many other goals in AI, computer vision has proven to be easier said than done. In 1966, scientists at MIT launched “The Summer Vision Project,” a two-month effort to create a computer system that could identify objects and background areas in images. But it took much more than a summer break to achieve those goals. In fact, it wasn’t until the early 2010s that image classifiers and object detectors were flexible and reliable enough to be used in mainstream applications.

In recent decades, advances in machine learning and neuroscience have helped make great strides in computer vision. But we still have a long way to go before we can build AI systems that see the world as we do. Biological and Computer Vision, a book by Harvard Medical School professor Gabriel Kreiman, provides an accessible account of how humans and animals process visual data and how far we’ve come toward replicating these functions in computers.

Kreiman’s book helps us understand the differences between biological and computer vision. It details how hundreds of millions of years of evolution have equipped us with a complex visual processing system, and how studying it has helped inspire better computer vision algorithms. Kreiman also discusses what separates contemporary computer vision systems from their biological counterparts. While I would recommend a full read of Biological and Computer Vision to anyone who is interested in the field, I’ve tried here (with some help from Gabriel himself) to lay out some of my key takeaways from the book.
Hardware differences
In the introduction to Biological and Computer Vision, Kreiman writes, “I am particularly excited about connecting biological and computational circuits. Biological vision is the product of millions of years of evolution. There is no reason to reinvent the wheel when developing computational models. We can learn from how biology solves vision problems and use the solutions as inspiration to build better algorithms.” And indeed, the study of the visual cortex has been a great source of inspiration for computer vision and AI.

But before being able to digitize vision, scientists had to overcome the huge hardware gap between biological and computer vision. Biological vision runs on an interconnected network of cortical cells and organic neurons. Computer vision, on the other hand, runs on electronic chips composed of transistors. Therefore, a theory of vision must be defined at a level of abstraction that can be implemented in computers while remaining faithful to what happens in living beings. Kreiman calls this the “Goldilocks resolution,” a level of abstraction that is neither too detailed nor too simplified.

For instance, early efforts in computer vision tried to tackle vision at a very abstract level, in a way that ignored how human and animal brains recognize visual patterns. Those approaches have proven to be very brittle and inefficient. At the other extreme, studying and simulating brains at the molecular level would be computationally intractable. “I am not a big fan of what I call ‘copying biology,’” Kreiman told TechTalks. “There are many aspects of biology that can and should be abstracted away. We probably do not need units with 20,000 proteins and a cytoplasm and complex dendritic geometries. That would be too much biological detail. On the other hand, we cannot merely study behavior—that is not enough detail.”

In Biological and Computer Vision, Kreiman defines the Goldilocks scale of neocortical circuits as the activity of neurons at millisecond resolution. Advances in neuroscience and medical technology have made it possible to study the activities of individual neurons at millisecond time granularity. And the results of those studies have helped develop different types of artificial neural networks, AI algorithms that loosely simulate the workings of cortical areas of the mammalian brain. In recent years, neural networks have proven to be the most effective algorithms for pattern recognition in visual data and have become the key component of many computer vision applications.
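To make that level of abstraction concrete, here is a minimal sketch (in Python with NumPy, purely as an illustration rather than anything taken from the book) of the kind of unit that artificial neural networks are built from: a weighted sum of inputs passed through a simple nonlinearity, with all of the molecular and dendritic detail deliberately left out.

```python
import numpy as np

def artificial_unit(inputs, weights, bias):
    """One 'unit' of an artificial neural network: a weighted sum of its
    inputs passed through a simple nonlinearity (here, ReLU).

    Everything below this level of description -- ion channels, proteins,
    dendritic geometry -- is abstracted away.
    """
    activation = np.dot(weights, inputs) + bias
    return max(0.0, activation)  # ReLU nonlinearity

# Illustrative use: a unit responding to a small patch of input activity.
rng = np.random.default_rng(0)
patch = rng.random(9)          # e.g., a 3x3 patch of input activations
weights = rng.normal(size=9)   # learned connection strengths
print(artificial_unit(patch, weights, bias=0.1))
```

Whether this scalar, rate-like description captures enough of the biology is exactly the question the Goldilocks framing is meant to answer.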
Architecture differences
The recent decades have seen a slew of innovative work in the field of deep learning, which has helped computers mimic some of the functions of biological vision. Convolutional layers, inspired by studies of the animal visual cortex, are very efficient at finding patterns in visual data. Pooling layers help generalize the output of a convolutional layer and make it less sensitive to the displacement of visual patterns. Stacked on top of each other, blocks of convolutional and pooling layers can go from finding small patterns (corners, edges, etc.) to complex objects (faces, chairs, cars, etc.).

But there’s still a mismatch between the high-level architecture of artificial neural networks and what we know about the mammalian visual cortex. “The word ‘layers’ is, unfortunately, a bit ambiguous,” Kreiman said. “In computer science, people use layers to connote the different processing stages (and a layer is mostly analogous to a brain area). In biology, each brain region contains six cortical layers (and subdivisions). My hunch is that the six-layer structure (the connectivity of which is sometimes referred to as a canonical microcircuit) is quite crucial. It remains unclear what aspects of this circuitry we should include in neural networks. Some may argue that aspects of the six-layer motif are already incorporated (e.g. normalization operations). But there is probably enormous richness missing.”

Also, as Kreiman highlights in Biological and Computer Vision, information in the brain moves in several directions. Visual signals travel from the retina through V1, V2, and other areas of the visual cortex before reaching the inferior temporal cortex. But each area also provides feedback to its predecessors, and within each area, neurons interact and pass information between each other. All these interactions and interconnections help the brain fill in the gaps in visual input and make inferences when it has incomplete information.

In contrast, in artificial neural networks, data usually moves in a single direction. Convolutional neural networks are “feedforward networks,” which means information flows only forward, from the input layer through the intermediate layers to the output layer. There is a mechanism called “backpropagation,” which helps correct mistakes and tune the parameters of neural networks. But backpropagation is computationally expensive and is only used during the training of neural networks, and it’s not clear whether it corresponds to the feedback mechanisms of cortical areas. Meanwhile, recurrent neural networks, which feed the output of higher layers back into the input of earlier layers, still have limited use in computer vision.

In our conversation, Kreiman suggested that lateral and top-down flows of information can be crucial to bringing artificial neural networks closer to their biological counterparts. “Horizontal connections (i.e., connections for units within a layer) may be critical for certain computations such as pattern completion,” he said. “Top-down connections (i.e., connections from units in a layer to units in a layer below) are probably essential to make predictions, for attention, to incorporate contextual information, etc.” He also pointed out that neurons have “complex temporal integrative properties that are missing in current networks.”
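To illustrate that one-way flow, here is a minimal sketch of a stacked convolution-and-pooling network written in PyTorch; the layer sizes and the ten-class output are arbitrary choices for the example, not something taken from the book. A single forward pass moves strictly from input to output, with none of the lateral or top-down connections Kreiman describes.

```python
import torch
from torch import nn

# A purely feedforward stack of convolution + pooling blocks.
# Earlier blocks respond to small patterns (edges, corners); later blocks
# combine them into larger, more complex patterns. Information flows in one
# direction only: there are no lateral connections within a block and no
# top-down feedback from later blocks to earlier ones.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # find local patterns
    nn.ReLU(),
    nn.MaxPool2d(2),                               # tolerate small displacements
    nn.Conv2d(16, 32, kernel_size=3, padding=1),   # combine into larger patterns
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 10),                   # task-specific output (e.g., 10 classes)
)

x = torch.randn(1, 3, 224, 224)  # a single RGB image
logits = model(x)                # one forward pass, input -> output
print(logits.shape)              # torch.Size([1, 10])
```

Backpropagation would adjust this model’s weights during training, but it does not run at inference time, which is one reason it is hard to equate it with the cortical feedback described above.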
Goal differences
Evolution has managed to develop a neural architecture that can accomplish many tasks. Several studies have shown that our visual system can dynamically tune its sensitivities to the goals we want to accomplish. Creating computer vision systems that have this kind of flexibility remains a major challenge, however. Current computer vision systems are designed to accomplish a single task. We have neural networks that can classify objects, localize objects, segment images into different objects, describe images, generate images, and more. But each of these neural networks can accomplish only one of those tasks.
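As a rough illustration of that single-task design, the sketch below loads two off-the-shelf torchvision models (the specific architectures are just convenient examples, not ones discussed in the book). The classifier and the detector are entirely separate networks with separate weights; neither can change its goal on the fly the way the biological visual system retunes itself.

```python
import torch
from torchvision import models

# Today's computer vision systems are typically one network per task.
# A classifier and an object detector are separate models with separate
# weights; neither can be re-tasked on the fly.
classifier = models.resnet50(weights=None)                         # image classification
detector = models.detection.fasterrcnn_resnet50_fpn(weights=None)  # object detection

image = torch.randn(3, 224, 224)  # a single (random) RGB image

classifier.eval()
detector.eval()
with torch.no_grad():
    class_scores = classifier(image.unsqueeze(0))  # one vector of class scores
    detections = detector([image])                 # boxes, labels, and scores

print(class_scores.shape)    # torch.Size([1, 1000])
print(detections[0].keys())  # dict_keys(['boxes', 'labels', 'scores'])
```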
Integration differences
In humans and animals, vision is closely tied to the senses of smell, touch, and hearing. The visual, auditory, somatosensory, and olfactory cortices interact and pick up cues from each other to adjust their inferences of the world. In AI systems, on the other hand, each of these senses exists separately. Do we need this kind of integration to make better computer vision systems?

“As scientists, we often like to divide problems to conquer them,” Kreiman said. “I personally think that this is a reasonable way to start. We can see very well without smell or hearing. Consider a Chaplin movie (and remove all the minimal music and text). You can understand a lot. If a person is born deaf, they can still see very well. Sure, there are lots of examples of interesting interactions across modalities, but mostly I think that we will make lots of progress with this simplification.”

However, a more complicated matter is the integration of vision with more complex areas of the brain. In humans, vision is deeply integrated with other brain functions such as logic, reasoning, language, and common sense knowledge. “Some (most?) visual problems may ‘cost’ more time and require integrating visual inputs with existing knowledge about the world,” Kreiman said.

He pointed to the following picture of former U.S. President Barack Obama as an example. “No current architecture can do this. All of this will require dynamics (we do not appreciate all of this immediately and usually use many fixations to understand the image) and integration of top-down signals,” Kreiman said.

Areas such as language and common sense are themselves great challenges for the AI community. But it remains to be seen whether they can be solved separately and integrated together along with vision, or whether integration itself is the key to solving all of them. “At some point we need to get into all of these other aspects of cognition, and it is hard to imagine how to integrate cognition without any reference to language and logic,” Kreiman said. “I expect that there will be major exciting efforts in the years to come incorporating more of language and logic in vision models (and conversely incorporating vision into language models as well).”