Computer vision has transformed numerous tasks, from inspecting components on manufacturing lines to verifying the identity of passengers as they board flights. It can provide high-quality results in a wide array of situations and scenarios. However, as conditions become more complex and demanding, both the accuracy and utility of the technology drop off.
The problem is rooted in the nature of computer vision and artificial intelligence (AI). Data scientists build today's systems on convolutional neural networks trained with enormous volumes of data, in many cases tens of thousands of images. Yet a computer vision system likely will encounter challenges, and make serious errors, when it is forced to identify objects using incomplete information or under conditions it has not seen before.
"Today's computer vision systems are excellent at identifying things in a fixed environment that involves training," says Achuta Kadambi, an assistant professor of electrical engineering and computer science at the University of California, Los Angeles (UCLA). However, changes in lighting, visibility, or pressure can cause these systems to misjudge a situation, he says. "We cannot build an exhaustive training set that addresses all the real-world variables," he explains.
As a result, researchers are exploring ways to supplement conventional computer vision data with metadata collected from physics-based sensors and systems. In June 2023, a team from UCLA and the U.S. Army Research Laboratory introduced a novel approach in an academic paper that appeared in Nature Machine Intelligence. "The goal is to help these systems 'see' better by invoking ideas derived from physical laws," Kadambi says.
Adds Laleh Jalilian, an associate professor and anesthesiologist at the David Geffen School of Medicine at UCLA, a frequent collaborator with Kadambi: "Incorporating physics properties into computer vision could improve the accuracy of many devices and introduce entirely new technologies."
It is tempting to marvel at the power of today's computer vision technology, but engineers and product designers have been forced to confront an inconvenient truth: deep learning models operate without any intrinsic understanding of the objects and environments they "see." Results are based entirely on a model's ability to accurately predict what comes next.
Even the most advanced computer vision methods deliver limited predictive capabilities. Kadambi points out, for example, that when a deep learning model attempts to map the trajectory of a baseball or an airplane in motion, things can go terribly awry. Because a deep learning system is not geared to explicitly model environmental factors such as air pressure, resistance and weather, "The trajectory can range from slightly off to absurdly inaccurate," Kadambi says.
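To make the point about unmodeled environmental factors concrete, the following is a minimal sketch, not anything taken from Kadambi's work. It compares a projectile's range with and without air resistance. The function name `simulate_range`, the lumped drag constant `drag_k`, and the launch values are all illustrative assumptions.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def simulate_range(speed, angle_deg, drag_k=0.0, dt=0.001):
    """Return the horizontal distance (m) a projectile covers before landing.

    Uses explicit Euler integration. drag_k is an assumed lumped
    quadratic-drag constant in 1/m, so the drag acceleration is
    -drag_k * |v| * v. With drag_k = 0.0 the motion is drag-free.
    """
    angle = math.radians(angle_deg)
    x, y = 0.0, 0.0
    vx, vy = speed * math.cos(angle), speed * math.sin(angle)
    while True:
        v = math.hypot(vx, vy)
        ax = -drag_k * v * vx
        ay = -G - drag_k * v * vy
        x_new = x + vx * dt
        y_new = y + vy * dt
        if y_new < 0.0:
            # Linearly interpolate to the point where the path crosses the ground.
            frac = y / (y - y_new)
            return x + frac * (x_new - x)
        x, y = x_new, y_new
        vx += ax * dt
        vy += ay * dt


# Hypothetical launch: 40 m/s at 35 degrees.
vacuum_range = simulate_range(40.0, 35.0)
# drag_k = 0.006 is an assumed, roughly baseball-like value, not a measured one.
drag_range = simulate_range(40.0, 35.0, drag_k=0.006)
print(f"no drag: {vacuum_range:.1f} m, with drag: {drag_range:.1f} m")
```

A model that has only ever fit drag-free paths has no way to account for the gap between the two numbers, which is the kind of error the quote describes.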
Such anomalies are not particularly important for a face scan, or when a system encounters multiple instances of the same item on an assembly line. However, for objects in motion and those that require advanced three-dimensional (3D) predictive capabilities based on planar geometry—for autonomous vehicles and some medical instruments, for example—problems can emerge. "The actual physical environment isn't incorporated explicitly into the computer vision framework," Kadambi notes.
These edge cases, which are not entirely uncommon, can cause a system to react in unpredictable, sometimes fatal, ways. For instance, in 2018, a woman was killed by a self-driving Uber vehicle because it failed to recognize she was walking a bicycle across the street. The computer vision system had been trained to spot pedestrians and bicycles, but not both together at the same time.
The hybrid approach that Kadambi and his fellow researchers developed takes direct aim at this challenge. By incorporating attributes based in physics—metadata derived from connected Internet of Things (IoT) devices, quantum sensors, and general human knowledge about physical properties—it is possible to reach an intelligence level that more closely resembles that of a human. Suddenly, a model can apply known properties of gravity, resistance, weight, motion, and air pressure to generate far more accurate predictions.
This framework focuses on three major areas: tagging objects with additional information that defines the way they behave; injecting physics into network architectures through coding that cameras and AI systems can read; and plugging physics data into training sets to build more robust AI models. The result is an autonomous vehicle, robot, or drone that likely will navigate better in inclement weather and under other difficult conditions.
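One common way to bring physical knowledge into training, shown here only as an illustrative sketch and not as the method from the paper, is to add a penalty that grows when a model's predictions violate a known law. The example assumes a model that predicts an object's height at evenly spaced time steps. The names `data_loss`, `physics_residual`, and `total_loss`, and the `weight` parameter, are hypothetical.

```python
G = 9.81  # gravitational acceleration in m/s^2


def data_loss(pred, observed):
    """Mean squared error between predicted and observed heights."""
    return sum((p - o) ** 2 for p, o in zip(pred, observed)) / len(pred)


def physics_residual(heights, dt):
    """Mean squared gap between the finite-difference acceleration and -G.

    For an object in free fall (ignoring drag), the second difference of
    the heights divided by dt^2 should be close to -G at every step.
    """
    residuals = []
    for i in range(1, len(heights) - 1):
        accel = (heights[i + 1] - 2 * heights[i] + heights[i - 1]) / dt ** 2
        residuals.append((accel + G) ** 2)
    return sum(residuals) / len(residuals)


def total_loss(pred, observed, dt, weight=0.1):
    """Data-fit term plus an assumed, tunable weight on the physics penalty."""
    return data_loss(pred, observed) + weight * physics_residual(pred, dt)


dt = 0.1
# Heights that follow free fall exactly: y = 20 t - 0.5 G t^2.
free_fall = [20.0 * (i * dt) - 0.5 * G * (i * dt) ** 2 for i in range(6)]
# A straight-line guess that ignores gravity altogether.
straight = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]

print(physics_residual(free_fall, dt))  # close to zero
print(physics_residual(straight, dt))   # large: the path shows no acceleration
```

The penalty is near zero for the physically plausible path and large for the straight line, so a model trained against `total_loss` is nudged toward predictions consistent with gravity even where training images are sparse.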
An Eye on Reality
In the future, physics metadata could also pay dividends beyond robotics and automated systems. At UCLA Medical Center, for instance, Jalilian, who has a background in engineering, is exploring ways to improve the precision and accuracy of medical devices through data tagging. This includes pulse oximeters, which sometimes generate errors based on skin color.
Jalilian also is looking at using camera-generated images and multimodal sensors to measure patient vitals and glean other data remotely. "The technology could support far more advanced telemedicine. Remote medical devices with ambient AI algorithms running on a video stream could provide insights into a person's condition," she says. For example, a system might detect a patient with low blood oxygen and alert a physician. "This flips the equation from reactive to proactive medicine."
Other researchers also are examining ways to supplement computer vision with data. For example, a group of researchers from the Massachusetts Institute of Technology (MIT) and IBM have developed a framework that relies on object recognition methods modeled after the human brain. This additional neural data results in more "human-like" processing, says MIT professor James DiCarlo. In fact, adding biological data to vision streams led to a higher accuracy level for categorizing objects, though the system also tended to fail where humans fail.
While the idea of enhancing machine data with tags and other forms of metadata emanating from the physical world is only starting to take shape, Kadambi and others are optimistic the technique will lead to more robust and accurate computer vision—along with an ability to avoid false positives that can plague vision systems. Machines that see better could fundamentally change the way robotics, automation technologies, and other sensing systems operate, and how and where they are used.
"Images are fundamentally different than data that comes from language-based systems because images are generated based on the laws of optical physics," Kadambi concludes. Plugging physics-based metadata into computer vision systems "can produce far better models and greatly enhance their abilities."
Samuel Greengard is an author and journalist based in West Linn, OR, USA.