Communications of the ACM

ACM News

A New Way of Seeing

A representation of Vision AI

Human vision is a remarkable thing. The ability to observe the surrounding world, with its light, shapes, and motion, allows us to accomplish an extraordinary array of tasks. Yet imbuing machines with these qualities is nothing short of daunting.

An emerging technology called Vision AI has the challenge within its sights, however. It harnesses advances in processing power and artificial intelligence (AI) to understand visual events at a deeper level.

"Image and video data are by their nature unstructured. This approach takes the unstructured content and turns it into structured and actionable data," explains Vinod Valloppillil, who leads the product management team for Cloud Language and Vision AI at Google.

Today, Vision AI is being used to detect manufacturing defects, to assess damage from natural disasters, and even to detect when someone is carrying a weapon. Soon, it will likely discern the health of trees in a forest and spot cancer cells in a biopsy.

Eye on AI

Just as personal computers and, later, the Web digitized paper and transformed the way humans process words and numbers, Vision AI is changing the way computers process images and video. It slides the dial "beyond simple object recognition," says Chhandomay Mandal, director of solutions marketing for Dell Technologies and an authority on Vision AI.

Early generations of image recognition tools merely identified objects; Vision AI aims to match or exceed human capabilities. "Anything you want to count, record, analyze, or store can be obtained by teaching Vision AI to look for it," said Isaac Roth, a partner at venture capital fund Shasta Ventures, in a 2020 VentureBeat article.

The goal of Vision AI is to expand image identification and analysis beyond a single object in a photo or video, such as a cat or a river. It tracks objects in motion, analyzes the background for subtle changes and variations, and attempts to understand the context and situation based on multiple events in the image. "Vision AI better identifies what data is important based on an application," Mandal says.

As a result, Vision AI has value across numerous fields, including industrial manufacturing, energy production, medicine, entertainment, and autonomous machines.

For example, the Zoological Society of London, an organization focused on protecting biodiversity, has tapped Vision AI to identify specific species in thousands of images. The process takes place in days rather than months, and the system can spot details that previously eluded researchers.

Fox Sports has turned to the technology to log and auto-discover video assets across millions of video clips residing in the cloud. This makes it possible to search by specific criteria, from players' jerseys and Peyton Manning touchdown passes to a specific type of injury or on-field celebration.

A View of the Vision

Developing next-generation algorithms that tap the power of Vision AI is at the center of today's efforts. "These systems must be able to identify objects and activities across a much wider scope of possibilities," Mandal says.

There's also an added element of incorporating human psychology and perception, Valloppillil says. "You have to understand what's relevant and what's interesting and match that with what the computer is able to do. It's okay for a computer to identify 'grass' or 'a person running' in a photo, but it's far more valuable to humans for a computer to recognize that the image is 'a quarterback tossing a touchdown pass.' To do the latter, the system needs to understand the context and conditions taking place."

Typically, data scientists train Vision AI systems to capture as much relevant data as possible from a set of events or scenarios. They may use several algorithms to generate responses to the various inputs; then they run the data through deep learning systems. After extensive statistical analysis and fine-tuning of neural networks, it is possible to begin using a Vision AI model.

Granted, it's a complicated task, but the technology will take machine vision to a far more sophisticated plane, Valloppillil says. For instance, it could help a drone or autonomous vehicle recognize an event that might go against its typical programming. "The Vision system might recognize a stop sign, but a higher-order model might have a reason to ignore a stop sign, or for a drone to avoid landing in a particular place because something is wrong," he says.

Vision AI could also introduce sophisticated features for Web browsers and mobile devices, particularly as apps like YouTube, Facebook, Instagram, TikTok, and Snapchat generate increasingly large volumes of unstructured image content.

Finally, the technology could introduce entirely new sensing techniques, such as gauging temperature and other environmental conditions from reflected light or from data streaming in from quantum sensors.

For now, the biggest challenge is developing more advanced training models and building out frameworks that allow Vision AI to be used more widely. Says Mandal, "We have just scratched the surface of what AI is capable of doing. As resources increase and algorithms evolve, Vision AI systems will be able to perform more advanced and useful analysis."

Samuel Greengard is an author and journalist based in West Linn, OR, USA.
