In March 2022, a synthesized video of Ukrainian President Volodymyr Zelenskyy appeared on various social media platforms and a national news website. In the video, Zelenskyy urges his people to surrender in their fight against Russia; however, the speaker is not Zelenskyy at all. The minute-long clip was a deepfake, a synthesized video produced via deep learning models, and the president soon posted a legitimate message reaffirming his nation's commitment to defending its land and people.
The Ukrainian government had already been warning the public that state-sponsored deepfakes could be used as part of Russia's information warfare. The video itself was not particularly realistic or convincing, but the quality of deepfakes has been improving rapidly. "You have to be a little impressed by synthetic media," says University of California, Berkeley computer scientist and digital forensics expert Hany Farid. "In five years, we have gone from pretty cruddy, low-resolution videos to full-blown, high-resolution, very sophisticated 'Tom Cruise TikTok' deepfakes. It's evolving at light speed. We're entering a stage where it's becoming surprisingly easy to distort reality."
In some cases, such as the aforementioned TikTok example, in which a company generated a set of videos that closely resemble the famous actor, the result can be entertaining. Startups are engineering deepfake technology for companies to use in marketing videos, and Hollywood studios are slipping hyper-realistic digital characters into movies alongside human actors. Yet the malicious use of this technology for disinformation, blackmail, and other unsavory ends is concerning, according to researchers. If the deepfake of Zelenskyy had been as realistic as one of those Tom Cruise clips, the synthesized video could have had terrible consequences.
The potential for nefarious applications, and the pace at which deepfake techniques are evolving, have sparked a race between the groups generating synthetic media and the scientists working to find more effective and resilient ways of detecting it. "We're playing this chess game in which detection is trying to keep pace or advance ahead of creation," says computer scientist Siwei Lyu of the University at Buffalo, State University of New York. "Once they know the tricks we use to detect them, they can fix their models to make the detection algorithms less effective. So every time they fix one, we have to develop a better one."
The roots of deepfake technology can be traced back to the development of generative adversarial networks (GANs) in 2014. The GAN approach pits two models against each other. In their paper introducing the concept, Ian Goodfellow and colleagues described the two models as analogous to the "game" between counterfeiters and police; the former tries to outwit the latter, and competition pushes them to a point at which the counterfeits approach the real thing. With deepfakes, the first model generates a synthetic image and the second tries to detect it as a fake. As the pair iterate, the generative model corrects its flaws, resulting in better and better images.
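For readers who want the formal version, the two-model game is usually written as the minimax objective from the Goodfellow paper. The notation below comes from that paper rather than from this article: G is the generator, D is the detecting model (the paper calls it the discriminator), x is a real sample, and z is the random noise fed to the generator.

```latex
\[
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
\]
```

D is rewarded for scoring real samples near 1 and generated samples near 0, while G is rewarded when D mistakes its output for the real thing.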
In the early days, it was relatively easy for people to recognize a fake video; inconsistencies in skin tone or irregularities in facial structure and movement were common. As the synthesis engines have improved, however, detection has become increasingly difficult. "People often think they're better than they are at detecting fake content. We're falling for stuff, but we don't know it," says Sophie Nightingale, a psychologist studying deepfake recognition at Lancaster University in the U.K. "We're arguably at the point where the human perceptual system can't tell if something is real or fake."
To keep pace with the evolution of the technology, researchers have been developing tools to spot telltale signs of digital forgery. In 2018, ACM Distinguished Member Lyu and one of his students at the University at Buffalo were studying deepfake videos in hopes of building better detection models. After watching countless examples and using publicly available technology to generate videos of their own, they noticed something strange. "The faces did not blink!" Lyu recalls. "They did not have realistic blinking eyes, and in some cases they did not blink at all."
Eventually, they realized the lack of eye blinking in the videos was the logical outcome of the training data. The models that generate synthetic video are trained on still images of a given subject. Typically, photographers do not publish images in which their subjects' eyes are closed. "We only upload images with open eyes," Lyu explains, "and that bias is learned and reproduced."
Lyu and his student created a model that detected deepfakes based on the lack of eyeblinks or irregular patterns in eye blinking, but not long after they released their results, the next wave of synthetic videos evolved. The Zelenskyy video, while poor in quality, does feature the Ukrainian president blinking.
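The blink cue can be illustrated with a simplified sketch. It is not the researchers' model, which was a trained neural network; it assumes some upstream step has already produced a per-frame "eye openness" score between 0 and 1, and it simply counts blinks and flags clips with too few. The function names, the thresholds, and the openness signal itself are all hypothetical.

```python
def count_blinks(openness, closed_below=0.2, min_closed_frames=2):
    """Count runs of consecutive frames in which the eye looks closed.

    `openness` is a hypothetical per-frame score (1.0 = wide open).
    A run must last at least `min_closed_frames` to count as a blink,
    which filters out single-frame noise.
    """
    blinks = 0
    run = 0
    for value in openness:
        if value < closed_below:
            run += 1
        else:
            if run >= min_closed_frames:
                blinks += 1
            run = 0
    if run >= min_closed_frames:  # clip ended while the eye was closed
        blinks += 1
    return blinks


def blinks_per_minute(openness, fps):
    """Convert the blink count into a rate for a clip recorded at `fps`."""
    minutes = len(openness) / fps / 60.0
    if minutes <= 0:
        raise ValueError("clip must contain at least one frame")
    return count_blinks(openness) / minutes


def lacks_blinking(openness, fps, min_rate=5.0):
    """Flag a clip whose blink rate is below an illustrative threshold."""
    return blinks_per_minute(openness, fps) < min_rate
```

A rate threshold like this is exactly the kind of cue a generator can learn to satisfy, which is why the signal stopped being reliable once synthetic faces began to blink.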
The eye-blinking work is reflective of the predominant approach to detecting deepfakes: searching for evidence or artifacts of the generative or synthetic process. "These generative models learn about the subjects they recreate from training data," says Lyu. "You give them lots of data and they can create realistic synthetic media, but this is an inefficient way of learning about the real world, because anything that happens in the real world has to follow the laws of the real physical world, and that information is indirectly incorporated into the training data." Similarly, Lyu has pinpointed inconsistencies between the reflections in the corneas of the eyes of synthesized subjects and nearly imperceptible differences in the retinas.
The deep learning researcher Yuval Nirkin, currently a research scientist at CommonGround-AI, developed a detection method that compares the internal part of the face in a video with the surrounding context, including the head, neck, and hair regions. "The known video deepfake methods don't change the entire head," says Nirkin. "They focus only on the internal part of the face because while the human face has a simple geometry that is easy to model, the entire head is very irregular and contains a lot of very fine details that are difficult to reconstruct." Nirkin developed a model that segments a subject's face into inner and outer portions and extracts an identity signal from each. "If we find a discrepancy between the signals of the two parts," he explains, "then we can say someone altered the identity of the subject." The advantage of this approach, Nirkin adds, is that it is not focused on the flaws or artifacts associated with one particular deepfake generation model and can thereby be applied to unseen techniques.
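The comparison step can be sketched in a few lines, under strong assumptions. Suppose two hypothetical recognition networks have already turned the inner face and the surrounding context into identity vectors; the sketch then measures how well the two agree using cosine similarity. The published method learns its own discrepancy classifier, so the similarity measure, the threshold, and the function names here are illustrative stand-ins.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def identity_mismatch(face_embedding, context_embedding, min_similarity=0.5):
    """Return True when the face and its context appear to disagree.

    Both embeddings are assumed to come from upstream models that are
    not shown here; `min_similarity` is an illustrative threshold.
    """
    return cosine_similarity(face_embedding, context_embedding) < min_similarity
```

Because the check depends only on whether the two identity signals agree, it does not rely on the artifacts of any particular generator.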
At the University of California, Berkeley, Farid is pioneering a detection method that moves even further away from the focus on specific artifacts. Instead of looking for unrealistic signals, Farid and his students flipped the task around and designed a tool that studies actual, verified video footage of a person. The group's solution hunts for correlations between 780 different facial, gestural, and vocal features within that footage to build a better model of a particular person and that subject's facial, speech, and gestural patterns. Turning your head while you are speaking, for example, will change your vocal tract and generate slight changes in the sound of your voice, and the model identifies such links. As for Zelenskyy, among other things, he has a specific kind of asymmetry to his smile and certain habits of moving his arms as he speaks.
The researchers aggregate all these observations and correlations to create a model or classifier of the famous person, such as Zelenskyy. The accuracy of the classifier increases as more correlations are incorporated, reaching a 100% success rate when the group factored in all 780. When the classifier studies a video, and multiple features fall outside the model, then the technology concludes the sample is not actually the subject. "In some ways, we're not building a deepfake detector," Farid explains. "We're building a Zelenskyy detector."
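A toy version of the person-specific idea might look like the following. It is a sketch under stated assumptions, not the Berkeley group's classifier, whose code has not been released: each clip is assumed to arrive as a dictionary of hypothetical per-frame feature tracks, the "model" is just the range of each pairwise correlation seen in authentic clips plus a margin, and a clip is rejected when more than one correlation falls outside its range.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if flat)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_profile(clip):
    """Map each pair of feature names to the correlation of their tracks."""
    names = sorted(clip)
    return {(a, b): pearson(clip[a], clip[b]) for a, b in combinations(names, 2)}


def fit_person_model(authentic_clips, margin=0.1):
    """Record the range of every pairwise correlation in verified clips."""
    profiles = [correlation_profile(clip) for clip in authentic_clips]
    model = {}
    for pair in profiles[0]:
        values = [profile[pair] for profile in profiles]
        model[pair] = (min(values) - margin, max(values) + margin)
    return model


def is_consistent_with_person(model, clip, max_outliers=1):
    """True unless more than `max_outliers` correlations leave their range."""
    profile = correlation_profile(clip)
    outliers = sum(
        1 for pair, (low, high) in model.items()
        if not low <= profile[pair] <= high
    )
    return outliers <= max_outliers
```

The design choice this illustrates is the one Farid describes: the model is fit only to verified footage of one person, so it never needs to see an example of a fake.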
Farid recognizes the synthesis engines are constantly improving; his group is not publicly releasing the code behind its classifier in hopes of slowing that evolution. Currently they are expanding their database and creating detectors for more world leaders.
As the deepfake generators improve, and the line between real and synthetic media becomes increasingly difficult to discern, developing new means of quickly detecting them only becomes more important. "Getting that balance right and making sure people rely on and trust the things they should and distrust the things they should not, that is a difficult but critical thing to do," explains Nightingale, the psychologist and deepfake researcher. "Otherwise, we could end up in a scenario where we do not trust anything."
Goodfellow, I. et al.
"Generative adversarial networks," Communications, Volume 63, Issue 11, November 2020.
Nightingale, S. and Farid, H.
"AI-synthesized faces are indistinguishable from real faces and more trustworthy," PNAS, February 14, 2022.
Boháček, M. and Farid, H.
"Protecting world leaders against deep fakes using facial, gestural, and vocal mannerisms," PNAS, November 23, 2022.
Nirkin, Y. et al.
"Deepfake Detection Based on Discrepancies Between Faces and Their Context," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, October 2022.
Li, Y., Chang, M., and Lyu, S.
"In Ictu Oculi: Exposing AI Created Fake Videos by Detecting Eye Blinking," IEEE Workshop on Information Forensics and Security (WIFS), Hong Kong, 2018.
©2023 ACM 0001-0782/23/7