Matteo Wong
It was 2018, and the world as we knew it—or rather, how we knew it—teetered on a precipice. Against a rising drone of misinformation, The New York Times, the BBC, Good Morning America, and just about everyone else sounded the alarm over a new strain of fake but highly realistic videos. Using artificial intelligence, bad actors could manipulate someone’s voice and face in recorded footage almost like a virtual puppet and pass the product off as real. In a famous example engineered by BuzzFeed, Barack Obama seemed to say, “President Trump is a total and complete dipshit.” Synthetic photos, audio, and videos, collectively dubbed “deepfakes,” threatened to destabilize society and push us into a full-blown “infocalypse.”
More than four years later, despite a growing trickle of synthetic videos, the deepfake doomsday hasn’t quite materialized. Deepfakes’ harms have certainly been seen in the realm of pornography—where individuals have had their likeness used without their consent—but there’s been “nothing like what people have been really fearing, which is the incriminating, hyperrealistic deepfake of a presidential candidate saying something which swings major voting centers,” says Henry Ajder, an expert on synthetic media and AI. Compared with 2018’s disaster scenarios, which predicted outcomes such as the North Korean leader Kim Jong-un declaring nuclear war, “the state we’re at is nowhere near that,” says Sam Gregory, who studies deepfakes and directs the human-rights nonprofit Witness.
But those terrifying predictions may have just been early. The field of artificial intelligence has advanced rapidly since the 2018 deepfake panic, and synthetic media is once again the center of attention. The technology buzzword of 2022 is generative AI: models that seem to display humanlike creativity, turning text prompts into astounding images or commanding English at the level of a mediocre undergraduate. These and other advances have experts concerned that a deepfake apocalypse is still very much on the horizon. Fake video and audio might once again be poised to corrupt the most basic ways in which people process reality—or what’s left of it.
So far, deepfakes have been limited by two factors baked into their name: deep learning and fake news. The technology is complex enough—and simpler forms of disinformation are spread so easily—that synthetic media hasn’t seen widespread use.
Deep learning is an approach to AI that simulates the brain through an algorithm made up of many layers (hence, “deep”) of artificial neurons. Many of the deepfakes that sparked fear in 2018 were products of “generative adversarial networks,” which consist of two deep-learning algorithms: a generator and a discriminator. Trained on huge amounts of data—perhaps tens of thousands of human faces—the generator synthesizes an image, and the discriminator tries to tell whether it is real or fake. Based on the discriminator’s feedback, the generator “teaches” itself to produce more realistic faces, and the two continue to improve in an adversarial loop. First developed in 2014, GANs could soon produce uncannily realistic images, audio, and videos.
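To make that adversarial loop concrete, here is a minimal sketch of a GAN training step, assuming PyTorch and toy fully connected networks; the image size, architectures, and hyperparameters are illustrative placeholders, not those of any real deepfake system.

```python
# A minimal, illustrative GAN training loop (not a production face-synthesis pipeline).
import torch
import torch.nn as nn

IMG_DIM = 64 * 64          # flattened grayscale image, for simplicity
NOISE_DIM = 100

generator = nn.Sequential(            # maps random noise -> fake image
    nn.Linear(NOISE_DIM, 256), nn.ReLU(),
    nn.Linear(256, IMG_DIM), nn.Tanh(),
)
discriminator = nn.Sequential(        # maps image -> probability it is real
    nn.Linear(IMG_DIM, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCELoss()

def train_step(real_images: torch.Tensor) -> None:
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # 1) Discriminator: learn to tell real images from the generator's fakes.
    noise = torch.randn(batch, NOISE_DIM)
    fakes = generator(noise).detach()          # don't update the generator here
    d_loss = loss_fn(discriminator(real_images), real_labels) + \
             loss_fn(discriminator(fakes), fake_labels)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Generator: learn to produce fakes the discriminator labels "real".
    noise = torch.randn(batch, NOISE_DIM)
    g_loss = loss_fn(discriminator(generator(noise)), real_labels)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# Usage sketch: feed batches of real, flattened, normalized face images.
# for real_batch in dataloader:
#     train_step(real_batch.view(real_batch.size(0), -1))
```

Each call to `train_step` plays one round of the game the article describes: the discriminator sharpens its eye, and the generator adjusts to fool it.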
Yet by the 2018 and 2020 elections, and even the most recent midterms, deepfake technology still wasn’t realistic or accessible enough to be weaponized for political disinformation. Fabricating a decent synthetic video isn’t a “plug and play” process like commanding Lensa to generate artistic selfies or messing around in Photoshop, explains Hany Farid, a computer-science professor at UC Berkeley. Rather, it requires at least some knowledge of machine learning. GAN-generated images also have consistent tells, such as distortion around wisps of hair or earrings, misshapen pupils, and strange backgrounds. A high-quality product that will “fool a lot more people for a longer time … requires manual processing,” says Siwei Lyu, a deepfake expert at the University at Buffalo. “The human operator has to get involved in every aspect,” he told me: curating data, tweaking the model, cleaning up the computer’s errors by hand.
Those barriers mean deep learning certainly isn’t the most cost-effective way to spread fake news. Tucker Carlson and Marjorie Taylor Greene can just go on the air and lie to great effect; New York State recently elected a Republican representative whose storybook biography may be largely fiction; sporadic, cryptic text was enough for QAnon conspiracies to consume the nation; Facebook posts were more than sufficient for Russian troll farms. In terms of visual media, slowing down footage of Nancy Pelosi or mislabeling old war videos as having been shot in Ukraine already breeds plenty of confusion. “It’s far more effective to use a cruder form of media manipulation, which can be done quickly and by less sophisticated actors,” Ajder told me, “than to release an expensive, hard-to-create deepfake, which actually isn’t going to be as good a quality as you had hoped.”
Even if someone has the skills and resources to fabricate a persuasive video, the targets with the greatest discord-sowing potential, such as world leaders and high-profile activists, also have the greatest defenses. Software engineers, governments, and journalists work to verify footage of those people, says Renée DiResta, a disinformation expert and the research manager at the Stanford Internet Observatory. That has proved true for fabricated videos of Ukrainian President Volodymyr Zelensky and Russian President Vladimir Putin during the ongoing invasion; in one video, Zelensky appeared to surrender, but his oversize head and peculiar accent quickly got the clip removed from Facebook and YouTube. “Is doing the work of creating a plausible, convincing deepfake video something they need to do, or are there easier, less detectable mechanisms at their disposal?” DiResta posed to me. The pandemic is yet another misinformation hot spot that illustrates these constraints: A 2020 study of COVID-19 misinformation found some evidence of photos and videos doctored with simple techniques—such as an image edited to show a train transporting virus-filled tanks labeled covid-19—but no AI-based manipulations.
That’s not to diminish concerns about synthetic media and disinformation. In fact, widespread anxiety has likely slowed the rise of deepfakes. “Before the alarm was raised on these issues, you had no policies by social-media companies to address this,” says Aviv Ovadya, an internet-platform and AI expert who is a prominent voice on the dangers of synthetic media. “Now you have policies and a variety of actions they take to limit the impact of malicious deepfakes”—content moderation, human and software detection methods, a wary public.
But awareness has also created an environment in which politicians can more credibly dismiss legitimate evidence as forged. Donald Trump has reportedly claimed that the infamous Access Hollywood tape was fake; a GOP candidate once promoted a conspiracy theory that the video of police murdering George Floyd was a deepfake. The law professors Danielle Citron and Robert Chesney call this the “liar’s dividend”: Awareness of synthetic media breeds skepticism of all media, which benefits liars who can brush off accusations or disparage opponents with cries of “fake news.” Those lies then become part of the sometimes deafening noise of miscontextualized media, scientific and political disinformation, and denials by powerful figures, as well as a broader crumbling of trust in more or less everything.
All of this might change in the next few years as AI-generated media becomes more advanced. Every expert I spoke with said it’s a matter of when, not if, we reach a deepfake inflection point, after which forged videos and audio spreading false information will flood the internet. The timeline is “years, not decades,” Farid told me. According to Ovadya, “it’s probably less than five years” until we can type a prompt into a program and, by giving the computer feedback—make the hair blow this way, add some audio, tweak the background—create “deeply compelling content.” Lyu, too, puts five years as the upper limit to the emergence of widely accessible software for creating highly credible deepfakes.
Celebrity deepfakes are already popping up in advertisements; more and more synthetic videos and audio are being used for financial fraud; deepfake propaganda campaigns have been used to attack Palestinian-rights activists. This summer, a deepfake of the mayor of Kyiv briefly tricked the mayors of several European capitals during a video call.
And various forms of deepfake-lite technology exist all over the internet, including TikTok and Snapchat features that perform face swaps—replacing one person’s face with another’s in a video—similar to the infamous 2018 BuzzFeed deepfake that superimposed Obama’s face onto that of the filmmaker Jordan Peele. There are also easy-to-use programs such as Reface and DeepFaceLab whose explicit purpose is to produce decent-quality deepfakes. Revenge pornography has not abated. And some fear that TikTok, which is designed to create viral videos—and which is a growing source of news for American teenagers and adults—is especially susceptible to manipulated videos.
One of the biggest concerns is a new generation of powerful text-to-image software that greatly lowers the barrier to fabricating videos and other media. Generative-AI models of the sort that power DALL-E use a “diffusion” architecture, rather than GANs, to create complex imagery with a fraction of the effort. Fed hundreds of millions of captioned images, a diffusion-based model trains by adding random noise to an image until it looks like static and then learning to reverse that corruption, in the process “learning” to associate words and visual concepts. Where GANs must be trained for a specific type of image (say, a face in profile), text-to-image models can generate a wide range of images with complex interactions (two political leaders in conversation, for example). “You can now generate faces that are far more dynamic and realistic and customizable,” Ajder said. And many detection methods geared toward existing deepfakes won’t work on diffusion models.
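For readers curious how that corrupt-and-reverse training works mechanically, here is a toy sketch of a diffusion-style training step, again assuming PyTorch; the tiny stand-in “denoiser” network and the omission of text conditioning are deliberate simplifications, not a faithful rendering of DALL-E or any production model.

```python
# A toy sketch of the diffusion idea: corrupt training images with noise,
# then train a network to predict (and thus undo) that corruption.
import torch
import torch.nn as nn

T = 1000                                   # number of noise steps
betas = torch.linspace(1e-4, 0.02, T)      # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(                  # placeholder for a large U-Net
    nn.Linear(64 * 64 + 1, 256), nn.ReLU(),
    nn.Linear(256, 64 * 64),
)
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

def train_step(images: torch.Tensor) -> torch.Tensor:
    """One denoising-objective step on a batch of flattened images."""
    batch = images.size(0)
    t = torch.randint(0, T, (batch,))                   # random timestep per image
    noise = torch.randn_like(images)
    a = alphas_bar[t].unsqueeze(1)                      # how much signal survives at step t
    noisy = a.sqrt() * images + (1 - a).sqrt() * noise  # forward "corruption"

    # The network sees the noisy image plus the timestep and must predict
    # the noise that was added -- i.e., learn to reverse the corruption.
    inp = torch.cat([noisy, t.float().unsqueeze(1) / T], dim=1)
    loss = ((denoiser(inp) - noise) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss
```

A real text-to-image system conditions this denoising process on a caption embedding, which is how a sentence like “two political leaders in conversation” steers the image that emerges from the static.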
The possibilities for deepfake propaganda are as dystopian now as they were a few years ago. At the largest scale, one can imagine fake videos of gruesome pregnancy terminations, like the saline-abortion images already used by anti-abortion activists; convincing, manipulated political speeches to feed global conspiracy theories; disparaging forgeries used against enemy nations during war—or even synthetic media that triggers conflict. Countries with fewer computer resources and talent or a less robust press will struggle even more, Gregory told me: “All of these problems are far worse when you look at Pakistan, Myanmar, Nigeria, a local news outlet in the U.S., rather than, say, The Washington Post.” And as deepfake technology improves to work with less training data, fabrications of lower-profile journalists, executives, government officials, and others could wreak havoc such that people think “there’s no new evidence coming in; there’s no new way to reason about the world,” Farid said.
Yet when deceit and propaganda feel like the air we breathe, deepfakes are at once potentially game-changing and just more of the same. In October, Gallup reported that only 34 percent of Americans trust newspapers, TV, and radio to report news fairly and accurately, and 38 percent have absolutely no confidence in mass media. Earlier this year, a Pew Research Center survey across 19 countries found that 70 percent of people think “the spread of false information online” is a major threat to their country, ranking just second behind climate change. “Deepfakes are really an evolution of existing problems,” Gregory said. He worries that focusing too heavily on sophisticated synthetic media might distract from efforts to mitigate the spread of “shallow fakes,” such as relabeled photographs and slightly doctored footage; DiResta is more concerned about text-based disinformation, which has been wreaking havoc for years, is easily generated using programs such as ChatGPT, and, unlike video or audio, has no obvious technical glitches.
The limited empirical research on the persuasiveness of synthetic video and audio is mixed. Although a few studies suggest that video and audio are a bit more convincing than text, others have found no appreciable difference; some have even found that people are better at detecting fabricated political speeches when presented with video or audio than with a transcript alone. Still, Ajder cautioned that “the deepfakes I’ve seen being used in these trials aren’t quite there; they still are on the cusp of uncanniness,” and that it’s difficult to replicate the conditions of social media—such as amplification and echo chambers—in a lab. Of course, those are the very conditions that have enabled an epistemic corrosion that will continue to advance with or without synthetic media.
Regardless of how a proliferation of deepfakes might worsen our information ecosystem—whether by adding to existing uncertainty or fundamentally changing it—experts, journalists, and internet companies are trying to prepare for it. The European Union and China have both passed regulations meant to target deepfakes by mandating that tech companies take action against them. Companies could implement guardrails to stop their technology from being misused; Adobe has gone as far as to never publicly release its deepfake-audio software, Voco.
There is still time to prevent or limit the most catastrophic deepfake scenarios. Many people favor building a robust authentication infrastructure: a log attached to every piece of media that the public can use to check where a photo or video comes from and how it has been edited. This would protect against both shallow- and deepfake propaganda, as well as the liar’s dividend. The Coalition for Content Provenance and Authenticity, led by Adobe, Microsoft, Intel, the BBC, and several other stakeholders, has designed such a standard—although until that protocol achieves widespread adoption, it is most useful for honest actors seeking to prove their integrity.
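To illustrate the general idea of an edit log, and emphatically not the C2PA specification itself, a hash-chained provenance record might look something like the following hypothetical sketch: each edit appends an entry, and anyone holding the log can check that the history is intact and matches the file in hand.

```python
# A toy illustration of media provenance (NOT the actual C2PA manifest format):
# each edit appends a record to a hash-chained log, so tampering breaks the chain.
import hashlib
import json
import time

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def append_record(log: list, media_bytes: bytes, action: str) -> list:
    prev_hash = sha256(json.dumps(log[-1], sort_keys=True).encode()) if log else "genesis"
    log.append({
        "timestamp": time.time(),
        "action": action,                  # e.g. "captured", "cropped", "color-corrected"
        "media_hash": sha256(media_bytes), # fingerprint of the file at this step
        "prev_record_hash": prev_hash,     # chains this record to the one before it
    })
    return log

def verify(log: list, media_bytes: bytes) -> bool:
    """Check that the chain is intact and the final hash matches the file we have."""
    prev = "genesis"
    for record in log:
        if record["prev_record_hash"] != prev:
            return False
        prev = sha256(json.dumps(record, sort_keys=True).encode())
    return bool(log) and log[-1]["media_hash"] == sha256(media_bytes)
```

The real standard layers cryptographic signatures, hardware attestation, and editing-tool support on top of this basic chain-of-custody idea, which is why adoption by cameras, software, and platforms matters so much.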
Once a deepfake is in circulation, detection is only the first of many hurdles for its debunking. Computers are far better than humans at distinguishing real and fake videos, Lyu told me, but they aren’t always accurate. Automated content moderation is infamously hard, especially for video, and even an optimistic 90 percent success rate could still leave tens or hundreds of thousands of the most pernicious clips online. That software should be made widely available to journalists, who also have to be trained to interpret the results, Gregory said. But even given a high-quality detection algorithm that is both accessible and usable, convincing the public to trust the algorithm, experts, and journalists exposing fabricated media might prove near impossible. In a world saturated with propaganda and uncertainty that long ago pushed us over the edge into what Ovadya calls “reality apathy,” any solution will first need to restore people’s willingness to climb their way out.
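The arithmetic behind that caveat is simple. With a hypothetical, round-number upload volume (the figure below is illustrative, not a platform statistic), an optimistic 90 percent detection rate still lets an enormous amount through:

```python
# Back-of-the-envelope numbers behind the "90 percent success rate" point.
fake_clips_uploaded = 1_000_000     # hypothetical fakes uploaded in some period
detection_rate = 0.90               # optimistic classifier accuracy on fakes

missed = fake_clips_uploaded * (1 - detection_rate)
print(f"Fakes slipping past the detector: {missed:,.0f}")   # 100,000
```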