What You Need to Know About Audio Deepfakes
The recent circulation of deceptive robocalls in Massachusetts exemplifies the detrimental impact of such technology. We also recently spoke with the spoke with The Boston Globe about this technology, and how easy and inexpensive it is to generate such deepfake audios.
Anyone without a significant technical background can easily generate such audio, with multiple available tools online. Such fake news from deepfake generators can disturb financial markets and even electoral outcomes. The theft of one’s voice to access voice-operated bank accounts and the unauthorized utilization of one’s vocal identity for financial gain are reminders of the urgent need for robust countermeasures. Further risks may include privacy violation, where an attacker can utilize the victim’s audio without their permission or consent. Further, attackers can also alter the content of the original audio, which can have a serious impact.
Two primary and prominent directions have emerged in designing systems to detect fake audio: artifact detection and liveness detection. When audio is generated by a generative model, the model introduces some artifact in the generated signal. Researchers design algorithms/models to detect these artifacts. However, there are some challenges with this approach due to increasing sophistication of audio deepfake generators. In the future, we may also see models with very small or almost no artifacts. Liveness detection, on the other hand, leverages the inherent qualities of natural speech, such as breathing patterns, intonations, or rhythms, which are challenging for AI models to replicate accurately. Some companies like Pindrop are developing such solutions for detecting audio fakes.
Additionally, strategies like audio watermarking serve as proactive defenses, embedding encrypted identifiers within the original audio to trace its origin and deter tampering. Despite other potential vulnerabilities, such as the risk of replay attacks, ongoing research and development in this arena offer promising solutions to mitigate the threats posed by audio deepfakes.
Q: Despite their potential for misuse, what are some positive aspects and benefits of audio deepfake technology? How do you imagine the future relationship between AI and our experiences of audio perception will evolve?
A: Contrary to the predominant focus on the nefarious applications of audio deepfakes, the technology harbors immense potential for positive impact across various sectors. Beyond the realm of creativity, where voice conversion technologies enable unprecedented flexibility in entertainment and media, audio deepfakes hold transformative promise in health care and education sectors. My current ongoing work in the anonymization of patient and doctor voices in cognitive health-care interviews, for instance, facilitates the sharing of crucial medical data for research globally while ensuring privacy. Sharing this data among researchers fosters development in the areas of cognitive health care. The application of this technology in voice restoration represents a hope for individuals with speech impairments, for example, for ALS or dysarthric speech, enhancing communication abilities and quality of life.
I am very positive about the future impact of audio generative AI models. The future interplay between AI and audio perception is poised for groundbreaking advancements, particularly through the lens of psychoacoustics — the study of how humans perceive sounds. Innovations in augmented and virtual reality, exemplified by devices like the Apple Vision Pro and others, are pushing the boundaries of audio experiences towards unparalleled realism. Recently we have seen an exponential increase in the number of sophisticated models coming up almost every month. This rapid pace of research and development in this field promises not only to refine these technologies but also to expand their applications in ways that profoundly benefit society. Despite the inherent risks, the potential for audio generative AI models to revolutionize health care, entertainment, education, and beyond is a testament to the positive trajectory of this research field.
Rachel Gordon is Communications Manager at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL). This story is reprinted with permission of MIT News.