An audio deepfake (also known as voice cloning or deepfake audio) is a product of artificial intelligence [1] used to create convincing speech that sounds like specific people saying things they never said. [2] [3] [4] This technology was initially developed for applications intended to improve human life. For example, it can be used to produce audiobooks [5] and to help people who have lost their voices (due to throat disease or other medical problems) regain them. [6] [7] Commercially, it has opened the door to several opportunities. The technology can also create more personalized digital assistants, natural-sounding text-to-speech, and speech translation services.
Audio deepfakes, referred to as audio manipulations beginning in the early 2020s, are becoming widely accessible using simple mobile devices or personal computers. [8] These tools have also been used to spread misinformation using audio. [3] This has led to cybersecurity concerns among the global public about the side effects of audio deepfakes, including their possible role in disseminating misinformation and disinformation on audio-based social media platforms. [9] Audio deepfakes can serve as a logical-access voice spoofing technique [10] and can be used to manipulate public opinion for propaganda, defamation, or terrorism. Vast amounts of voice recordings are transmitted over the Internet every day, and detecting spoofing among them is challenging. [11] Audio deepfake attackers have targeted individuals and organizations, including politicians and governments. [12]
In 2019, scammers using AI impersonated the voice of the CEO of a German energy company and directed the CEO of its UK subsidiary to transfer €220,000. [13] In early 2020, the same technique was used to impersonate a company director as part of an elaborate scheme that convinced a branch manager to transfer $35 million. [14]
According to a 2023 global McAfee survey, one person in ten reported having been targeted by an AI voice cloning scam; 77% of these targets reported losing money to the scam. [15] [16] Audio deepfakes could also pose a danger to voice ID systems currently used by financial institutions. [17] [18] In March 2023, the United States Federal Trade Commission issued a warning to consumers about the use of AI to fake the voice of a family member in distress asking for money. [19]
In October 2023, during the start of the British Labour Party's conference in Liverpool, an audio deepfake of Labour leader Keir Starmer was released that falsely portrayed him verbally abusing his staffers and criticizing Liverpool. [20] That same month, an audio deepfake of Slovak politician Michal Šimečka falsely claimed to capture him discussing ways to rig the upcoming election. [21]
During the campaign for the 2024 New Hampshire Democratic presidential primary, over 20,000 voters received robocalls from an AI-impersonated President Joe Biden urging them not to vote. [22] [23] The New Hampshire attorney general said this violated state election laws, and alleged involvement by Life Corporation and Lingo Telecom. [24] In February 2024, the United States Federal Communications Commission banned the use of AI to fake voices in robocalls. [25] [26] That same month, political consultant Steve Kramer admitted that he had commissioned the calls for $500, saying he wanted to call attention to the need for rules governing the use of AI in political campaigns. [27] In May, the FCC said that Kramer had violated federal law by spoofing the number of a local political figure, and proposed a fine of $6 million. Four New Hampshire counties indicted Kramer on felony counts of voter suppression and a misdemeanor count of impersonating a candidate. [28]
Audio deepfakes can be divided into three different categories:
Replay-based deepfakes are malicious works that aim to reproduce a recording of the interlocutor's voice. [29]
There are two types: far-field detection and cut-and-paste detection. In far-field detection, a microphone recording of the victim is played as a test segment on a hands-free phone. [30] Cut-and-paste, on the other hand, involves faking the requested sentence of a text-dependent system. [11] Text-dependent speaker verification can be used to defend against replay-based attacks. [29] [31] Deep convolutional neural networks have been used to detect replay attacks end-to-end. [32]
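As an illustration of the kind of input such convolutional detectors typically consume, the following sketch (not taken from any cited system; all parameter values are illustrative) frames an audio signal and computes a log-magnitude spectrogram, the two-dimensional representation a CNN would then classify:

```python
import numpy as np

def log_spectrogram(signal, frame_len=512, hop=256):
    """Split audio into overlapping windowed frames and compute a
    log-magnitude spectrogram, a typical 2-D input for a CNN detector."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + 1e-8)

# One second of toy audio at 16 kHz.
audio = np.random.randn(16000)
spec = log_spectrogram(audio)
print(spec.shape)  # (61, 257): 61 frames x 257 frequency bins
```

A real detector would feed such a matrix to a trained network; the point here is only the audio-to-image transformation step.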
The speech synthesis category refers to the artificial production of human speech using software or hardware programs. Speech synthesis includes Text-To-Speech, which aims to transform text into acceptable, natural-sounding speech in real time, [33] making the speech match the text input according to the rules of its linguistic description.
A classical system of this type consists of three modules: a text analysis model, an acoustic model, and a vocoder. Generation usually follows two essential steps. First, clean and well-structured raw audio must be collected together with the transcribed text of the original speech. Second, the Text-To-Speech model must be trained on these data to build a synthetic audio generation model.
Specifically, the transcribed text with the target speaker's voice is the input of the generation model. The text analysis module processes the input text and converts it into linguistic features. The acoustic module then extracts the parameters of the target speaker from the audio data based on the linguistic features generated by the text analysis module. [8] Finally, the vocoder learns to create vocal waveforms from the acoustic feature parameters. The final audio file, containing the synthetic audio in waveform format, can render speech in the voices of many speakers, even those not seen during training.
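The three-module pipeline described above can be sketched schematically. All three functions below are hypothetical placeholders meant only to show the data flow from text to waveform, not a working synthesis system:

```python
import numpy as np

def text_analysis(text):
    """Convert raw text into a sequence of linguistic features
    (placeholder: one integer ID per character)."""
    return np.array([ord(c) for c in text.lower()])

def acoustic_model(linguistic_features, n_mels=80):
    """Map linguistic features to acoustic parameters, e.g. one
    mel-spectrogram frame per input symbol (placeholder values)."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(linguistic_features), n_mels))

def vocoder(acoustic_frames, samples_per_frame=256):
    """Turn acoustic frames into a raw waveform (placeholder upsampling)."""
    return np.repeat(acoustic_frames.mean(axis=1), samples_per_frame)

waveform = vocoder(acoustic_model(text_analysis("hello world")))
print(waveform.shape)  # (2816,): 11 symbols x 256 samples per frame
```

In a real system, each placeholder is a learned model (e.g. a neural acoustic model and a neural vocoder), but the interfaces between the stages follow this same shape.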
The first breakthrough in this regard was introduced by WaveNet, [34] a neural network for generating raw audio waveforms capable of emulating the characteristics of many different speakers. This network has been overtaken over the years by other systems [35] [36] [37] [38] [39] [40] which synthesize highly realistic artificial voices within everyone’s reach. [41]
Text-To-Speech is highly dependent on the quality of the voice corpus used to build the system, and creating an entire voice corpus is expensive.[ citation needed ] Another disadvantage is that speech synthesis systems do not recognize periods or special characters. Ambiguity problems also persist, as two words written in the same way can have different meanings.[ citation needed ]
An imitation-based audio deepfake transforms an original speech from one speaker (the original) so that it sounds as if spoken by another speaker (the target). [42] An imitation-based algorithm takes a spoken signal as input and alters its style, intonation, or prosody, trying to mimic the target voice without changing the linguistic information. [43] This technique is also known as voice conversion.
This method is often confused with the synthesis-based method described above, as there is no clear separation between the two approaches regarding the generation process. Both methods modify the acoustic-spectral and style characteristics of the speech audio signal, but imitation-based methods usually keep the input and output text unaltered. This is achieved by changing how the sentence is spoken to match the target speaker's characteristics. [44]
Voices can be imitated in several ways, such as by using humans with similar voices who can mimic the original speaker. In recent years, the most popular approach has involved particular neural networks called generative adversarial networks (GANs), due to their flexibility and high-quality results. [29] [42]
The original audio signal is then transformed into speech in the target voice using an imitation generation method that produces the new, fake speech.
The audio deepfake detection task determines whether the given speech audio is real or fake.
Recently, this has become a prominent topic in the forensic research community, which is trying to keep up with the rapid evolution of counterfeiting techniques.
In general, deepfake detection methods can be divided into two categories based on the aspect they leverage to perform the detection task. The first focuses on low-level aspects, looking for artifacts introduced by the generators at the sample level. The second focuses on higher-level features representing more complex aspects, such as the semantic content of the speech audio recording.
Many machine learning and deep learning models have been developed using different strategies to detect fake audio. Most of the time, these algorithms follow a three-step procedure: feature extraction, model training, and classification.
Over the years, many researchers have shown that machine learning approaches are more accurate than deep learning methods, regardless of the features used. [8] However, the scalability of machine learning methods remains unconfirmed because of their lengthy training and manual feature extraction, especially with many audio files. When deep learning algorithms are used instead, specific transformations are required on the audio files to ensure the algorithms can handle them.
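The extraction/training/classification pattern can be illustrated with a deliberately minimal toy example: hand-crafted spectral features and a nearest-centroid model. All data and thresholds here are invented for illustration; real detectors use far richer features and models:

```python
import numpy as np

def extract_features(signal, sr=16000):
    """Step 1 -- feature extraction: spectral centroid and RMS energy."""
    magnitude = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1 / sr)
    centroid = float((freqs * magnitude).sum() / (magnitude.sum() + 1e-8))
    rms = float(np.sqrt((signal ** 2).mean()))
    return np.array([centroid, rms])

def fit(real_clips, fake_clips):
    """Step 2 -- model training: mean feature vector per class."""
    real_c = np.mean([extract_features(c) for c in real_clips], axis=0)
    fake_c = np.mean([extract_features(c) for c in fake_clips], axis=0)
    return real_c, fake_c

def classify(clip, real_c, fake_c):
    """Step 3 -- classification: assign the nearer class centroid."""
    f = extract_features(clip)
    return "real" if (np.linalg.norm(f - real_c)
                      < np.linalg.norm(f - fake_c)) else "fake"

# Toy data: 'real' clips are low-pitched tones, 'fake' clips high-pitched.
t = np.arange(16000) / 16000
real = [np.sin(2 * np.pi * 200 * t), np.sin(2 * np.pi * 300 * t)]
fake = [np.sin(2 * np.pi * 4000 * t), np.sin(2 * np.pi * 5000 * t)]
real_c, fake_c = fit(real, fake)
print(classify(np.sin(2 * np.pi * 250 * t), real_c, fake_c))  # real
```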
There are several open-source implementations of different detection methods, [46] [47] [48] and research groups often release them on public hosting services such as GitHub.
Audio deepfakes are a very recent field of research. For this reason, there are many possibilities for development and improvement, as well as possible threats that adopting this technology can bring to daily life. The most important ones are listed below.
Regarding generation, the most significant aspect is how credible the result sounds to the victim, i.e., the perceptual quality of the audio deepfake.
Several metrics determine the level of accuracy of audio deepfake generation, and the most widely used is the MOS (Mean Opinion Score), which is the arithmetic average of user ratings. Usually, the test to be rated involves perceptual evaluation of sentences made by different speech generation algorithms. This index showed that audio generated by algorithms trained on a single speaker has a higher MOS. [44] [34] [49] [50] [39]
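Since the MOS is simply the arithmetic mean of listeners' quality ratings (conventionally on a 1 to 5 scale), it can be computed as:

```python
def mean_opinion_score(ratings):
    """MOS: the arithmetic mean of listeners' quality ratings (1-5 scale)."""
    return sum(ratings) / len(ratings)

# Five listeners rate one synthesized sentence.
print(mean_opinion_score([4, 5, 3, 4, 5]))  # 4.2
```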
The sampling rate also plays an essential role in detecting and generating audio deepfakes. Currently, available datasets have a sampling rate of around 16 kHz, significantly reducing speech quality. An increase in the sampling rate could lead to higher quality generation. [37]
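An illustrative sketch (with invented signal values) of why a low sampling rate limits quality: a 16 kHz rate can only represent frequencies up to the 8 kHz Nyquist limit, so a 10 kHz tone naively downsampled from 48 kHz aliases to 6 kHz:

```python
import numpy as np

fs_hi, fs_lo, tone = 48_000, 16_000, 10_000
t = np.arange(fs_hi) / fs_hi                  # one second at 48 kHz
x_hi = np.sin(2 * np.pi * tone * t)           # 10 kHz tone
x_lo = x_hi[:: fs_hi // fs_lo]                # naive 3:1 decimation to 16 kHz

# With 16000 samples at 16 kHz, each rfft bin is exactly 1 Hz wide.
spectrum = np.abs(np.fft.rfft(x_lo))
peak_hz = np.argmax(spectrum) * fs_lo / len(x_lo)
print(peak_hz)  # 6000.0 -- the 10 kHz tone aliased below the 8 kHz Nyquist limit
```

Proper resampling applies an anti-aliasing filter first, which removes such content entirely; either way, information above 8 kHz cannot survive at a 16 kHz rate.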
Focusing on the detection part, one principal weakness affecting recent models is the adopted language.
Most studies focus on detecting audio deepfakes in English, paying little attention to other widely spoken languages such as Chinese and Spanish, [51] as well as Hindi and Arabic.
It is also essential to consider factors related to accents, which represent a way of pronunciation strictly associated with a particular individual, location, or nation. In other audio fields, such as speaker recognition, accent has been found to influence performance significantly, [52] so it is expected to affect models' performance in this detection task as well.
In addition, the excessive preprocessing of audio data leads to very high and often unsustainable computational costs. For this reason, many researchers have suggested following a self-supervised learning approach, [53] using unlabeled data to work effectively in detection tasks, improving the models' scalability and, at the same time, decreasing the computational cost.
Training and testing models with real audio data is still an underdeveloped area. Indeed, using audio with real-world background noises can increase the robustness of the fake audio detection models.
In addition, most effort focuses on detecting synthesis-based audio deepfakes; few studies analyze imitation-based ones because of the intrinsic difficulty of their generation process. [11]
Over the years, there has been an increase in techniques aimed at defending against the malicious actions that audio deepfakes could enable, such as identity theft and the manipulation of speeches by government leaders.
To prevent deepfakes, some suggest using blockchain and other distributed ledger technologies (DLT) to identify the provenance of data and track information. [8] [54] [55] [56]
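As a rough sketch of the hash-chaining idea underlying such provenance proposals (not a real blockchain and not any cited system; filenames and metadata are invented), each record can commit to a media file's digest and to the previous record, so any later tampering is evident:

```python
import hashlib
import json

def make_record(prev_hash, media_digest, metadata):
    """One provenance record: commits to the media digest, the metadata,
    and the hash of the previous record, forming a tamper-evident chain."""
    record = {"prev": prev_hash, "media": media_digest, "meta": metadata}
    payload = json.dumps(record, sort_keys=True).encode()
    record["hash"] = hashlib.sha256(payload).hexdigest()
    return record

genesis = make_record("0" * 64,
                      hashlib.sha256(b"original.wav bytes").hexdigest(),
                      {"creator": "studio", "step": "recorded"})
edited = make_record(genesis["hash"],
                     hashlib.sha256(b"edited.wav bytes").hexdigest(),
                     {"creator": "studio", "step": "trimmed"})

# Altering any earlier record changes its hash and breaks the link.
print(edited["prev"] == genesis["hash"])  # True
```

A distributed ledger adds replication and consensus on top of this chaining, so no single party can rewrite the history.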
Extracting and comparing affective cues corresponding to perceived emotions from digital content has also been proposed to combat deepfakes. [57] [58] [59]
Another critical aspect concerns the mitigation of this problem. It has been suggested that it would be better to keep some proprietary detection tools only for those who need them, such as fact-checkers for journalists. [29] That way, those who create the generation models, perhaps for nefarious purposes, would not know precisely what features facilitate the detection of a deepfake, [29] discouraging possible attackers.
To improve detection, researchers are trying to generalize the process, [60] looking for preprocessing techniques that improve performance and testing different loss functions for training. [10] [61]
Numerous research groups worldwide are working to recognize media manipulations, i.e., audio deepfakes as well as image and video deepfakes. These projects are usually supported by public or private funding and are in close contact with universities and research institutions.
For this purpose, the Defense Advanced Research Projects Agency (DARPA) runs the Semantic Forensics (SemaFor) program. [62] [63] Leveraging some of the research from DARPA's Media Forensics (MediFor) program, [64] [65] these semantic detection algorithms will have to determine whether a media object has been generated or manipulated, in order to automate the analysis of media provenance and uncover the intent behind the falsification of various content. [66] [62]
Another research program is the Preserving Media Trustworthiness in the Artificial Intelligence Era (PREMIER) [67] program, funded by the Italian Ministry of Education, University and Research (MIUR) and run by five Italian universities. PREMIER will pursue novel hybrid approaches to obtain forensic detectors that are more interpretable and secure. [68]
DEEP-VOICE [69] is a publicly available dataset intended for research into systems that detect when speech has been generated with neural networks through a process called retrieval-based voice conversion (RVC). Preliminary research showed numerous statistically significant differences between features found in human speech and those generated by artificial intelligence algorithms.
In the last few years, numerous challenges have been organized to push this field of audio deepfake research even further.
The most famous world challenge is ASVspoof, [45] the Automatic Speaker Verification Spoofing and Countermeasures Challenge. It is a biennial, community-led initiative that aims to promote the consideration of spoofing and the development of countermeasures. [70]
Another recent challenge is the ADD [71] (Audio Deepfake Detection), which considers fake situations in more realistic scenarios. [72]
The Voice Conversion Challenge [73] is another biennial challenge, created out of the need to compare different voice conversion systems and approaches using the same voice data.
Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech-to-text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.
Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. The reverse process is speech recognition.
Human image synthesis is technology that can be applied to make believable and even photorealistic renditions of human-likenesses, moving or still. It has effectively existed since the early 2000s. Many films using computer generated imagery have featured synthetic images of human-like characters digitally composited onto the real or other simulated film material. Towards the end of the 2010s, deep learning artificial intelligence has been applied to synthesize images and video that look like humans, without need for human assistance once the training phase has been completed, whereas the old school 7D-route required massive amounts of human work.
Music and artificial intelligence (AI) is the development of music software programs which use AI to generate music. As with applications in other fields, AI in music also simulates mental tasks. A prominent feature is the capability of an AI algorithm to learn based on past data, such as in computer accompaniment technology, wherein the AI is capable of listening to a human performer and performing accompaniment. Artificial intelligence also drives interactive composition technology, wherein a computer composes music in response to a live performance. There are other AI applications in music that cover not only music composition, production, and performance but also how music is marketed and consumed. Several music player programs have also been developed to use voice recognition and natural language processing technology for music voice control. Current research includes the application of AI in music composition, performance, theory and digital sound processing.
Synthetic data is information that is artificially generated rather than produced by real-world events. Typically created using algorithms, synthetic data can be deployed to validate mathematical models and to train machine learning models.
Deep learning is a subset of machine learning methods based on neural networks with representation learning. The adjective "deep" refers to the use of multiple layers in the network. Methods used can be supervised, semi-supervised or unsupervised.
Adversarial machine learning is the study of the attacks on machine learning algorithms, and of the defenses against such attacks. A survey from May 2020 exposes the fact that practitioners report a dire need for better protecting machine learning systems in industrial applications.
A generative adversarial network (GAN) is a class of machine learning frameworks and a prominent framework for approaching generative AI. The concept was initially developed by Ian Goodfellow and his colleagues in June 2014. In a GAN, two neural networks contest with each other in the form of a zero-sum game, where one agent's gain is another agent's loss.
WaveNet is a deep neural network for generating raw audio. It was created by researchers at London-based AI firm DeepMind. The technique, outlined in a paper in September 2016, is able to generate relatively realistic-sounding human-like voices by directly modelling waveforms using a neural network method trained with recordings of real speech. Tests with US English and Mandarin reportedly showed that the system outperforms Google's best existing text-to-speech (TTS) systems, although as of 2016 its text-to-speech synthesis still was less convincing than actual human speech. WaveNet's ability to generate raw waveforms means that it can model any kind of audio, including music.
Generative audio refers to the creation of audio files from databases of audio clips. This technology differs from synthesized voices such as Apple's Siri or Amazon's Alexa, which use a collection of fragments that are stitched together on demand.
Deepfakes were originally defined as synthetic media that have been digitally manipulated to replace one person's likeness convincingly with that of another. The term was coined in 2017 by a Reddit user, and has later been expanded to cover any videos, pictures, or audio made with artificial intelligence to appear real, for example realistic-looking images of people who do not exist. While the act of creating fake content is not new, deepfakes leverage tools and techniques from machine learning and artificial intelligence, including facial recognition algorithms and artificial neural networks such as variational autoencoders (VAEs) and generative adversarial networks (GANs). In turn the field of image forensics develops techniques to detect manipulated images. Deepfakes have garnered widespread attention for their potential use in creating child sexual abuse material, celebrity pornographic videos, revenge porn, fake news, hoaxes, bullying, and financial fraud. The spreading of disinformation and hate speech through deepfakes has a potential to undermine core functions and norms of democratic systems by interfering with people's ability to participate in decisions that affect them, determine collective agendas and express political will through informed decision-making. Both the information technology industry and government have responded with recommendations to detect and limit their use.
A Tsetlin machine is an artificial intelligence algorithm based on propositional logic.
Artificial intelligence art is visual artwork created through the use of an artificial intelligence (AI) program.
Synthetic media is a catch-all term for the artificial production, manipulation, and modification of data and media by automated means, especially through the use of artificial intelligence algorithms, such as for the purpose of misleading people or changing an original meaning. Synthetic media as a field has grown rapidly since the creation of generative adversarial networks, primarily through the rise of deepfakes as well as music synthesis, text generation, human image synthesis, speech synthesis, and more. Though experts use the term "synthetic media," individual methods such as deepfakes and text synthesis are sometimes not referred to as such by the media but instead by their respective terminology. Significant attention arose towards the field of synthetic media starting in 2017 when Motherboard reported on the emergence of AI-altered pornographic videos to insert the faces of famous actresses. Potential hazards of synthetic media include the spread of misinformation, further loss of trust in institutions such as media and government, the mass automation of creative and journalistic jobs and a retreat into AI-generated fantasy worlds. Synthetic media is an applied form of artificial imagination.
15.ai is a non-commercial freeware artificial intelligence web application that generates natural emotive high-fidelity text-to-speech voices from an assortment of fictional characters from a variety of media sources. Developed by a pseudonymous MIT researcher under the name 15, the project uses a combination of audio synthesis algorithms, speech synthesis deep neural networks, and sentiment analysis models to generate and serve emotive character voices faster than real-time, particularly those with a very small amount of trainable data.
Self-supervised learning (SSL) is a paradigm in machine learning where a model is trained on a task using the data itself to generate supervisory signals, rather than relying on external labels provided by humans. In the context of neural networks, self-supervised learning aims to leverage inherent structures or relationships within the input data to create meaningful training signals. SSL tasks are designed so that solving it requires capturing essential features or relationships in the data. The input data is typically augmented or transformed in a way that creates pairs of related samples. One sample serves as the input, and the other is used to formulate the supervisory signal. This augmentation can involve introducing noise, cropping, rotation, or other transformations. Self-supervised learning more closely imitates the way humans learn to classify objects.
Deep learning speech synthesis refers to the application of deep learning models to generate natural-sounding human speech from written text (text-to-speech) or spectrum (vocoder). Deep neural networks (DNN) are trained using a large amount of recorded speech and, in the case of a text-to-speech system, the associated labels and/or input text.
ElevenLabs is a software company that specializes in developing natural-sounding speech synthesis software using deep learning.
Generative artificial intelligence is artificial intelligence capable of generating text, images, videos, or other data using generative models, often in response to prompts. Generative AI models learn the patterns and structure of their input training data and then generate new data that has similar characteristics.
Audio inpainting is an audio restoration task which deals with the reconstruction of missing or corrupted portions of a digital audio signal. Inpainting techniques are employed when parts of the audio have been lost due to various factors such as transmission errors, data corruption or errors during recording.