Audio Deepfakes
   HOME

TheInfoList



OR:

Audio deepfake technology, also referred to as voice cloning or deepfake audio, is an application of
artificial intelligence Artificial intelligence (AI) is the capability of computer, computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making. It is a field of re ...
designed to generate
speech Speech is the use of the human voice as a medium for language. Spoken language combines vowel and consonant sounds to form units of meaning like words, which belong to a language's lexicon. There are many different intentional speech acts, suc ...
that convincingly mimics specific individuals, often synthesizing phrases or sentences they have never spoken. Initially developed with the intent to enhance various aspects of human life, it has practical applications such as generating
audiobooks An audiobook (or a talking book) is a recording of a book or other work being read out loud. A reading of the complete text is described as "unabridged", while readings of shorter versions are abridgements. Spoken audio has been available in sch ...
and assisting individuals who have lost their voices due to medical conditions. Additionally, it has commercial uses, including the creation of personalized digital assistants, natural-sounding
text-to-speech Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or Computer hardware, hardware products. A text-to-speech (TTS) system conv ...
systems, and advanced speech
translation services The language industry is the sector of activity dedicated to facilitating multilingual communication, both oral and written. According to the European Commission's Directorate-General of Translation, the language industry comprises following acti ...
.


Incidents of fraud

Audio
deepfake ''Deepfakes'' (a portmanteau of and ) are images, videos, or audio that have been edited or generated using artificial intelligence, AI-based tools or AV editing software. They may depict real or fictional people and are considered a form of ...
s, referred to as audio manipulations beginning in the early 2020s, are becoming widely accessible using simple mobile devices or
personal computer A personal computer, commonly referred to as PC or computer, is a computer designed for individual use. It is typically used for tasks such as Word processor, word processing, web browser, internet browsing, email, multimedia playback, and PC ...
s. These tools have also been used to spread misinformation using audio. This has led to
cybersecurity Computer security (also cybersecurity, digital security, or information technology (IT) security) is a subdiscipline within the field of information security. It consists of the protection of computer software, systems and networks from thr ...
concerns among the global public about the side effects of using audio deepfakes, including its possible role in disseminating
misinformation Misinformation is incorrect or misleading information. Misinformation and disinformation are not interchangeable terms: misinformation can exist with or without specific malicious intent, whereas disinformation is distinct in that the information ...
and
disinformation Disinformation is misleading content deliberately spread to deceive people, or to secure economic or political gain and which may cause public harm. Disinformation is an orchestrated adversarial activity in which actors employ strategic dece ...
in audio-based social media platforms. People can use them as a logical access voice spoofing technique, where they can be used to manipulate public opinion for propaganda, defamation, or
terrorism Terrorism, in its broadest sense, is the use of violence against non-combatants to achieve political or ideological aims. The term is used in this regard primarily to refer to intentional violence during peacetime or in the context of war aga ...
. Vast amounts of voice recordings are daily transmitted over the Internet, and spoofing detection is challenging. Audio deepfake attackers have targeted individuals and organizations, including politicians and governments. In 2019, scammers using AI impersonated the voice of the CEO of a German energy company and directed the CEO of its UK subsidiary to transfer . In early 2020, the same technique impersonated a company director as part of an elaborate scheme that convinced a branch manager to transfer $35 million. According to a 2023 global
McAfee McAfee Corp. ( ), formerly known as McAfee Associates, Inc. from 1987 to 1997 and 2004 to 2014, Network Associates Inc. from 1997 to 2004, and Intel Security Group from 2014 to 2017, is an American proprietary software company focused on online ...
survey, one person in ten reported having been targeted by an AI voice cloning scam; 77% of these targets reported losing money to the scam. Audio deepfakes could also pose a danger to voice ID systems currently used by financial institutions. In March 2023, the United States
Federal Trade Commission The Federal Trade Commission (FTC) is an independent agency of the United States government whose principal mission is the enforcement of civil (non-criminal) United States antitrust law, antitrust law and the promotion of consumer protection. It ...
issued a warning to consumers about the use of AI to fake the voice of a family member in distress asking for money. In October 2023, during the start of the British Labour Party's conference in
Liverpool Liverpool is a port City status in the United Kingdom, city and metropolitan borough in Merseyside, England. It is situated on the eastern side of the River Mersey, Mersey Estuary, near the Irish Sea, north-west of London. With a population ...
, an audio deepfake of Labour leader
Keir Starmer Sir Keir Rodney Starmer (born 2 September 1962) is a British politician and lawyer who has served as Prime Minister of the United Kingdom since 2024 and as Leader of the Labour Party (UK), Leader of the Labour Party since 2020. He previously ...
was released that falsely portrayed him verbally abusing his staffers and criticizing Liverpool. That same month, an audio deepfake of Slovak politician
Michal Šimečka Michal Šimečka (born 10 May 1984) is a Slovak politician, journalist, and researcher, who served as a Vice-President of the European Parliament between 2022 and 2023. He also became a Member of the European Parliament between 2019 and 2023. In ...
falsely claimed to capture him discussing ways to rig the upcoming election. During the campaign for the 2024 New Hampshire Democratic presidential primary, over 20,000 voters received robocalls from an AI-impersonated President
Joe Biden Joseph Robinette Biden Jr. (born November 20, 1942) is an American politician who was the 46th president of the United States from 2021 to 2025. A member of the Democratic Party (United States), Democratic Party, he served as the 47th vice p ...
urging them not to vote. The New Hampshire attorney general said this violated state election laws, and alleged involvement by Life Corporation and Lingo Telecom. In February 2024, the United States
Federal Communications Commission The Federal Communications Commission (FCC) is an independent agency of the United States government that regulates communications by radio, television, wire, internet, wi-fi, satellite, and cable across the United States. The FCC maintains j ...
banned the use of AI to fake voices in robocalls. That same month, political consultant Steve Kramer admitted that he had commissioned the calls for $500. He said that he wanted to call attention to the need for rules governing the use of AI in political campaigns. In May, the FCC said that Kramer had violated federal law by spoofing the number of a local political figure, and proposed a fine of $6 million. Four New Hampshire counties indicted Kramer on felony counts of voter suppression, and impersonating a candidate, a misdemeanor.


Categories

Audio deepfakes can be divided into three different categories:


Replay-based

Replay-based deepfakes are malicious works that aim to reproduce a recording of the interlocutor's voice. There are two types: ''far-field'' detection and ''cut-and-paste'' detection. In far-field detection, a microphone recording of the victim is played as a test segment on a hands-free phone. On the other hand, cut-and-paste involves faking the requested sentence from a text-dependent system. Text-dependent speaker verification can be used to defend against replay-based attacks. A current technique that detects end-to-end replay attacks is the use of deep convolutional neural networks.


Synthetic-based

The category based on
speech synthesis Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal langua ...
refers to the artificial production of human speech, using software or hardware system programs. Speech synthesis includes text-to-speech, which aims to transform the text into acceptable and natural speech in real-time, making the speech sound in line with the text input, using the rules of linguistic description of the text. A classical system of this type consists of three modules: a text analysis model, an acoustic model, and a
vocoder A vocoder (, a portmanteau of ''vo''ice and en''coder'') is a category of speech coding that analyzes and synthesizes the human voice signal for audio data compression, multiplexing, voice encryption or voice transformation. The vocoder wa ...
. The generation usually has to follow two essential steps. It is necessary to collect clean and well-structured raw audio with the transcripted text of the original speech audio sentence. Second, the text-to-speech model must be trained using these data to build a synthetic audio generation model. Specifically, the transcribed text with the target speaker's voice is the input of the generation model. The text analysis module processes the input text and converts it into linguistic features. Then, the acoustic module extracts the parameters of the target speaker from the audio data based on the linguistic features generated by the text analysis module. Finally, the vocoder learns to create vocal waveforms based on the parameters of the acoustic features. The final audio file is generated, including the synthetic simulation audio in a waveform format, creating speech audio in the voice of many speakers, even those not in training. The first breakthrough in this regard was introduced by WaveNet, a
neural network A neural network is a group of interconnected units called neurons that send signals to one another. Neurons can be either biological cells or signal pathways. While individual neurons are simple, many of them together in a network can perfor ...
for generating raw audio
waveform In electronics, acoustics, and related fields, the waveform of a signal is the shape of its Graph of a function, graph as a function of time, independent of its time and Magnitude (mathematics), magnitude Scale (ratio), scales and of any dis ...
s capable of emulating the characteristics of many different speakers. This network has been overtaken over the years by other systems which synthesize highly realistic artificial voices within everyone’s reach. Text-to-speech is highly dependent on the quality of the voice corpus used to realize the system, and creating an entire voice corpus is expensive. Another disadvantage is that speech synthesis systems do not recognize periods or special characters. Also, ambiguity problems are persistent, as two words written in the same way can have different meanings.


Imitation-based

Audio deepfake based on imitation is a way of transforming an original speech from one speaker - the original - so that it sounds spoken like another speaker - the target one. An imitation-based algorithm takes a spoken signal as input and alters it by changing its style, intonation, or prosody, trying to mimic the target voice without changing the linguistic information. This technique is also known as voice conversion. This method is often confused with the previous synthetic-based method, as there is no clear separation between the two approaches regarding the generation process. Indeed, both methods modify acoustic-spectral and style characteristics of the speech audio signal, but the Imitation-based usually keeps the input and output text unaltered. This is obtained by changing how this sentence is spoken to match the target speaker's characteristics. Voices can be imitated in several ways, such as using humans with similar voices that can mimic the original speaker. In recent years, the most popular approach involves the use of particular neural networks called
generative adversarial networks A generative adversarial network (GAN) is a class of machine learning frameworks and a prominent framework for approaching generative artificial intelligence. The concept was initially developed by Ian Goodfellow and his colleagues in June ...
(GAN) due to their flexibility as well as high-quality results. Then, the original audio signal is transformed to say a speech in the target audio using an imitation generation method that generates a new speech, shown in the fake one.


Detection methods

The audio deepfake detection task determines whether the given speech audio is real or fake. Recently, this has become a hot topic in the
forensic Forensic science combines principles of law and science to investigate criminal activity. Through crime scene investigations and laboratory analysis, forensic scientists are able to link suspects to evidence. An example is determining the time and ...
research community, trying to keep up with the rapid evolution of counterfeiting techniques. In general, deepfake detection methods can be divided into two categories based on the aspect they leverage to perform the detection task. The first focuses on low-level aspects, looking for artifacts introduced by the generators at the sample level. The second, instead, focus on higher-level features representing more complex aspects as the semantic content of the speech audio recording. Many
machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...
models have been developed using different strategies to detect fake audio. Most of the time, these algorithms follow a three-steps procedure: # Each speech audio recording must be preprocessed and transformed into appropriate audio features; # The computed features are fed into the detection model, which performs the necessary operations, such as the training process, essential to discriminate between real and fake speech audio; # The output is fed into the final module to produce a prediction probability of the ''Fake'' class or the ''Real'' one. Following the ASVspoof challenge nomenclature, the Fake audio is indicated with the term ''"Spoof,"'' the Real instead is called ''"Bonafide."'' Over the years, many researchers have shown that machine learning approaches are more accurate than deep learning methods, regardless of the features used. However, the scalability of machine learning methods is not confirmed due to excessive training and manual feature extraction, especially with many audio files. Instead, when deep learning algorithms are used, specific transformations are required on the audio files to ensure that the algorithms can handle them. There are several open-source implementations of different detection methods, and usually many research groups release them on a public hosting service like
GitHub GitHub () is a Proprietary software, proprietary developer platform that allows developers to create, store, manage, and share their code. It uses Git to provide distributed version control and GitHub itself provides access control, bug trackin ...
.


Open challenges and future research direction

The audio deepfake is a very recent field of research. For this reason, there are many possibilities for development and improvement, as well as possible threats that adopting this technology can bring to our daily lives. The most important ones are listed below.


Deepfake generation

Regarding the generation, the most significant aspect is the credibility of the victim, i.e., the perceptual quality of the audio deepfake. Several metrics determine the level of accuracy of audio deepfake generation, and the most widely used is the
mean opinion score Mean opinion score (MOS) is a measure used in the domain of Quality of Experience and telecommunications engineering, representing overall quality of a stimulus or system. It is the arithmetic mean over all individual "values on a predefined scale ...
(MOS), which is the arithmetic average of user ratings. Usually, the test to be rated involves perceptual evaluation of sentences made by different speech generation algorithms. This index showed that audio generated by algorithms trained on a single speaker has a higher MOS. The sampling rate also plays an essential role in detecting and generating audio deepfakes. Currently, available datasets have a
sampling rate In signal processing, sampling is the reduction of a continuous-time signal to a discrete-time signal. A common example is the conversion of a sound wave to a sequence of "samples". A sample is a value of the signal at a point in time and/or s ...
of around 16 kHz, significantly reducing speech quality. An increase in the sampling rate could lead to higher quality generation. In March 2020, a
Massachusetts Institute of Technology The Massachusetts Institute of Technology (MIT) is a Private university, private research university in Cambridge, Massachusetts, United States. Established in 1861, MIT has played a significant role in the development of many areas of moder ...
researcher demonstrated data-efficient audio deepfake generation through 15.ai, a
web application A web application (or web app) is application software that is created with web technologies and runs via a web browser. Web applications emerged during the late 1990s and allowed for the server to dynamically build a response to the request, ...
capable of generating high-quality speech using only 15 seconds of training data, compared to previous systems that required tens of hours. The system implemented a unified multi-speaker model that enabled simultaneous training of multiple voices through speaker embeddings, allowing the model to learn shared patterns across different voices even when individual voices lacked examples of certain emotional contexts. The platform integrated
sentiment analysis Sentiment analysis (also known as opinion mining or emotion AI) is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subje ...
through
DeepMoji An emoji ( ; plural emoji or emojis; , ) is a pictogram, logogram, ideogram, or smiley embedded in text and used in electronic messages and web pages. The primary function of modern emoji is to fill in emotional cues otherwise missing from type ...
for emotional expression and supported precise pronunciation control via ARPABET
phonetic transcription Phonetic transcription (also known as Phonetic script or Phonetic notation) is the visual representation of speech sounds (or ''phonetics'') by means of symbols. The most common type of phonetic transcription uses a phonetic alphabet, such as the ...
s. The 15-second data efficiency benchmark was later corroborated by
OpenAI OpenAI, Inc. is an American artificial intelligence (AI) organization founded in December 2015 and headquartered in San Francisco, California. It aims to develop "safe and beneficial" artificial general intelligence (AGI), which it defines ...
in 2024.


Deepfake detection

Focusing on the detection part, one principal weakness affecting recent models is the adopted language. Most studies focus on detecting audio deepfake in the English language, not paying much attention to the most spoken languages like Chinese and Spanish, as well as Hindi and Arabic. It is also essential to consider more factors related to different accents that represent the way of pronunciation strictly associated with a particular individual, location, or nation. In other fields of audio, such as
speaker recognition Speaker recognition is the identification of a person from characteristics of voices. It is used to answer the question "Who is speaking?" The term voice recognition can refer to ''speaker recognition'' or speech recognition. Speaker verification ...
, the accent has been found to influence the performance significantly, so it is expected that this feature could affect the models' performance even in this detection task. In addition, the excessive preprocessing of the audio data has led to a very high and often unsustainable computational cost. For this reason, many researchers have suggested following a
self-supervised learning Self-supervised learning (SSL) is a paradigm in machine learning where a model is trained on a task using the data itself to generate supervisory signals, rather than relying on externally-provided labels. In the context of neural networks, self ...
approach, dealing with unlabeled data to work effectively in detection tasks and improving the model's scalability, and, at the same time, decreasing the computational cost. Training and testing models with real audio data is still an underdeveloped area. Indeed, using audio with real-world background noises can increase the robustness of the fake audio detection models. In addition, most of the effort is focused on detecting synthetic-based audio deepfakes, and few studies are analyzing imitation-based due to their intrinsic difficulty in the generation process.


Defense against deepfakes

Over the years, there has been an increase in techniques aimed at defending against malicious actions that audio deepfake could bring, such as identity theft and manipulation of speeches by the nation's governors. To prevent deepfakes, some suggest using blockchain and other
distributed ledger A distributed ledger (also called a shared ledger or distributed ledger technology or DLT) is a system whereby replicated, shared, and synchronized digital data is geographically spread (distributed) across many sites, countries, or institutions. I ...
technologies (DLT) to identify the provenance of data and track information. Extracting and comparing affective cues corresponding to perceived emotions from digital content has also been proposed to combat deepfakes. Another critical aspect concerns the mitigation of this problem. It has been suggested that it would be better to keep some proprietary detection tools only for those who need them, such as fact-checkers for journalists. That way, those who create the generation models, perhaps for nefarious purposes, would not know precisely what features facilitate the detection of a deepfake, discouraging possible attackers. To improve the detection instead, researchers are trying to generalize the process, looking for preprocessing techniques that improve performance and testing different loss functions used for training.


Research programs

Numerous research groups worldwide are working to recognize media manipulations; i.e., audio deepfakes but also image and video deepfake. These projects are usually supported by public or private funding and are in close contact with universities and research institutions. For this purpose, the
Defense Advanced Research Projects Agency The Defense Advanced Research Projects Agency (DARPA) is a research and development agency of the United States Department of Defense responsible for the development of emerging technologies for use by the military. Originally known as the Adva ...
(DARPA) runs the Semantic Forensics (SemaFor). Leveraging some of the research from the Media Forensics (MediFor) program, also from DARPA, these semantic detection algorithms will have to determine whether a media object has been generated or manipulated, to automate the analysis of media provenance and uncover the intent behind the falsification of various content. Another research program is the Preserving Media Trustworthiness in the Artificial Intelligence Era (PREMIER) program, funded by the Italian Ministry of Education, University and Research (MIUR) and run by five Italian universities. PREMIER will pursue novel hybrid approaches to obtain forensic detectors that are more interpretable and secure. DEEP-VOICE is a publicly available dataset intended for research purposes to develop systems to detect when speech has been generated with neural networks through a process called Retrieval-based Voice Conversion (RVC). Preliminary research showed numerous statistically-significant differences between features found in human speech and that which had been generated by Artificial Intelligence algorithms.


Public challenges

In the last few years, numerous challenges have been organized to push this field of audio deepfake research even further. The most famous world challenge is the ASVspoof, the Automatic Speaker Verification Spoofing and Countermeasures Challenge. This challenge is a bi-annual community-led initiative that aims to promote the consideration of spoofing and the development of countermeasures. Another recent challenge is the ADD—Audio Deepfake Detection—which considers fake situations in a more real-life scenario. Also the Voice Conversion Challenge is a bi-annual challenge, created with the need to compare different voice conversion systems and approaches using the same voice data.


Extended use without permission

In 22 May 2025 it was claimed that
Hoya Corporation is a Japanese company manufacturing optical products such as photomasks, photomask blanks and hard disk drive platters, contact lenses and eyeglass lenses for the health-care market, medical photonics, lasers, photographic filters, medical fl ...
s product ''ReadSpeak'' used recording work done by the actress Gayanne Potter for them in 2021 which at the time she understood would just be used for accessibility and e-learning software, but is now available generally as the voice ''Iona'' and as is used as the announcer on
ScotRail ScotRail Trains Limited, trading as ScotRail (), is a Scottish train operating company that is publicly owned by Scottish Rail Holdings on behalf of the Scottish Government. It has been operating the ScotRail franchise as an operator of las ...
trains


See also

* 15.ai *
Artificial intelligence Artificial intelligence (AI) is the capability of computer, computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making. It is a field of re ...
* *
Deepfake ''Deepfakes'' (a portmanteau of and ) are images, videos, or audio that have been edited or generated using artificial intelligence, AI-based tools or AV editing software. They may depict real or fictional people and are considered a form of ...
*
Deep learning Deep learning is a subset of machine learning that focuses on utilizing multilayered neural networks to perform tasks such as classification, regression, and representation learning. The field takes inspiration from biological neuroscience a ...
* Digital cloning *
Digital signal processing Digital signal processing (DSP) is the use of digital processing, such as by computers or more specialized digital signal processors, to perform a wide variety of signal processing operations. The digital signals processed in this manner are a ...
* Speech analysis *
Speech recognition Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also ...
*
Speech synthesis Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal langua ...
*
Voice changer A voice changer (also known as voice enhancer) is a device which changes the tone or pitch of or adds distortion to the user's voice. Hardware implementations The earliest voice changers were electronic devices usually used over the telephone for t ...


References

{{Media manipulation Deepfakes Digital signal processing Digital forensics Speech synthesis