Deep learning speech synthesis

Deep learning speech synthesis refers to the application of deep learning models to generate natural-sounding human speech from written text (text-to-speech) or from acoustic features such as spectrograms (neural vocoder). Deep neural networks are trained using large amounts of recorded speech and, in the case of a text-to-speech system, the associated labels and/or input text.

Formulation

Given an input text or a sequence of linguistic units Y, the target speech X can be derived by

X = arg max P(X | Y, θ)

where θ is the set of model parameters.

Typically, the input text is first passed to an acoustic feature generator, and the resulting acoustic features are then passed to a neural vocoder. For the acoustic feature generator, the loss function is typically an L1 loss (mean absolute error, MAE) or L2 loss (mean squared error, MSE). These loss functions implicitly assume that the output acoustic feature distributions are Laplacian or Gaussian, respectively. In practice, since the human voice band ranges from approximately 300 to 4000 Hz, the loss function can be designed to penalize errors in this range more heavily:

loss = loss_other + α · loss_voice

where loss_voice is the loss over the human voice band and α is a scalar weight, typically around 0.5. The acoustic feature is typically a spectrogram or a mel-scale spectrogram. These features capture the time-frequency structure of the speech signal and are thus sufficient to generate intelligible outputs. The mel-frequency cepstrum features used in speech recognition are not suitable for speech synthesis, as they discard too much information.
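The band-weighted loss described above can be sketched as follows. This is a minimal illustration, not code from any cited system; the mel-filterbank settings (80 bins over 0-8000 Hz) and the weighting scheme are illustrative assumptions:

```python
import numpy as np

# Illustrative mel settings (assumptions, not from any cited system).
N_MELS = 80
F_MIN, F_MAX = 0.0, 8000.0

def mel_bin_centers(n_mels, f_min, f_max):
    """Approximate center frequency (Hz) of each mel bin."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels)
    return mel_to_hz(mels)

def band_weighted_l1(pred, target, alpha=0.5):
    """L1 loss with an extra alpha-weighted penalty on the voice band.

    pred, target: (frames, N_MELS) mel-spectrograms.
    Implements loss = loss_all + alpha * loss_voice_band.
    """
    centers = mel_bin_centers(N_MELS, F_MIN, F_MAX)
    voice = (centers >= 300.0) & (centers <= 4000.0)  # ~300-4000 Hz bins
    err = np.abs(pred - target)
    return err.mean() + alpha * err[:, voice].mean()
```

With α = 0.5, an error inside the voice band contributes roughly 1.5 times as much to the total loss as an equal error outside it.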

History

A stack of dilated causal convolutional layers used in WaveNet

In September 2016, DeepMind proposed WaveNet, a deep generative model of raw audio waveforms, demonstrating that deep learning-based models can model raw waveforms and generate speech from acoustic features such as spectrograms or mel-spectrograms. Although WaveNet was initially considered too computationally expensive and slow for use in consumer products, a year after its release DeepMind unveiled a modified version known as "Parallel WaveNet," a production model 1,000 times faster than the original. [1]

In early 2017, Mila proposed char2wav, a model that produces raw waveforms in an end-to-end fashion. In the same year, Google and Facebook proposed Tacotron and VoiceLoop, respectively, to generate acoustic features directly from input text; months later, Google proposed Tacotron 2, which combined the WaveNet vocoder with the revised Tacotron architecture to perform end-to-end speech synthesis.

A comparison of the alignments (attentions) between Tacotron and a modified variant of Tacotron

A 2018 Google AI study of Tacotron demonstrated that neural networks could produce highly natural speech but required substantial training data, typically tens of hours of audio, to achieve acceptable quality. Tacotron employed an encoder-decoder architecture with attention mechanisms to convert input text into mel-spectrograms, which were then converted to waveforms by a separate neural vocoder. When trained on smaller datasets, such as 2 hours of speech, the output quality degraded though the speech remained intelligible, and with just 24 minutes of training data Tacotron failed to produce intelligible speech. [2]

HiFi-GAN model architecture, which consists of one generator and two discriminators. GAN-based vocoder implementations became widespread after the release of HiFi-GAN.

In 2019, Microsoft Research introduced FastSpeech, which addressed speed limitations in autoregressive models like Tacotron. [3] FastSpeech utilized a non-autoregressive architecture that enabled parallel sequence generation, significantly reducing inference time while maintaining audio quality. Its feedforward transformer network with length regulation allowed for one-shot prediction of the full mel-spectrogram sequence, avoiding the sequential dependencies that bottlenecked previous approaches. [3] In 2020, HiFi-GAN emerged, a generative adversarial network (GAN)-based vocoder that improved the efficiency of waveform generation while producing high-fidelity speech. [4] This was followed by Glow-TTS, which introduced a flow-based approach allowing both fast inference and voice style transfer. [5]
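The core idea of length regulation can be sketched in a few lines. This is an illustrative simplification of the FastSpeech length regulator, not the official implementation: each phoneme-level hidden state is repeated according to its predicted duration, expanding the sequence to frame level in a single parallel step rather than frame-by-frame autoregression.

```python
import numpy as np

def length_regulate(hidden, durations):
    """Expand phoneme-level states to frame level.

    hidden: (num_phonemes, dim) encoder hidden states.
    durations: (num_phonemes,) integer frame counts per phoneme.
    Returns (sum(durations), dim) frame-level states.
    """
    return np.repeat(hidden, durations, axis=0)

# Toy example: 3 phoneme states with predicted durations 2, 1, 3 frames.
phoneme_states = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [0.5, 0.5]])
durations = np.array([2, 1, 3])
frames = length_regulate(phoneme_states, durations)
print(frames.shape)  # (6, 2): total frames equal the sum of durations
```

Because the expansion is a single vectorized operation, the decoder can then predict all mel-spectrogram frames in parallel.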

In March 2020, a Massachusetts Institute of Technology researcher under the pseudonym 15 demonstrated data-efficient deep learning speech synthesis through 15.ai, a web application capable of generating high-quality speech using only 15 seconds of training data, [6] [7] compared to previous systems that required tens of hours. [8] The system implemented a unified multi-speaker model that enabled simultaneous training of multiple voices through speaker embeddings, allowing the model to learn shared patterns across different voices even when individual voices lacked examples of certain emotional contexts. [9] The platform integrated sentiment analysis through DeepMoji for emotional expression and supported precise pronunciation control via ARPABET phonetic transcriptions. [10] The 15-second data efficiency benchmark was later corroborated by OpenAI in 2024. [11]

Semi-supervised learning

Self-supervised learning techniques have gained attention for making better use of unlabelled data. Research has shown that, with the aid of self-supervised losses, the need for paired text-speech data decreases. [12] [13]

Zero-shot speaker adaptation

Zero-shot speaker adaptation is promising because a single model can generate speech with a variety of speaker styles and characteristics. In June 2018, Google proposed using pre-trained speaker verification models as speaker encoders to extract speaker embeddings. [14] The speaker encoders then become part of the neural text-to-speech model, determining the style and characteristics of the output speech. This work demonstrated that a single model can generate speech in multiple speaker styles.
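A common way to wire a speaker embedding into a TTS model is to broadcast it over time and concatenate it to the text encoder outputs, so the decoder is conditioned on the target speaker at every step. The sketch below is an illustrative simplification (the function name and dimensions are hypothetical, not from the cited paper):

```python
import numpy as np

def condition_on_speaker(encoder_out, speaker_emb):
    """Concatenate a fixed speaker embedding to every encoder timestep.

    encoder_out: (T, d_text) text-encoder outputs.
    speaker_emb: (d_spk,) embedding from a pre-trained speaker encoder.
    Returns (T, d_text + d_spk) conditioned states.
    """
    tiled = np.tile(speaker_emb, (encoder_out.shape[0], 1))
    return np.concatenate([encoder_out, tiled], axis=1)

enc = np.random.randn(50, 256)  # 50 text-encoder timesteps
spk = np.random.randn(64)       # d-vector from a speaker verification model
cond = condition_on_speaker(enc, spk)
print(cond.shape)  # (50, 320)
```

At inference time, swapping in the embedding of an unseen speaker steers the output voice without retraining, which is what makes the adaptation "zero-shot".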

Neural vocoder

Speech synthesis example using the HiFi-GAN neural vocoder

In deep learning-based speech synthesis, neural vocoders play an important role in generating high-quality speech from acoustic features. The WaveNet model proposed in 2016 achieves excellent speech quality. WaveNet factorises the joint probability of a waveform x = {x_1, ..., x_T} as a product of conditional probabilities:

p(x) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t-1}; θ)

where θ denotes the model parameters, including many dilated convolution layers. Each audio sample x_t is therefore conditioned on the samples at all previous timesteps. However, the autoregressive nature of WaveNet makes inference dramatically slow. To solve this problem, Parallel WaveNet [15] was proposed: an inverse autoregressive flow-based model trained by knowledge distillation from a pre-trained teacher WaveNet. Since such inverse autoregressive flow-based models are non-autoregressive at inference time, they can generate speech faster than real-time. Meanwhile, Nvidia proposed the flow-based WaveGlow [16] model, which can also generate speech faster than real-time. However, despite its high inference speed, Parallel WaveNet is limited by the need for a pre-trained teacher WaveNet, while WaveGlow can take many weeks to converge on limited computing hardware. Both issues were addressed by Parallel WaveGAN, [17] which learns to produce speech through a multi-resolution spectral loss and GAN training strategies.
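The practical effect of WaveNet's stack of dilated causal convolutions is a large receptive field: each output sample is conditioned on many past samples. The calculation below is a standard receptive-field formula applied to a commonly cited WaveNet-style configuration (kernel size 2, dilations doubling each layer); exact hyperparameters vary by implementation:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of a stack of dilated causal convolutions.

    Each layer with dilation d and kernel size k extends the receptive
    field by (k - 1) * d samples; the input sample itself contributes 1.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# One block of 10 layers with dilations 1, 2, 4, ..., 512.
dilations = [2 ** i for i in range(10)]
print(receptive_field(2, dilations))  # 1024
```

Doubling the dilation at each layer thus grows the receptive field exponentially with depth while the parameter count grows only linearly, which is why the architecture can model long-range structure in raw audio.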

Related Research Articles

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech-to-text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. The reverse process is speech recognition.

A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. When applied to an audio signal, spectrograms are sometimes called sonographs, voiceprints, or voicegrams. When the data are represented in a 3D plot they may be called waterfall displays.

Adversarial machine learning is the study of the attacks on machine learning algorithms, and of the defenses against such attacks. A survey from May 2020 exposes the fact that practitioners report a dire need for better protecting machine learning systems in industrial applications.

Multimodal learning is a type of deep learning that integrates and processes multiple types of data, referred to as modalities, such as text, audio, images, or video. This integration allows for a more holistic understanding of complex data, improving model performance in tasks like visual question answering, cross-modal retrieval, text-to-image generation, aesthetic ranking, and image captioning.

Neural machine translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model.

A generative adversarial network (GAN) is a class of machine learning frameworks and a prominent framework for approaching generative artificial intelligence. The concept was initially developed by Ian Goodfellow and his colleagues in June 2014. In a GAN, two neural networks contest with each other in the form of a zero-sum game, where one agent's gain is another agent's loss.

WaveNet is a deep neural network for generating raw audio. It was created by researchers at London-based AI firm DeepMind. The technique, outlined in a paper in September 2016, is able to generate relatively realistic-sounding human-like voices by directly modelling waveforms using a neural network method trained with recordings of real speech. Tests with US English and Mandarin reportedly showed that the system outperforms Google's best existing text-to-speech (TTS) systems, although as of 2016 its text-to-speech synthesis still was less convincing than actual human speech. WaveNet's ability to generate raw waveforms means that it can model any kind of audio, including music.

A transformer is a deep learning architecture that was developed by researchers at Google and is based on the multi-head attention mechanism, which was proposed in the 2017 paper "Attention Is All You Need". Text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished.

Synthetic media is a catch-all term for the artificial production, manipulation, and modification of data and media by automated means, especially through the use of artificial intelligence algorithms, such as for the purpose of misleading people or changing an original meaning. Synthetic media as a field has grown rapidly since the creation of generative adversarial networks, primarily through the rise of deepfakes as well as music synthesis, text generation, human image synthesis, speech synthesis, and more. Though experts use the term "synthetic media," individual methods such as deepfakes and text synthesis are sometimes not referred to as such by the media but instead by their respective terminology. Significant attention arose towards the field of synthetic media starting in 2017 when Motherboard reported on the emergence of AI-altered pornographic videos to insert the faces of famous actresses. Potential hazards of synthetic media include the spread of misinformation, further loss of trust in institutions such as media and government, the mass automation of creative and journalistic jobs and a retreat into AI-generated fantasy worlds. Synthetic media is an applied form of artificial imagination.

An energy-based model (EBM) is an application of canonical ensemble formulation from statistical physics for learning from data. The approach prominently appears in generative artificial intelligence.

Audio deepfake technology, also referred to as voice cloning or deepfake audio, is an application of artificial intelligence designed to generate speech that convincingly mimics specific individuals, often synthesizing phrases or sentences they have never spoken. Initially developed with the intent to enhance various aspects of human life, it has practical applications such as generating audiobooks and assisting individuals who have lost their voices due to medical conditions. Additionally, it has commercial uses, including the creation of personalized digital assistants, natural-sounding text-to-speech systems, and advanced speech translation services.

Lyra is a lossy audio codec developed by Google that is designed for compressing speech at very low bitrates. Unlike most other audio formats, it compresses data using a machine learning-based algorithm.

A text-to-image model is a machine learning model which takes an input natural language description and produces an image matching that description.

In machine learning, diffusion models, also known as diffusion probabilistic models or score-based generative models, are a class of latent variable generative models. A diffusion model consists of three major components: the forward process, the reverse process, and the sampling procedure. The goal of diffusion models is to learn a diffusion process for a given dataset, such that the process can generate new elements that are distributed similarly to the original dataset. A diffusion model models data as generated by a diffusion process, whereby a new datum performs a random walk with drift through the space of all possible data. A trained diffusion model can be sampled in many ways, with different efficiency and quality.

NSynth is a WaveNet-based autoencoder for synthesizing audio, outlined in a paper in April 2017.

In machine learning, a neural scaling law is an empirical scaling law that describes how neural network performance changes as key factors are scaled up or down. These factors typically include the number of parameters, training dataset size, and training cost.

Audio inpainting is an audio restoration task which deals with the reconstruction of missing or corrupted portions of a digital audio signal. Inpainting techniques are employed when parts of the audio have been lost due to various factors such as transmission errors, data corruption or errors during recording.

Whisper is a machine learning model for speech recognition and transcription, created by OpenAI and first released as open-source software in September 2022.

15.ai was a free non-commercial web application that used artificial intelligence to generate text-to-speech voices of fictional characters from popular media. Created by an artificial intelligence researcher known as 15 during their time at the Massachusetts Institute of Technology, the application allowed users to make characters from video games, television shows, and movies speak custom text with emotional inflections faster than real-time. The platform was notable for its ability to generate convincing voice output using minimal training data—the name "15.ai" referenced the creator's claim that a voice could be cloned with just 15 seconds of audio. It was an early example of an application of generative artificial intelligence during the initial stages of the AI boom.

References

  1. van den Oord, Aäron (2017-11-12). "High-fidelity speech synthesis with WaveNet". DeepMind. Retrieved 2022-06-05.
  2. "Audio samples from "Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis"". 2018-08-30. Archived from the original on 2020-11-11. Retrieved 2022-06-05.
  3. Ren, Yi (2019). "FastSpeech: Fast, Robust and Controllable Text to Speech". arXiv: 1905.09263 [cs.CL].
  4. Kong, Jungil (2020). "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis". arXiv: 2010.05646 [cs.SD].
  5. Kim, Jaehyeon (2020). "Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search". arXiv: 2005.11129 [eess.AS].
  6. Ng, Andrew (April 1, 2020). "Voice Cloning for the Masses". DeepLearning.AI. Archived from the original on December 28, 2024. Retrieved December 22, 2024.
  7. Chandraseta, Rionaldi (January 21, 2021). "Generate Your Favourite Characters' Voice Lines using Machine Learning". Towards Data Science. Archived from the original on January 21, 2021. Retrieved December 18, 2024.
  8. "Audio samples from "Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis"". 2018-08-30. Archived from the original on 2020-11-11. Retrieved 2022-06-05.
  9. Temitope, Yusuf (December 10, 2024). "15.ai Creator reveals journey from MIT Project to internet phenomenon". The Guardian. Archived from the original on December 28, 2024. Retrieved December 25, 2024.
  10. Kurosawa, Yuki (January 19, 2021). "ゲームキャラ音声読み上げソフト「15.ai」公開中。『Undertale』や『Portal』のキャラに好きなセリフを言ってもらえる" [Game Character Voice Reading Software "15.ai" Now Available. Get Characters from Undertale and Portal to Say Your Desired Lines]. AUTOMATON (in Japanese). Archived from the original on January 19, 2021. Retrieved December 18, 2024.
  11. "Navigating the Challenges and Opportunities of Synthetic Voices". OpenAI. March 9, 2024. Archived from the original on November 25, 2024. Retrieved December 18, 2024.
  12. Chung, Yu-An (2018). "Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis". arXiv: 1808.10128 [cs.CL].
  13. Ren, Yi (2019). "Almost Unsupervised Text to Speech and Automatic Speech Recognition". arXiv: 1905.06791 [cs.CL].
  14. Jia, Ye (2018). "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis". arXiv: 1806.04558 [cs.CL].
  15. van den Oord, Aaron (2018). "Parallel WaveNet: Fast High-Fidelity Speech Synthesis". arXiv: 1711.10433 [cs.CL].
  16. Prenger, Ryan (2018). "WaveGlow: A Flow-based Generative Network for Speech Synthesis". arXiv: 1811.00002 [cs.SD].
  17. Yamamoto, Ryuichi (2019). "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram". arXiv: 1910.11480 [eess.AS].