WaveNet


WaveNet is a deep neural network for generating raw audio. It was created by researchers at London-based AI firm DeepMind. The technique, outlined in a paper in September 2016, [1] is able to generate relatively realistic-sounding human-like voices by directly modelling waveforms using a neural network method trained with recordings of real speech. Tests with US English and Mandarin reportedly showed that the system outperforms Google's best existing text-to-speech (TTS) systems, although as of 2016 its text-to-speech synthesis was still less convincing than actual human speech. [2] WaveNet's ability to generate raw waveforms means that it can model any kind of audio, including music. [3]

History

Generating speech from text is an increasingly common task thanks to the popularity of software such as Apple's Siri, Microsoft's Cortana, Amazon Alexa and the Google Assistant. [4]

Most such systems use a variation of a technique that involves concatenating sound fragments to form recognisable sounds and words. [5] The most common of these is called concatenative TTS. [6] It relies on a large library of speech fragments, recorded from a single speaker, which are then concatenated to produce complete words and sounds. The result can sound unnatural, with an odd cadence and tone. [7] The reliance on a recorded library also makes it difficult to modify or change the voice. [8]
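
The core idea can be illustrated with a minimal sketch, assuming a toy unit library with hypothetical unit names and synthetic placeholder waveforms; a real concatenative system selects units from a large recorded database and smooths the joins far more carefully:

```python
import numpy as np

# Toy "unit library": each entry maps a phonetic unit to a recorded waveform.
# Here the waveforms are synthetic placeholders; a real system stores many
# thousands of fragments recorded from a single speaker.
SAMPLE_RATE = 16000
unit_library = {
    "HH-AH": np.random.randn(1600) * 0.1,   # ~0.1 s fragment
    "AH-L":  np.random.randn(1600) * 0.1,
    "L-OW":  np.random.randn(2400) * 0.1,
}

def synthesise(units, crossfade=160):
    """Concatenate recorded fragments, crossfading at the joins."""
    out = unit_library[units[0]].copy()
    fade = np.linspace(0.0, 1.0, crossfade)
    for name in units[1:]:
        nxt = unit_library[name].copy()
        # Overlap-add a short crossfade so the join is less audible.
        out[-crossfade:] = out[-crossfade:] * (1 - fade) + nxt[:crossfade] * fade
        out = np.concatenate([out, nxt[crossfade:]])
    return out

audio = synthesise(["HH-AH", "AH-L", "L-OW"])  # a crude "hello"
```

Even with crossfading, the joins and the fixed cadence of the recorded units are what give concatenative output its characteristic unnatural quality.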

Another technique, known as parametric TTS, [9] uses mathematical models to recreate sounds that are then assembled into words and sentences. The information required to generate the sounds is stored in the parameters of the model. The characteristics of the output speech are controlled via the inputs to the model, while the speech is typically created using a voice synthesiser known as a vocoder. This can also result in unnatural-sounding audio.
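
A rough sketch of the parametric idea, using a classic source-filter vocoder: a handful of parameters (fundamental frequency, filter coefficients, gain) is turned into sound by exciting a filter. The parameter values below are purely illustrative and not taken from any production system:

```python
import numpy as np
from scipy.signal import lfilter

SAMPLE_RATE = 16000

def vocoder_frame(f0, lpc_coeffs, gain, n_samples=400):
    """Synthesise one voiced frame from parameters (source-filter model)."""
    # Source: an impulse train at the fundamental frequency f0.
    period = int(SAMPLE_RATE / f0)
    excitation = np.zeros(n_samples)
    excitation[::period] = 1.0
    # Filter: an all-pole filter shaped by the (illustrative) LPC coefficients.
    return gain * lfilter([1.0], lpc_coeffs, excitation)

# Illustrative parameters for a vowel-like sound (not from a real model).
frame = vocoder_frame(f0=120.0, lpc_coeffs=[1.0, -1.6, 0.81], gain=0.3)
```

Because everything the listener hears is funnelled through such a simplified source-filter model, the output often carries the "buzzy" artificial quality associated with parametric systems.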

Design and ongoing research

Background

A stack of dilated causal convolutional layers

WaveNet is a type of feedforward neural network known as a deep convolutional neural network (CNN). In WaveNet, the CNN takes a raw signal as an input and synthesises an output one sample at a time. It does so by sampling from a softmax (i.e. categorical) distribution over the next signal value, which is encoded using a μ-law companding transformation and quantized to 256 possible values. [11]
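
A minimal sketch of that output representation, assuming samples normalised to [-1, 1]: the signal is companded with the μ-law transform, quantized to 256 discrete levels, and the next sample is drawn from a softmax distribution over those levels. The network itself is omitted here and its output logits are stubbed with random values:

```python
import numpy as np

MU = 255  # 8-bit mu-law, giving 256 possible values

def mu_law_encode(x, mu=MU):
    """Compand a signal in [-1, 1] and quantize it to mu + 1 integer levels."""
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int64)  # 0..255

def mu_law_decode(idx, mu=MU):
    """Map an integer level back to a waveform value in [-1, 1]."""
    companded = 2 * (np.asarray(idx, dtype=np.float64) / mu) - 1
    return np.sign(companded) * ((1 + mu) ** np.abs(companded) - 1) / mu

def sample_next(logits, rng):
    """Draw the next sample index from the softmax (categorical) distribution."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
logits = rng.normal(size=256)          # stand-in for the network's output
next_index = sample_next(logits, rng)  # one of 256 quantized levels
next_sample = mu_law_decode(next_index)
```

Quantizing to 256 levels keeps the softmax tractable, while the μ-law companding spends more of those levels on quiet signal values, where the ear is most sensitive.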

Initial concept and results

According to the original September 2016 DeepMind research paper WaveNet: A Generative Model for Raw Audio, [12] the network was fed real waveforms of speech in English and Mandarin. As these pass through the network, it learns a set of rules to describe how the audio waveform evolves over time. The trained network can then be used to create new speech-like waveforms at 16,000 samples per second. These waveforms include realistic breaths and lip smacks – but do not conform to any language. [13]

WaveNet is able to accurately model different voices, with the accent and tone of the training input reflected in the output. For example, if it is trained with German speech, it produces German speech. [14] This capability also means that if WaveNet is fed other inputs – such as music – its output will be musical. At the time of its release, DeepMind showed that WaveNet could produce waveforms that sound like classical music. [15]

Content (voice) swapping

According to the June 2018 paper Disentangled Sequential Autoencoder, [16] DeepMind has successfully used WaveNet for audio and voice "content swapping": the network can swap the voice on an audio recording for another, pre-existing voice while maintaining the text and other features of the original recording. "We also experiment on audio sequence data. Our disentangled representation allows us to convert speaker identities into each other while conditioning on the content of the speech." (p. 5) "For audio, this allows us to convert a male speaker into a female speaker and vice versa [...]." (p. 1) According to the paper, a two-digit number of hours (c. 50 hours) of pre-existing speech recordings of both the source and the target voice must be fed into WaveNet for it to learn their individual features before it can convert one voice into another at satisfying quality. The authors stress that "[a]n advantage of the model is that it separates dynamical from static features [...]" (p. 8); that is, WaveNet distinguishes between, on the one hand, the spoken text and the mode of delivery (modulation, speed, pitch, mood, etc.), which are maintained during the conversion, and, on the other, the basic characteristics of the source and target voices, which are swapped.
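
A rough sketch of the separation the paper describes: one latent code captures static, speaker-level features while per-step latents capture the dynamic content, so swapping the static code between two utterances swaps the voice. The module below is an illustrative simplification with made-up layer sizes (a plain autoencoder, not the paper's variational model):

```python
import torch
from torch import nn

class DisentangledSeqAE(nn.Module):
    """Toy sequence autoencoder with a static (speaker) and dynamic (content) code."""
    def __init__(self, feat_dim=80, static_dim=16, dyn_dim=32, hidden=128):
        super().__init__()
        self.enc = nn.GRU(feat_dim, hidden, batch_first=True)
        self.to_static = nn.Linear(hidden, static_dim)   # one code per sequence
        self.to_dynamic = nn.Linear(hidden, dyn_dim)     # one code per time step
        self.dec = nn.GRU(static_dim + dyn_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)

    def encode(self, x):
        h_seq, h_last = self.enc(x)
        f = self.to_static(h_last[-1])      # static code: speaker identity
        z = self.to_dynamic(h_seq)          # dynamic codes: what is being said
        return f, z

    def decode(self, f, z):
        f_rep = f.unsqueeze(1).expand(-1, z.size(1), -1)
        h_seq, _ = self.dec(torch.cat([f_rep, z], dim=-1))
        return self.out(h_seq)

model = DisentangledSeqAE()
x_a = torch.randn(1, 100, 80)   # stand-in features for speaker A's utterance
x_b = torch.randn(1, 100, 80)   # stand-in features for speaker B's utterance
f_a, z_a = model.encode(x_a)
f_b, _ = model.encode(x_b)
converted = model.decode(f_b, z_a)  # speaker B's voice saying A's content
```

Training such a model to reconstruct its input, while forcing the static code to stay constant across a sequence, is what encourages the split between speaker identity and spoken content that the conversion relies on.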

The January 2019 follow-up paper Unsupervised Speech Representation Learning Using WaveNet Autoencoders [17] details a method for improving the automatic recognition of, and discrimination between, dynamic and static features for "content swapping", notably including swapping voices on existing audio recordings, in order to make it more reliable. Another follow-up paper, Sample Efficient Adaptive Text-to-Speech, [18] dated September 2018 (latest revision January 2019), states that DeepMind has successfully reduced the minimum amount of real-life recordings required to sample an existing voice via WaveNet to "merely a few minutes of audio data" while maintaining high-quality results.

WaveNet's ability to clone voices has raised ethical concerns about its potential to mimic the voices of living and dead persons. According to a 2016 BBC article, companies working on similar voice-cloning technologies (such as Adobe Voco) intend to insert watermarking inaudible to humans to prevent counterfeiting. They also maintain that voice cloning good enough for, say, entertainment-industry purposes is of far lower complexity, and uses different methods, than would be required to fool forensic analysis and electronic ID devices, so that natural voices and voices cloned for entertainment purposes could still be easily told apart by technological analysis. [19]

Applications

At the time of its release, DeepMind said that WaveNet required too much computational processing power to be used in real-world applications. [20] In October 2017, Google announced a 1,000-fold performance improvement along with better voice quality. WaveNet was then used to generate Google Assistant voices for US English and Japanese across all Google platforms. [21] In November 2017, DeepMind researchers released a research paper detailing a proposed method of "generating high-fidelity speech samples at more than 20 times faster than real-time", called "Probability Density Distillation". [22] At the annual I/O developer conference in May 2018, Google announced that new Google Assistant voices were available, made possible by WaveNet; by modelling the raw audio of the voice-actor samples, WaveNet greatly reduced the number of audio recordings required to create a voice model. [23]


References

  1. van den Oord, Aaron; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray (2016-09-12). "WaveNet: A Generative Model for Raw Audio". arXiv: 1609.03499 [cs.SD].
  2. Kahn, Jeremy (2016-09-09). "Google's DeepMind Achieves Speech-Generation Breakthrough". Bloomberg.com. Retrieved 2017-07-06.
  3. Meyer, David (2016-09-09). "Google's DeepMind Claims Massive Progress in Synthesized Speech". Fortune. Retrieved 2017-07-06.
  4. Kahn, Jeremy (2016-09-09). "Google's DeepMind Achieves Speech-Generation Breakthrough". Bloomberg.com. Retrieved 2017-07-06.
  5. Condliffe, Jamie (2016-09-09). "When this computer talks, you may actually want to listen". MIT Technology Review. Retrieved 2017-07-06.
  6. Hunt, A. J.; Black, A. W. (May 1996). "Unit selection in a concatenative speech synthesis system using a large speech database". 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings (PDF). Vol. 1. pp. 373–376. CiteSeerX 10.1.1.218.1335. doi:10.1109/ICASSP.1996.541110. ISBN 978-0-7803-3192-1. S2CID 14621185.
  7. Coldewey, Devin (2016-09-09). "Google's WaveNet uses neural nets to generate eerily convincing speech and music". TechCrunch. Retrieved 2017-07-06.
  8. van den Oord, Aäron; Dieleman, Sander; Zen, Heiga (2016-09-08). "WaveNet: A Generative Model for Raw Audio". DeepMind. Retrieved 2017-07-06.
  9. Zen, Heiga; Tokuda, Keiichi; Black, Alan W. (2009). "Statistical parametric speech synthesis". Speech Communication. 51 (11): 1039–1064. CiteSeerX 10.1.1.154.9874. doi:10.1016/j.specom.2009.04.004. S2CID 3232238.
  10. van den Oord, Aäron (2017-11-12). "High-fidelity speech synthesis with WaveNet". DeepMind. Retrieved 2022-06-05.
  11. van den Oord, Aaron; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray (2016-09-12). "WaveNet: A Generative Model for Raw Audio". arXiv: 1609.03499 [cs.SD].
  12. van den Oord, Aaron; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray (2016). "WaveNet: A Generative Model for Raw Audio". arXiv: 1609.03499 [cs.SD].
  13. Gershgorn, Dave (2016-09-09). "Are you sure you're talking to a human? Robots are starting to sound eerily lifelike". Quartz. Retrieved 2017-07-06.
  14. Coldewey, Devin (2016-09-09). "Google's WaveNet uses neural nets to generate eerily convincing speech and music". TechCrunch. Retrieved 2017-07-06.
  15. van den Oord, Aäron; Dieleman, Sander; Zen, Heiga (2016-09-08). "WaveNet: A Generative Model for Raw Audio". DeepMind. Retrieved 2017-07-06.
  16. Li, Yingzhen; Mandt, Stephan (2018). "Disentangled Sequential Autoencoder". arXiv: 1803.02991 [cs.LG].
  17. Chorowski, Jan; Weiss, Ron J.; Bengio, Samy; Van Den Oord, Aaron (2019). "Unsupervised Speech Representation Learning Using WaveNet Autoencoders". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 27 (12): 2041–2053. arXiv: 1901.08810. doi:10.1109/TASLP.2019.2938863.
  18. Chen, Yutian; Assael, Yannis; Shillingford, Brendan; Budden, David; Reed, Scott; Zen, Heiga; Wang, Quan; Cobo, Luis C.; Trask, Andrew; Laurie, Ben; Gulcehre, Caglar; Aäron van den Oord; Vinyals, Oriol; Nando de Freitas (2018). "Sample Efficient Adaptive Text-to-Speech". arXiv: 1809.10460 [cs.LG].
  19. "Adobe Voco 'Photoshop-for-voice' causes concern". BBC News. 2016-11-07.
  20. "Adobe Voco 'Photoshop-for-voice' causes concern". BBC News. 2016-11-07. Retrieved 2017-07-06.
  21. "WaveNet launches in the Google Assistant".
  22. Aaron van den Oord; Li, Yazhe; Babuschkin, Igor; Simonyan, Karen; Vinyals, Oriol; Kavukcuoglu, Koray; George van den Driessche; Lockhart, Edward; Cobo, Luis C.; Stimberg, Florian; Casagrande, Norman; Grewe, Dominik; Noury, Seb; Dieleman, Sander; Elsen, Erich; Kalchbrenner, Nal; Zen, Heiga; Graves, Alex; King, Helen; Walters, Tom; Belov, Dan; Hassabis, Demis (2017). "Parallel WaveNet: Fast High-Fidelity Speech Synthesis". arXiv: 1711.10433 [cs.LG].
  23. Martin, Taylor (May 9, 2018). "Try the all-new Google Assistant voices right now". CNET. Retrieved May 10, 2018.