WaveNet

WaveNet is a deep neural network for generating raw audio. It was created by researchers at London-based AI firm DeepMind. The technique, outlined in a paper in September 2016, [1] is able to generate relatively realistic-sounding human-like voices by directly modelling waveforms using a neural network method trained with recordings of real speech. Tests with US English and Mandarin reportedly showed that the system outperforms Google's best existing text-to-speech (TTS) systems, although as of 2016 its text-to-speech synthesis was still less convincing than actual human speech. [2] WaveNet's ability to generate raw waveforms means that it can model any kind of audio, including music. [3]

History

Generating speech from text is an increasingly common task thanks to the popularity of software such as Apple's Siri, Microsoft's Cortana, Amazon Alexa and the Google Assistant. [4]

Most such systems use a variation of a technique that involves concatenating sound fragments together to form recognisable sounds and words. [5] The most common of these is called concatenative TTS. [6] It consists of a large library of speech fragments recorded from a single speaker, which are then concatenated to produce complete words and sounds. The result sounds unnatural, with an odd cadence and tone. [7] The reliance on a recorded library also makes it difficult to modify or change the voice. [8]

Another technique, known as parametric TTS, [9] uses mathematical models to recreate sounds that are then assembled into words and sentences. The information required to generate the sounds is stored in the parameters of the model. The characteristics of the output speech are controlled via the inputs to the model, while the speech is typically created using a voice synthesiser known as a vocoder. This can also result in unnatural-sounding audio.
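The parametric approach can be illustrated with a toy source-filter sketch in Python. This is not any production vocoder; the pitch and formant values below are arbitrary assumptions chosen only to show how a small set of parameters controls the generated sound.

    import numpy as np
    from scipy.signal import lfilter

    fs = 16000                            # sample rate in Hz
    f0 = 120                              # pitch parameter (fundamental frequency)
    formants = [(700, 130), (1200, 70)]   # (centre frequency, bandwidth) parameters

    # Source: an impulse train at the fundamental frequency (one second of audio).
    n = np.arange(fs)
    source = ((n % (fs // f0)) == 0).astype(float)

    # Filter: a cascade of two-pole resonators, one per formant.
    signal = source
    for freq, bw in formants:
        r = np.exp(-np.pi * bw / fs)
        theta = 2 * np.pi * freq / fs
        signal = lfilter([1.0], [1.0, -2 * r * np.cos(theta), r * r], signal)

    signal /= np.max(np.abs(signal))      # normalise to [-1, 1] for playback

A real parametric system predicts such parameters from the input text and passes them to a vocoder; the simplifications in that model are one source of the unnaturalness noted above.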

Design and ongoing research

Background

A stack of dilated causal convolutional layers

WaveNet is a type of feedforward neural network known as a deep convolutional neural network (CNN). In WaveNet, the CNN takes a raw signal as an input and synthesises an output one sample at a time. It does so by sampling from a softmax (i.e. categorical) distribution over a signal value that is encoded using a μ-law companding transformation and quantized to 256 possible values. [11]
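The companding and quantization step can be sketched in a few lines of NumPy. The exact bin-mapping convention used here is one common choice rather than a detail fixed by the paper, and the function names are illustrative only.

    import numpy as np

    def mu_law_encode(audio, quantization_channels=256):
        """Compand with f(x) = sign(x) * ln(1 + mu*|x|) / ln(1 + mu), then map
        to integers 0..255. `audio` is a float array scaled to [-1, 1]; the
        resulting integer codes are the classes of the 256-way softmax."""
        mu = quantization_channels - 1
        companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
        # Map [-1, 1] to integer bins 0..255 (one of several equivalent conventions).
        return ((companded + 1) / 2 * mu + 0.5).astype(np.int32)

    def mu_law_decode(codes, quantization_channels=256):
        """Invert the mapping back to approximate float samples."""
        mu = quantization_channels - 1
        companded = 2 * (codes.astype(np.float64) / mu) - 1
        return np.sign(companded) * np.expm1(np.abs(companded) * np.log1p(mu)) / mu

    # Example: a 16 kHz sine wave round-tripped through the 256-level encoding.
    t = np.linspace(0, 1, 16000, endpoint=False)
    x = 0.5 * np.sin(2 * np.pi * 440 * t)
    codes = mu_law_encode(x)          # integers in [0, 255]
    x_hat = mu_law_decode(codes)      # close to x, up to coarse quantization error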

Initial concept and results

According to the original September 2016 DeepMind research paper WaveNet: A Generative Model for Raw Audio, [12] the network was fed real waveforms of speech in English and Mandarin. As these passed through the network, it learned a set of rules describing how the audio waveform evolves over time. The trained network can then be used to create new speech-like waveforms at 16,000 samples per second. These waveforms include realistic breaths and lip smacks – but do not conform to any language. [13]
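Generation itself is autoregressive: each new sample is drawn from the 256-way softmax distribution conditioned on the samples generated so far. The schematic loop below only illustrates this idea; predict_distribution is a hypothetical stand-in for a trained WaveNet, and the receptive-field length is an arbitrary placeholder.

    import numpy as np

    def generate(predict_distribution, seed, num_samples=16000, receptive_field=1024):
        """Draw samples one at a time from a 256-way categorical distribution
        conditioned on previously generated mu-law codes."""
        rng = np.random.default_rng(0)
        samples = list(seed)
        for _ in range(num_samples):
            context = np.asarray(samples[-receptive_field:])
            probs = predict_distribution(context)   # shape (256,), sums to 1
            samples.append(rng.choice(256, p=probs)) # sample, not argmax
        return np.asarray(samples[len(seed):])       # 1 second of codes at 16 kHz

    # Stand-in "model" for illustration: a uniform distribution over the 256 codes,
    # which simply yields quantized white noise.
    white_noise_codes = generate(lambda ctx: np.full(256, 1.0 / 256), seed=[128])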

WaveNet is able to accurately model different voices, with the accent and tone of the input reflected in the output. For example, if it is trained on German speech, it produces German speech. [14] This capability also means that if WaveNet is fed other inputs, such as music, its output will be musical. At the time of its release, DeepMind showed that WaveNet could produce waveforms that sound like classical music. [15]

Content (voice) swapping

According to the June 2018 paper Disentangled Sequential Autoencoder, [16] DeepMind has successfully used WaveNet for audio and voice "content swapping": the network can swap the voice on an audio recording for another, pre-existing voice while maintaining the text and other features from the original recording. "We also experiment on audio sequence data. Our disentangled representation allows us to convert speaker identities into each other while conditioning on the content of the speech." (p. 5) "For audio, this allows us to convert a male speaker into a female speaker and vice versa [...]." (p. 1) According to the paper, tens of hours (roughly 50 hours) of pre-existing speech recordings of both the source and the target voice must be fed into WaveNet for the program to learn their individual features before it can convert one voice into another at a satisfying quality. The authors stress that "[a]n advantage of the model is that it separates dynamical from static features [...]" (p. 8); that is, WaveNet distinguishes between the spoken text and the manner of delivery (modulation, speed, pitch, mood, etc.), which are preserved during the conversion, and the underlying characteristics of the source and target voices, which are swapped.
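Schematically, this disentanglement reduces voice swapping to recombining latent codes from two recordings. The sketch below is purely illustrative; encoder and decoder stand in for trained networks and are not taken from the paper's implementation.

    def swap_voice(encoder, decoder, source_audio, target_speaker_audio):
        """Re-synthesise source_audio in the voice heard in target_speaker_audio."""
        # Dynamic features: spoken content, timing and prosody of the source recording.
        content, _source_identity = encoder(source_audio)
        # Static features: the voice identity of the target speaker.
        _unused_content, target_identity = encoder(target_speaker_audio)
        # Decode the source content with the target speaker's static code.
        return decoder(content, target_identity)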

The January 2019 follow-up paper Unsupervised speech representation learning using WaveNet autoencoders [17] details a method for improving the automatic recognition and separation of dynamic and static features, making "content swapping" (notably the swapping of voices on existing audio recordings) more reliable. Another follow-up paper, Sample Efficient Adaptive Text-to-Speech, [18] dated September 2018 (latest revision January 2019), states that DeepMind successfully reduced the minimum amount of real-life recording required to sample an existing voice via WaveNet to "merely a few minutes of audio data" while maintaining high-quality results.

WaveNet's ability to clone voices has raised ethical concerns about its potential to mimic the voices of living and dead persons. According to a 2016 BBC article, companies working on similar voice-cloning technologies (such as Adobe Voco) intend to insert watermarking inaudible to humans to prevent counterfeiting. These companies also maintain that voice cloning good enough for entertainment purposes is far less complex, and relies on different methods, than what would be required to fool forensic analysis and electronic ID devices, so that natural voices and voices cloned for entertainment purposes could still be told apart by technological analysis. [19]

Applications

At the time of its release, DeepMind said that WaveNet required too much computational processing power to be used in real-world applications. [20] In October 2017, Google announced a 1,000-fold performance improvement along with better voice quality. WaveNet was then used to generate Google Assistant voices for US English and Japanese across all Google platforms. [21] In November 2017, DeepMind researchers released a research paper detailing a proposed method, called "Probability Density Distillation", for "generating high-fidelity speech samples at more than 20 times faster than real-time". [22] At the annual I/O developer conference in May 2018, Google announced that new Assistant voices made possible by WaveNet were available; by modelling the raw audio of the voice-actor samples, WaveNet greatly reduced the number of audio recordings required to create a voice model. [23]

Related Research Articles

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.

In statistical classification, two main approaches are called the generative approach and the discriminative approach. These compute classifiers by different approaches, differing in the degree of statistical modelling. Terminology is inconsistent, but three major types can be distinguished, following Jebara (2004):

  1. A generative model is a statistical model of the joint probability distribution on a given observable variable X and target variable Y;
  2. A discriminative model is a model of the conditional probability of the target Y, given an observation x; and
  3. Classifiers computed without using a probability model are also referred to loosely as "discriminative".

Packet loss concealment (PLC) is a technique to mask the effects of packet loss in voice over IP (VoIP) communications. When the voice signal is sent as VoIP packets on an IP network, the packets may travel different routes. A packet therefore might arrive very late, might be corrupted, or simply might not arrive at all. One example of the last situation is a packet being rejected by a server whose buffer is full and which cannot accept any more data. Other cases include network congestion resulting in significant delay. In a VoIP connection, error-control techniques such as automatic repeat request (ARQ) are not feasible and the receiver should be able to cope with packet loss. Packet loss concealment is the inclusion in a design of methodologies for accounting for and compensating for the loss of voice packets.

An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data. An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction.

Artificial intelligence and music (AIM) is a common subject in the International Computer Music Conference, the Computing Society Conference and the International Joint Conference on Artificial Intelligence. The first International Computer Music Conference (ICMC) was held in 1974 at Michigan State University. Current research includes the application of AI in music composition, performance, theory and digital sound processing.

<span class="mw-page-title-main">Deep learning</span> Branch of machine learning

Deep learning is the subset of machine learning methods based on artificial neural networks (ANNs) with representation learning. The adjective "deep" refers to the use of multiple layers in the network. Methods used can be either supervised, semi-supervised or unsupervised.

<span class="mw-page-title-main">Feature learning</span> Set of learning techniques in machine learning

In machine learning, feature learning or representation learning is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task.

Google Brain was a deep learning artificial intelligence research team under the umbrella of Google AI, a research division at Google dedicated to artificial intelligence. Formed in 2011, Google Brain combined open-ended machine learning research with information systems and large-scale computing resources. The team created tools such as TensorFlow, which allow neural networks to be used by the public, and pursued multiple internal AI research projects. It aimed to create research opportunities in machine learning and natural language processing. The team was merged into former Google sister company DeepMind to form Google DeepMind in April 2023.

<span class="mw-page-title-main">Google DeepMind</span> Artificial intelligence division

DeepMind Technologies Limited, doing business as Google DeepMind, is a British-American artificial intelligence research laboratory which serves as a subsidiary of Google. Founded in the UK in 2010, it was acquired by Google in 2014. The company is based in London, with research centres in Canada, France, Germany and the United States.

<span class="mw-page-title-main">Speech Recognition & Synthesis</span> Screen reader application by Google

Speech Recognition & Synthesis, formerly known as Speech Services, is a screen reader application developed by Google for its Android operating system. It powers applications to read aloud (speak) the text on the screen, with support for many languages. Text-to-Speech may be used by apps such as Google Play Books for reading books aloud, Google Translate for reading aloud translations for the pronunciation of words, Google TalkBack, and other spoken feedback accessibility-based applications, as well as by third-party apps. Users must install voice data for each language.

Tygem is an internet go server owned by South Korean company TongYang Online. Popular in Asia, their website states that over 500 professional Go players use their service.

Aja Huang is a Taiwanese computer scientist and expert on artificial intelligence. He works for DeepMind and was a member of the AlphaGo project.

Synthetic media is a catch-all term for the artificial production, manipulation, and modification of data and media by automated means, especially through the use of artificial intelligence algorithms, such as for the purpose of misleading people or changing an original meaning. Synthetic media as a field has grown rapidly since the creation of generative adversarial networks, primarily through the rise of deepfakes as well as music synthesis, text generation, human image synthesis, speech synthesis, and more. Though experts use the term "synthetic media," individual methods such as deepfakes and text synthesis are sometimes not referred to as such by the media but instead by their respective terminology. Significant attention arose towards the field of synthetic media starting in 2017 when Motherboard reported on the emergence of AI-altered pornographic videos to insert the faces of famous actresses. Potential hazards of synthetic media include the spread of misinformation, further loss of trust in institutions such as media and government, the mass automation of creative and journalistic jobs and a retreat into AI-generated fantasy worlds. Synthetic media is an applied form of artificial imagination.

Timothy P. Lillicrap is a Canadian neuroscientist and AI researcher, adjunct professor at University College London, and staff research scientist at Google DeepMind, where he has been involved in the AlphaGo and AlphaZero projects mastering the games of Go, Chess and Shogi. His research focuses on machine learning and statistics for optimal control and decision making, as well as using these mathematical frameworks to understand how the brain learns. He has developed algorithms and approaches for exploiting deep neural networks in the context of reinforcement learning, and new recurrent memory architectures for one-shot learning.

<span class="mw-page-title-main">15.ai</span> Real-time text-to-speech tool using artificial intelligence

15.ai is a non-commercial freeware artificial intelligence web application that generates natural emotive high-fidelity text-to-speech voices from an assortment of fictional characters from a variety of media sources. Developed by a pseudonymous MIT researcher under the name 15, the project uses a combination of audio synthesis algorithms, speech synthesis deep neural networks, and sentiment analysis models to generate and serve emotive character voices faster than real-time, particularly those with a very small amount of trainable data.

An audio deepfake is a type of artificial intelligence used to create convincing speech sentences that sound like specific people saying things they did not say. This technology was initially developed for various applications to improve human life. For example, it can be used to produce audiobooks, and also to help people who have lost their voices to get them back. Commercially, it has opened the door to several opportunities. This technology can also create more personalized digital assistants and natural-sounding text-to-speech as well as speech translation services.

Deep learning speech synthesis refers to the application of deep learning models to generate natural-sounding human speech from written text (text-to-speech) or spectrum (vocoder). Deep neural networks (DNN) are trained using a large amount of recorded speech and, in the case of a text-to-speech system, the associated labels and/or input text.

Lyra is a lossy audio codec developed by Google that is designed for compressing speech at very low bitrates. Unlike most other audio formats, it compresses data using a machine learning-based algorithm.

Oriol Vinyals is a Spanish machine learning researcher at DeepMind, where he is a principal research scientist. His research at DeepMind has regularly been featured in the mainstream media, especially after the company was acquired by Google.

NSynth is a WaveNet-based autoencoder for synthesizing audio, outlined in a paper in April 2017.

References

  1. van den Oord, Aaron; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray (2016-09-12). "WaveNet: A Generative Model for Raw Audio". arXiv: 1609.03499 [cs.SD].
  2. Kahn, Jeremy (2016-09-09). "Google's DeepMind Achieves Speech-Generation Breakthrough". Bloomberg.com. Retrieved 2017-07-06.
  3. Meyer, David (2016-09-09). "Google's DeepMind Claims Massive Progress in Synthesized Speech". Fortune. Retrieved 2017-07-06.
  4. Kahn, Jeremy (2016-09-09). "Google's DeepMind Achieves Speech-Generation Breakthrough". Bloomberg.com. Retrieved 2017-07-06.
  5. Condliffe, Jamie (2016-09-09). "When this computer talks, you may actually want to listen". MIT Technology Review. Retrieved 2017-07-06.
  6. Hunt, A. J.; Black, A. W. (May 1996). "Unit selection in a concatenative speech synthesis system using a large speech database". 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings (PDF). Vol. 1. pp. 373–376. CiteSeerX 10.1.1.218.1335. doi:10.1109/ICASSP.1996.541110. ISBN 978-0-7803-3192-1. S2CID 14621185.
  7. Coldewey, Devin (2016-09-09). "Google's WaveNet uses neural nets to generate eerily convincing speech and music". TechCrunch. Retrieved 2017-07-06.
  8. van den Oord, Aäron; Dieleman, Sander; Zen, Heiga (2016-09-08). "WaveNet: A Generative Model for Raw Audio". DeepMind. Retrieved 2017-07-06.
  9. Zen, Heiga; Tokuda, Keiichi; Black, Alan W. (2009). "Statistical parametric speech synthesis". Speech Communication. 51 (11): 1039–1064. CiteSeerX 10.1.1.154.9874. doi:10.1016/j.specom.2009.04.004. S2CID 3232238.
  10. van den Oord, Aäron (2017-11-12). "High-fidelity speech synthesis with WaveNet". DeepMind. Retrieved 2022-06-05.
  11. Oord, Aaron van den; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray (2016-09-12). "WaveNet: A Generative Model for Raw Audio". arXiv: 1609.03499 [cs.SD].
  12. Aaron van den Oord; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray (2016). "WaveNet: A Generative Model for Raw Audio". arXiv: 1609.03499 [cs.SD].
  13. Gershgorn, Dave (2016-09-09). "Are you sure you're talking to a human? Robots are starting to sound eerily lifelike". Quartz. Retrieved 2017-07-06.
  14. Coldewey, Devin (2016-09-09). "Google's WaveNet uses neural nets to generate eerily convincing speech and music". TechCrunch. Retrieved 2017-07-06.
  15. van den Oord, Aäron; Dieleman, Sander; Zen, Heiga (2016-09-08). "WaveNet: A Generative Model for Raw Audio". DeepMind. Retrieved 2017-07-06.
  16. Li, Yingzhen; Mandt, Stephan (2018). "Disentangled Sequential Autoencoder". arXiv: 1803.02991 [cs.LG].
  17. Chorowski, Jan; Weiss, Ron J.; Bengio, Samy; Van Den Oord, Aaron (2019). "Unsupervised Speech Representation Learning Using WaveNet Autoencoders". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 27 (12): 2041–2053. arXiv: 1901.08810. doi:10.1109/TASLP.2019.2938863.
  18. Chen, Yutian; Assael, Yannis; Shillingford, Brendan; Budden, David; Reed, Scott; Zen, Heiga; Wang, Quan; Cobo, Luis C.; Trask, Andrew; Laurie, Ben; Gulcehre, Caglar; Aäron van den Oord; Vinyals, Oriol; Nando de Freitas (2018). "Sample Efficient Adaptive Text-to-Speech". arXiv: 1809.10460 [cs.LG].
  19. "Adobe Voco 'Photoshop-for-voice' causes concern". BBC News. 2016-11-07.
  20. "Adobe Voco 'Photoshop-for-voice' causes concern". BBC News. 2016-11-07. Retrieved 2017-07-06.
  21. "WaveNet launches in the Google Assistant". DeepMind.
  22. Aaron van den Oord; Li, Yazhe; Babuschkin, Igor; Simonyan, Karen; Vinyals, Oriol; Kavukcuoglu, Koray; George van den Driessche; Lockhart, Edward; Cobo, Luis C.; Stimberg, Florian; Casagrande, Norman; Grewe, Dominik; Noury, Seb; Dieleman, Sander; Elsen, Erich; Kalchbrenner, Nal; Zen, Heiga; Graves, Alex; King, Helen; Walters, Tom; Belov, Dan; Hassabis, Demis (2017). "Parallel WaveNet: Fast High-Fidelity Speech Synthesis". arXiv: 1711.10433 [cs.LG].
  23. Martin, Taylor (May 9, 2018). "Try the all-new Google Assistant voices right now". CNET. Retrieved May 10, 2018.