WaveNet

WaveNet is a deep neural network for generating raw audio. It was created by researchers at the London-based artificial intelligence firm DeepMind. The technique, outlined in a paper published in September 2016, [1] can generate relatively realistic-sounding human-like voices by directly modelling waveforms with a neural network trained on recordings of real speech. Tests with US English and Mandarin reportedly showed that the system outperformed Google's best existing text-to-speech (TTS) systems, although as of 2016 its text-to-speech synthesis was still less convincing than actual human speech. [2] WaveNet's ability to generate raw waveforms means that it can model any kind of audio, including music. [3]

Neural network

A neural network is a network or circuit of neurons, or in a modern sense, an artificial neural network composed of artificial neurons or nodes. A neural network is therefore either a biological neural network, made up of real biological neurons, or an artificial neural network used to solve artificial intelligence (AI) problems. The connections of the biological neuron are modeled as weights: a positive weight reflects an excitatory connection, while negative values mean inhibitory connections. All inputs are modified by a weight and summed, an operation referred to as a linear combination. Finally, an activation function controls the amplitude of the output; for example, an acceptable range of output is usually between 0 and 1, or between −1 and 1.
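
As an illustration of the weighted sum and activation function just described, the following minimal NumPy sketch (the function and variable names are illustrative, not taken from any particular library) computes the output of a single artificial neuron:

```python
import numpy as np

def neuron_output(inputs, weights, bias):
    """Weighted sum of inputs followed by a sigmoid activation.

    Positive weights act as excitatory connections, negative weights as
    inhibitory ones; the sigmoid squashes the result into (0, 1).
    """
    linear_combination = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-linear_combination))

# Example: three inputs with mixed excitatory/inhibitory weights
x = np.array([0.5, 0.2, 0.9])
w = np.array([0.8, -0.4, 0.3])
print(neuron_output(x, w, bias=0.1))  # a value between 0 and 1
```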

DeepMind

DeepMind Technologies is a UK company founded in September 2010, currently owned by Alphabet Inc. The company is based in London, with research centres in Canada, France, and the United States.

WaveNet's ability to clone voices has raised ethical concerns about its potential to mimic the voices of living and dead persons. According to a 2016 BBC article, companies working on similar voice-cloning technologies (such as Adobe Voco) intend to insert watermarking inaudible to humans to prevent counterfeiting. They also maintain that voice cloning good enough for, say, entertainment-industry purposes is of far lower complexity, and uses different methods, than would be required to fool forensic evidencing methods and electronic ID devices, so that natural voices and voices cloned for entertainment purposes could still be easily told apart by technological analysis. [4]

Adobe Voco is a prototype audio editing and generation software by Adobe intended to enable novel editing and generation of audio. Dubbed "Photoshop-for-voice", it was first previewed at the Adobe MAX event in November 2016. The technology shown at Adobe MAX was a preview that could potentially be incorporated into Adobe Creative Cloud. As of September 19, 2019, Adobe had yet to release any further information about a potential release date.

History

Generating speech from text is an increasingly common task thanks to the popularity of software such as Apple's Siri, Microsoft's Cortana, Amazon Alexa and the Google Assistant. [5]

Siri

Siri is a virtual assistant that is part of Apple Inc.'s iOS, iPadOS, watchOS, macOS, tvOS and audioOS operating systems. The assistant uses voice queries and a natural-language user interface to answer questions, make recommendations, and perform actions by delegating requests to a set of Internet services. The software adapts to users' individual language usage, searches, and preferences with continued use. Returned results are individualized.

Cortana

Cortana is a virtual assistant created by Microsoft for Windows 10, Windows 10 Mobile, Windows Phone 8.1, Invoke smart speaker, Microsoft Band, Surface Headphones, Xbox One, iOS, Android, Windows Mixed Reality, and Amazon Alexa.

Amazon Alexa, known simply as Alexa, is a virtual assistant developed by Amazon, first used in the Amazon Echo and the Amazon Echo Dot smart speakers developed by Amazon Lab126. It is capable of voice interaction, music playback, making to-do lists, setting alarms, streaming podcasts, playing audiobooks, and providing weather, traffic, sports, and other real-time information, such as news. Alexa can also control several smart devices using itself as a home automation system. Users are able to extend the Alexa capabilities by installing "skills".

Most such systems use a variation of a technique that involves concatenating sound fragments together to form recognisable sounds and words. [6] The most common of these is called concatenative TTS. [7] It consists of a large library of speech fragments recorded from a single speaker, which are then concatenated to produce complete words and sounds. The result sounds unnatural, with an odd cadence and tone. [8] The reliance on a recorded library also makes it difficult to modify or change the voice. [9]
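
The following toy sketch illustrates the concatenative idea only; the fragment library and crossfade length are invented for illustration, and a real concatenative TTS system selects units from a far larger, linguistically annotated database:

```python
import numpy as np

# Hypothetical library of pre-recorded speech fragments (diphones,
# syllables, ...), each stored as a raw waveform at the same sample rate.
unit_library = {
    "he": np.random.randn(1600),     # stand-ins for real recordings
    "llo": np.random.randn(2400),
    "world": np.random.randn(4000),
}

def concatenative_tts(units, crossfade=160):
    """Join recorded units end to end with a short linear crossfade."""
    out = units[0].copy()
    for unit in units[1:]:
        fade = np.linspace(0.0, 1.0, crossfade)
        out[-crossfade:] = out[-crossfade:] * (1 - fade) + unit[:crossfade] * fade
        out = np.concatenate([out, unit[crossfade:]])
    return out

waveform = concatenative_tts(
    [unit_library["he"], unit_library["llo"], unit_library["world"]])
```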

Another technique, known as parametric TTS, [10] uses mathematical models to recreate sounds that are then assembled into words and sentences. The information required to generate the sounds is stored in the parameters of the model. The characteristics of the output speech are controlled via the inputs to the model, while the speech is typically created using a voice synthesiser known as a vocoder. This can also result in unnatural-sounding audio.
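
A toy source-filter sketch of the parametric idea is shown below; the pitch and formant parameters, and the single-resonator "vocoder", are deliberately simplistic stand-ins for the statistical models used in real parametric TTS:

```python
import numpy as np
from scipy.signal import lfilter

def parametric_vowel(f0=120.0, formant=700.0, bandwidth=130.0, sr=16000, dur=0.5):
    """Source-filter sketch: an impulse train at pitch f0 is passed
    through a single resonator centred on one formant frequency."""
    n = int(sr * dur)
    source = np.zeros(n)
    source[::int(sr / f0)] = 1.0                 # glottal-like impulse train
    r = np.exp(-np.pi * bandwidth / sr)          # resonator pole radius
    theta = 2 * np.pi * formant / sr             # resonator pole angle
    a = [1.0, -2 * r * np.cos(theta), r * r]     # two-pole resonator filter
    return lfilter([1.0], a, source)

audio = parametric_vowel()  # vowel-like buzz controlled entirely by parameters
```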

Vocoder

A vocoder is a category of voice codec that analyzes and synthesizes the human voice signal for audio data compression, multiplexing, voice encryption, voice transformation, etc.

Design and ongoing research

Background

WaveNet is a type of feedforward neural network known as a deep convolutional neural network (CNN). In WaveNet, the CNN takes a raw signal as an input and synthesises an output one sample at a time. It does so by sampling from a softmax (i.e. categorical) distribution of a signal value that is encoded using a μ-law companding transformation and quantized to 256 possible values. [11]
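
A sketch of the μ-law companding and 256-level quantization step described above is given below; the μ-law formula follows the WaveNet paper, while the helper names are illustrative:

```python
import numpy as np

def mu_law_encode(audio, quantization_channels=256):
    """Map audio in [-1, 1] to one of 256 integer classes via mu-law companding."""
    mu = quantization_channels - 1
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Shift from [-1, 1] to integer bins 0..255
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int32)

def mu_law_decode(labels, quantization_channels=256):
    """Invert the encoding back to an approximate signal in [-1, 1]."""
    mu = quantization_channels - 1
    companded = 2 * (labels.astype(np.float64) / mu) - 1
    return np.sign(companded) * (np.power(1 + mu, np.abs(companded)) - 1) / mu

x = np.sin(np.linspace(0, 2 * np.pi, 100))   # toy waveform
print(mu_law_decode(mu_law_encode(x))[:5])   # close to the original samples
```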

Feedforward neural network

A feedforward neural network is an artificial neural network wherein connections between the nodes do not form a cycle. As such, it is different from recurrent neural networks.

Convolutional neural network

In deep learning, a convolutional neural network is a class of deep neural networks, most commonly applied to analyzing visual imagery.

In mathematics, the softmax function, also known as softargmax or normalized exponential function, is a function that takes as input a vector of K real numbers and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative or greater than one, and might not sum to 1; but after applying softmax, each component will be in the interval (0, 1), and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities. Softmax is often used in neural networks to map the non-normalized output of a network to a probability distribution over predicted output classes.
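
A minimal NumPy sketch of the softmax function as just described, with the usual subtraction of the maximum for numerical stability:

```python
import numpy as np

def softmax(z):
    """Map a vector of K real numbers to K probabilities that sum to 1."""
    e = np.exp(z - np.max(z))   # subtracting the max avoids overflow
    return e / e.sum()

print(softmax(np.array([2.0, -1.0, 0.5])))  # non-negative values summing to 1
```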

Initial concept and results

According to the original September 2016 DeepMind research paper WaveNet: A Generative Model for Raw Audio [12], the network was fed real waveforms of speech in English and Mandarin. As these pass through the network, it learns a set of rules to describe how the audio waveform evolves over time. The trained network can then be used to create new speech-like waveforms at 16,000 samples per second. These waveforms include realistic breaths and lip smacks – but do not conform to any language. [13]
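
Schematically, this sample-by-sample generation can be pictured as the loop below; predict_distribution is a placeholder for a trained WaveNet-style model, not a real API, and the toy stand-in used here simply returns a uniform distribution over the 256 quantized values:

```python
import numpy as np

def generate(predict_distribution, seed, n_samples=16000, receptive_field=1024):
    """Autoregressive generation: one quantized sample at a time.

    predict_distribution(context) -> length-256 probability vector
    (a placeholder for a trained WaveNet-style model).
    """
    samples = list(seed)
    for _ in range(n_samples):
        context = np.array(samples[-receptive_field:])
        probs = predict_distribution(context)
        samples.append(np.random.choice(256, p=probs))  # draw the next sample
    return np.array(samples[len(seed):])

# Toy stand-in model: ignores the context and predicts a uniform distribution.
audio_labels = generate(lambda ctx: np.full(256, 1 / 256), seed=[128])
```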

WaveNet is able to accurately model different voices, with the accent and tone of the input correlating with the output. For example, if it is trained with German, it produces German speech. [14] The capability also means that if the WaveNet is fed other inputs – such as music – its output will be musical. At the time of its release, DeepMind showed that WaveNet could produce waveforms that sound like classical music. [15]

Content (voice) swapping

According to the June 2018 paper Disentangled Sequential Autoencoder [16], DeepMind has also successfully used WaveNet for "content swapping" of audio and voice content: the voice on any given audio recording can be swapped for any other pre-existing voice while maintaining the text and other features of the original recording. "We also experiment on audio sequence data. Our disentangled representation allows us to convert speaker identities into each other while conditioning on the content of the speech." (p. 5) "For audio, this allows us to convert a male speaker into a female speaker and vice versa [...]." (p. 1) According to the paper, several dozen hours (c. 50 hours) of pre-existing speech recordings of both the source and the target voice must be fed into WaveNet for the program to learn their individual features before it can convert one voice to another at satisfactory quality. The authors stress that "[a]n advantage of the model is that it separates dynamical from static features [...]" (p. 8), i.e. WaveNet can distinguish, on the one hand, the spoken text and modes of delivery (modulation, speed, pitch, mood, etc.) that must be preserved during the conversion and, on the other, the basic features of the source and target voices that it has to swap.
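
Schematically, the voice swap described in the paper amounts to recombining the "static" (speaker) representation of one recording with the "dynamic" (content) representation of another. The sketch below is a toy illustration of that recombination only; the encoder and decoder stand-ins are invented and bear no relation to the trained networks in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the trained encoders/decoder of a disentangled
# sequential autoencoder: one static (speaker) code per recording and
# one dynamic (content) code per time frame.
def encode_static(audio):
    return audio.reshape(-1, 64).mean(axis=0)      # fixed-size "voice" code

def encode_dynamic(audio):
    return audio.reshape(-1, 64)                   # per-frame "content" codes

def decode(static, dynamic):
    return (dynamic + static).reshape(-1)          # waveform-like output

def swap_voice(source_audio, target_voice_audio):
    """Keep the content of source_audio, apply the voice of target_voice_audio."""
    return decode(encode_static(target_voice_audio), encode_dynamic(source_audio))

swapped = swap_voice(rng.standard_normal(64 * 100), rng.standard_normal(64 * 50))
```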

The January 2019 follow-up paper Unsupervised speech representation learning using WaveNet autoencoders [17] details a method for improving the automatic recognition of, and discrimination between, dynamic and static features, making "content swapping" – notably including swapping voices on existing audio recordings – more reliable. Another follow-up paper, Sample Efficient Adaptive Text-to-Speech [18], dated September 2018 (latest revision January 2019), states that DeepMind has successfully reduced the minimum amount of real-life recordings required to sample an existing voice via WaveNet to "merely a few minutes of audio data" while maintaining high-quality results.

Lip-reading

The July 2018 paper Large-Scale Visual Speech Recognition (latest revision October 2018) [19] details efficient, automatic lip-reading from existing video recordings, whether or not audio is present, to be developed as a potential plugin for WaveNet called LipNet; it claims to outperform professional human lip-readers by a wide margin, with a significantly lower error rate.

Applications

At the time of its release, DeepMind said that WaveNet required too much computational processing power to be used in real-world applications. [20] In October 2017, Google announced a 1,000-fold performance improvement along with better voice quality. WaveNet was then used to generate Google Assistant voices for US English and Japanese across all Google platforms. [21] In November 2017, DeepMind researchers released a research paper detailing a proposed method of "generating high-fidelity speech samples at more than 20 times faster than real-time", called "Probability Density Distillation". [22] At Google's annual I/O developer conference in May 2018, it was announced that new Google Assistant voices made possible by WaveNet were available; WaveNet greatly reduced the number of audio recordings required to create a voice model by modeling the raw audio of the voice actor samples. [23]

Related Research Articles

Speech recognition is an interdisciplinary subfield of computational linguistics that develops methodologies and technologies enabling the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research from the linguistics, computer science, and electrical engineering fields.

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech computer or speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech.

Time stretching is the process of changing the speed or duration of an audio signal without affecting its pitch. Pitch scaling is the opposite: the process of changing the pitch without affecting the speed. Pitch shift is pitch scaling implemented in an effects unit and intended for live performance. Pitch control is a simpler process which affects pitch and speed simultaneously by slowing down or speeding up a recording.

Music technology (electronic and digital)

Digital music technology encompasses digital instruments, computers, electronic effects units, software, and digital audio equipment used by a performer, composer, sound engineer, DJ, or record producer to produce, perform or record music. The term refers to electronic devices, instruments, computer hardware, and software used in the performance, playback, recording, composition, mixing, analysis, and editing of music.

Wavetable synthesis is a sound synthesis technique used to create periodic waveforms. Often used in the production of musical tones or notes, it was developed by Wolfgang Palm of Palm Products GmbH (PPG) in the late 1970s and published in 1979, and has since been used as the primary synthesis method in synthesizers built by PPG and Waldorf Music and as an auxiliary synthesis method by Ensoniq and Access. It is currently used in software-based synthesizers for PCs and tablets, including apps offered by PPG and Waldorf, among others.

Unsupervised learning

Unsupervised learning is a type of self-organized Hebbian learning that helps find previously unknown patterns in a data set without pre-existing labels. It is also known as self-organization and allows modeling probability densities of given inputs. It is one of the three main categories of machine learning, along with supervised and reinforcement learning. Semi-supervised learning has also been described, and is a hybridization of supervised and unsupervised techniques.

Generative topographic map (GTM) is a machine learning method that is a probabilistic counterpart of the self-organizing map (SOM), is probably convergent and does not require a shrinking neighborhood or a decreasing step size. It is a generative model: the data is assumed to arise by first probabilistically picking a point in a low-dimensional space, mapping the point to the observed high-dimensional input space, then adding noise in that space. The parameters of the low-dimensional probability distribution, the smooth map and the noise are all learned from the training data using the expectation-maximization (EM) algorithm. GTM was introduced in 1996 in a paper by Christopher Bishop, Markus Svensen, and Christopher K. I. Williams.

Human image synthesis

Human image synthesis can be applied to make believable and even photorealistic renditions of human-likenesses, moving or still. This has effectively been the situation since the early 2000s. Many films using computer generated imagery have featured synthetic images of human-like characters digitally composited onto the real or other simulated film material. Towards the end of the 2010s deep learning artificial intelligence has been applied to synthesize images and video that look like humans, without need for human assistance, once the training phase has been completed, whereas the old school 7D-route required massive amounts of human work.

Autoencoder

An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal "noise". Along with the reduction side, a reconstructing side is learnt, where the autoencoder tries to generate from the reduced encoding a representation as close as possible to its original input, hence its name. Several variants of the basic model exist, with the aim of forcing the learned representations of the input to assume useful properties. Examples are regularized autoencoders, proven effective in learning representations for subsequent classification tasks, and variational autoencoders, with their recent applications as generative models. Autoencoders are effectively used to solve many applied problems, from face recognition to acquiring the semantic meaning of words.

Audio-visual speech recognition (AVSR) is a technique that uses image processing capabilities in lip reading to aid speech recognition systems in recognizing indeterminate phones or in deciding among near-equal probability alternatives.

An acoustic model is used in automatic speech recognition to represent the relationship between an audio signal and the phonemes or other linguistic units that make up speech. The model is learned from a set of audio recordings and their corresponding transcripts. It is created by taking audio recordings of speech, and their text transcriptions, and using software to create statistical representations of the sounds that make up each word.

Concatenative synthesis is a technique for synthesising sounds by concatenating short samples of recorded sound. The duration of the units is not strictly defined and may vary according to the implementation, roughly in the range of 10 milliseconds up to 1 second. It is used in speech synthesis and music sound synthesis to generate user-specified sequences of sound from a database built from recordings of other sequences.

Deep learning

Deep learning is part of a broader family of machine learning methods based on artificial neural networks. Learning can be supervised, semi-supervised or unsupervised.

Google Text-to-Speech

Google Text-to-Speech is a screen reader application developed by Google for its Android operating system. It powers applications to read aloud (speak) the text on the screen with support for many languages. Text-to-Speech may be used by apps such as Google Play Books for reading books aloud, by Google Translate for reading aloud translations providing useful insight to the pronunciation of words, by Google Talkback and other spoken feedback accessibility-based applications, as well as by third-party apps. Users must install voice data for each language.

Generative adversarial network

A generative adversarial network (GAN) is a class of machine learning systems invented by Ian Goodfellow and his colleagues in 2014. Two neural networks contest with each other in a game. Given a training set, this technique learns to generate new data with the same statistics as the training set. For example, a GAN trained on photographs can generate new photographs that look at least superficially authentic to human observers, having many realistic characteristics. Though originally proposed as a form of generative model for unsupervised learning, GANs have also proven useful for semi-supervised learning, fully supervised learning, and reinforcement learning. In a 2016 seminar, Yann LeCun described GANs as "the coolest idea in machine learning in the last twenty years".

Alex Graves is a research scientist at DeepMind. He did a BSc in Theoretical Physics at Edinburgh and obtained a PhD in AI under Jürgen Schmidhuber at IDSIA. He was also a postdoc at TU Munich and under Geoffrey Hinton at the University of Toronto.

Generative audio

Generative audio refers to the creation of audio files from large databases of audio clips, producing phrases and sentences that may never have actually been spoken. The technology differs from AI voices such as Apple's Siri and Amazon Alexa, which use a collection of fragments that are stitched together on demand.

References

  1. Oord, Aaron van den; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray (2016-09-12). "WaveNet: A Generative Model for Raw Audio". arXiv:1609.03499. Bibcode:2016arXiv160903499V.
  2. Kahn, Jeremy (2016-09-09). "Google's DeepMind Achieves Speech-Generation Breakthrough". Bloomberg.com. Retrieved 2017-07-06.
  3. Meyer, David (2016-09-09). "Google's DeepMind Claims Massive Progress in Synthesized Speech". Fortune. Retrieved 2017-07-06.
  4. Adobe Voco 'Photoshop-for-voice' causes concern, 7 November 2016, BBC
  5. Kahn, Jeremy (2016-09-09). "Google's DeepMind Achieves Speech-Generation Breakthrough". Bloomberg.com. Retrieved 2017-07-06.
  6. Condliffe, Jamie (2016-09-09). "When this computer talks, you may actually want to listen". MIT Technology Review. Retrieved 2017-07-06.
  7. Hunt, A. J.; Black, A. W. (May 1996). Unit selection in a concatenative speech synthesis system using a large speech database (PDF). 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings. 1. pp. 373–376. CiteSeerX 10.1.1.218.1335. doi:10.1109/ICASSP.1996.541110. ISBN 978-0-7803-3192-1.
  8. Coldewey, Devin (2016-09-09). "Google's WaveNet uses neural nets to generate eerily convincing speech and music". TechCrunch. Retrieved 2017-07-06.
  9. van den Oord, Aäron; Dieleman, Sander; Zen, Heiga (2016-09-08). "WaveNet: A Generative Model for Raw Audio". DeepMind. Retrieved 2017-07-06.
  10. Zen, Heiga; Tokuda, Keiichi; Black, Alan W. (2009). "Statistical parametric speech synthesis". Speech Communication. 51 (11): 1039–1064. CiteSeerX 10.1.1.154.9874. doi:10.1016/j.specom.2009.04.004.
  11. Oord, Aaron van den; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray (2016-09-12). "WaveNet: A Generative Model for Raw Audio". arXiv:1609.03499. Bibcode:2016arXiv160903499V.
  12. Oord et al. (2016). WaveNet: A Generative Model for Raw Audio (https://arxiv.org/pdf/1609.03499.pdf), Cornell University, 19 September 2016.
  13. Gershgorn, Dave (2016-09-09). "Are you sure you're talking to a human? Robots are starting to sound eerily lifelike". Quartz. Retrieved 2017-07-06.
  14. Coldewey, Devin (2016-09-09). "Google's WaveNet uses neural nets to generate eerily convincing speech and music". TechCrunch. Retrieved 2017-07-06.
  15. van den Oord, Aäron; Dieleman, Sander; Zen, Heiga (2016-09-08). "WaveNet: A Generative Model for Raw Audio". DeepMind. Retrieved 2017-07-06.
  16. Li & Mandt (2018). Disentangled Sequential Autoencoder, 12 June 2018, Cornell University
  17. Chorowski et al. (2019). Unsupervised speech representation learning using WaveNet autoencoders, 25 January 2019, Cornell University
  18. Chen et al. (2018). Sample Efficient Adaptive Text-to-Speech, 27 September 2018, Cornell University. Also see this paper's latest January 2019 revision.
  19. Shillingford et al. (2018). Large-Scale Visual Speech Recognition, 13 July 2018, Cornell University
  20. "Adobe Voco 'Photoshop-for-voice' causes concern". BBC News. 2016-11-07. Retrieved 2017-07-06.
  21. WaveNet launches in the Google Assistant
  22. Oord et al. (2017): Parallel WaveNet: Fast High-Fidelity Speech Synthesis, Cornell University, 28 November 2017
  23. Martin, Taylor (May 9, 2018). "Try the all-new Google Assistant voices right now". CNET. Retrieved May 10, 2018.