NSynth

NSynth: Neural Audio Synthesis
Original author(s): Google Brain, DeepMind, Magenta
Initial release: 6 April 2017 (2017-04-06)
Repository: github.com/magenta/magenta/tree/main/magenta/models/nsynth
Written in: Python
Type: Software synthesizer
License: Apache 2.0
Website: magenta.tensorflow.org/nsynth

NSynth (a portmanteau of "Neural Synthesis") is a WaveNet-based autoencoder for synthesizing audio, outlined in a paper in April 2017. [1]

Overview

The model generates sounds through neural network-based synthesis, employing a WaveNet-style autoencoder to learn its own temporal embeddings from four different sounds. [2] [3] Google then released an open-source hardware interface for the algorithm called NSynth Super, [4] used by notable musicians such as Grimes and YACHT to generate experimental music with artificial intelligence. [5] [6] The research and development of the algorithm was part of a collaboration between Google Brain, Magenta and DeepMind. [7]

Technology

Dataset

The NSynth dataset is composed of 305,979 one-shot instrumental notes, each with a unique pitch, timbre, and envelope, sampled from 1,006 instruments drawn from commercial sample libraries. [8] For each instrument the dataset contains four-second, 16 kHz audio snippets ranging over every pitch of a standard MIDI piano, each played at five different velocities. [9] The dataset is made available under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. [10]
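
Because each note is a fixed-length, labelled audio snippet, the collection can be loaded programmatically. The following is a minimal sketch that assumes the dataset's TensorFlow Datasets build ("nsynth") and the feature names listed in the public TFDS catalog ("audio", "pitch", "velocity"); it is illustrative and not part of the official NSynth code.

```python
# Minimal sketch: iterate over NSynth notes via TensorFlow Datasets.
# The builder name "nsynth" and the feature keys below are assumptions
# based on the public TFDS catalog, not on the cited sources.
import tensorflow_datasets as tfds

ds = tfds.load("nsynth", split="train")

for example in ds.take(1):
    audio = example["audio"]             # 4 s at 16 kHz -> 64,000 samples
    pitch = int(example["pitch"])        # MIDI pitch of the note
    velocity = int(example["velocity"])  # MIDI velocity (one of five levels)
    print(audio.shape, pitch, velocity)
```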

Machine learning model

A spectral autoencoder model and a WaveNet autoencoder model are publicly available on GitHub. [11] The baseline model uses a spectrogram with fft_size 1024 and hop_size 256, an MSE loss on the magnitudes, and the Griffin-Lim algorithm for reconstruction. The WaveNet model trains on mu-law encoded waveform chunks of 6144 samples and learns 16-dimensional embeddings that are downsampled by a factor of 512 in time. [12]
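
The following sketch mirrors those processing steps with librosa and NumPy rather than the actual Magenta code: the baseline's magnitude spectrogram and Griffin-Lim reconstruction, and an 8-bit mu-law encoding of a 6144-sample chunk. The file name "note.wav" is a hypothetical placeholder.

```python
# Sketch of the processing described above, using librosa and NumPy
# instead of the Magenta/NSynth implementation.
import numpy as np
import librosa

y, sr = librosa.load("note.wav", sr=16000)  # hypothetical 4-second NSynth note

# Baseline model: magnitude spectrogram (fft_size 1024, hop_size 256),
# then audio reconstruction with the Griffin-Lim algorithm.
mag = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))
y_reconstructed = librosa.griffinlim(mag, n_fft=1024, hop_length=256)

# WaveNet model input: mu-law encode a 6144-sample waveform chunk
# and quantize it to 256 levels.
def mu_law_encode(x, mu=255):
    x = np.clip(x, -1.0, 1.0)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

chunk = y[:6144]
quantized = np.rint((mu_law_encode(chunk) + 1) / 2 * 255).astype(np.int32)
```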

NSynth Super

NSynth Super
[Image: NSynth Super front panel]
Manufacturer: Google Brain, Google Creative Lab
Dates: 2018
Technical specifications
Synthesis type: Neural network sample-based synthesis
Input/output
Left-hand control: Pitch bend, ADSR
External control: MIDI

In 2018, Google released a hardware interface for the NSynth algorithm, called NSynth Super, designed to give musicians an accessible physical way to use the algorithm in their artistic production. [13] [14]

Design files, source code and internal components are released under the open-source Apache License 2.0, [15] enabling hobbyists and musicians to freely build and use the instrument. [16] At the core of the NSynth Super is a Raspberry Pi, extended with a custom printed circuit board to accommodate the interface elements. [17]
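
Because the instrument accepts external MIDI control (see the specifications above), it can be driven from a computer by sending note messages over a MIDI connection. The sketch below uses the third-party mido library; the output port name is a hypothetical placeholder, not taken from the NSynth Super documentation.

```python
# Minimal sketch: send MIDI notes to an external instrument such as the
# NSynth Super. The output port name is a hypothetical placeholder.
import time
import mido

with mido.open_output("NSynth Super MIDI 1") as port:
    for note in (60, 64, 67):  # play a C major arpeggio, one note at a time
        port.send(mido.Message("note_on", note=note, velocity=100))
        time.sleep(0.5)
        port.send(mido.Message("note_off", note=note, velocity=0))
```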

Influence

Despite not being publicly available as a commercial product, NSynth Super has been used by notable artists, including Grimes and YACHT. [18] [19]

Grimes reported using the instrument in her 2020 studio album Miss Anthropocene. [5]

YACHT reported making extensive use of NSynth Super on their album Chain Tripping. [20]

Claire L. Evans compared the potential influence of the instrument to the Roland TR-808. [21]

The NSynth Super design was honored with a D&AD Yellow Pencil award in 2018. [22]

Related Research Articles

An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data. An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction.

Music and artificial intelligence (AI) is the development of music software programs which use AI to generate music. As with applications in other fields, AI in music also simulates mental tasks. A prominent feature is the capability of an AI algorithm to learn based on past data, such as in computer accompaniment technology, wherein the AI is capable of listening to a human performer and performing accompaniment. Artificial intelligence also drives interactive composition technology, wherein a computer composes music in response to a live performance. There are other AI applications in music that cover not only music composition, production, and performance but also how music is marketed and consumed. Several music player programs have also been developed to use voice recognition and natural language processing technology for music voice control. Current research includes the application of AI in music composition, performance, theory and digital sound processing.

Deep learning: Branch of machine learning

Deep learning is a subset of machine learning methods based on neural networks with representation learning. The adjective "deep" refers to the use of multiple layers in the network. Methods used can be supervised, semi-supervised or unsupervised.

Feature learning: Set of learning techniques in machine learning

In machine learning, feature learning or representation learning is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task.

Feature engineering is a preprocessing step in supervised machine learning and statistical modeling which transforms raw data into a more effective set of inputs. Each input comprises several attributes, known as features. By providing models with relevant information, feature engineering significantly enhances their predictive accuracy and decision-making capability.

WaveNet is a deep neural network for generating raw audio. It was created by researchers at London-based AI firm DeepMind. The technique, outlined in a paper in September 2016, is able to generate relatively realistic-sounding human-like voices by directly modelling waveforms using a neural network method trained with recordings of real speech. Tests with US English and Mandarin reportedly showed that the system outperforms Google's best existing text-to-speech (TTS) systems, although as of 2016 its text-to-speech synthesis still was less convincing than actual human speech. WaveNet's ability to generate raw waveforms means that it can model any kind of audio, including music.

Neural architecture search (NAS) is a technique for automating the design of artificial neural networks (ANN), a widely used model in the field of machine learning. NAS has been used to design networks that are on par with or outperform hand-designed architectures. Methods for NAS can be categorized according to the search space, search strategy and performance estimation strategy used.

Neural style transfer: Type of software algorithm for image manipulation

Neural style transfer (NST) refers to a class of software algorithms that manipulate digital images, or videos, in order to adopt the appearance or visual style of another image. NST algorithms are characterized by their use of deep neural networks for the sake of image transformation. Common uses for NST are the creation of artificial artwork from photographs, for example by transferring the appearance of famous paintings to user-supplied photographs. Several notable mobile apps use NST techniques for this purpose, including DeepArt and Prisma. This method has been used by artists and designers around the globe to develop new artwork based on existent style(s).

Deep reinforcement learning is a subfield of machine learning that combines reinforcement learning (RL) and deep learning. RL considers the problem of a computational agent learning to make decisions by trial and error. Deep RL incorporates deep learning into the solution, allowing agents to make decisions from unstructured input data without manual engineering of the state space. Deep RL algorithms are able to take in very large inputs and decide what actions to perform to optimize an objective. Deep reinforcement learning has been used for a diverse set of applications including but not limited to robotics, video games, natural language processing, computer vision, education, transportation, finance and healthcare.

Timothy P. Lillicrap is a Canadian neuroscientist and AI researcher, adjunct professor at University College London, and staff research scientist at Google DeepMind, where he has been involved in the AlphaGo and AlphaZero projects mastering the games of Go, Chess and Shogi. His research focuses on machine learning and statistics for optimal control and decision making, as well as using these mathematical frameworks to understand how the brain learns. He has developed algorithms and approaches for exploiting deep neural networks in the context of reinforcement learning, and new recurrent memory architectures for one-shot learning.

An energy-based model (EBM), also called Canonical Ensemble Learning (CEL) or Learning via Canonical Ensemble (LCE), is an application of the canonical ensemble formulation of statistical physics to learning from data. The approach prominently appears in generative models (GMs).

An audio deepfake is a product of artificial intelligence used to create convincing speech sentences that sound like specific people saying things they did not say. This technology was initially developed for various applications to improve human life. For example, it can be used to produce audiobooks, and also to help people who have lost their voices to get them back. Commercially, it has opened the door to several opportunities. This technology can also create more personalized digital assistants and natural-sounding text-to-speech as well as speech translation services.

Self-supervised learning (SSL) is a paradigm in machine learning where a model is trained on a task using the data itself to generate supervisory signals, rather than relying on external labels provided by humans. In the context of neural networks, self-supervised learning aims to leverage inherent structures or relationships within the input data to create meaningful training signals. SSL tasks are designed so that solving them requires capturing essential features or relationships in the data. The input data is typically augmented or transformed in a way that creates pairs of related samples. One sample serves as the input, and the other is used to formulate the supervisory signal. This augmentation can involve introducing noise, cropping, rotation, or other transformations. Self-supervised learning more closely imitates the way humans learn to classify objects.

Vision transformer: Variant of Transformer designed for vision processing

A vision transformer (ViT) is a transformer designed for computer vision. A ViT breaks down an input image into a series of patches, serialises each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.

The Fashion MNIST dataset is a large freely available database of fashion images that is commonly used for training and testing various machine learning systems. Fashion-MNIST was intended to serve as a replacement for the original MNIST database for benchmarking machine learning algorithms, as it shares the same image size, data format and the structure of training and testing splits.

Deep learning speech synthesis refers to the application of deep learning models to generate natural-sounding human speech from written text (text-to-speech) or spectrum (vocoder). Deep neural networks (DNN) are trained using a large amount of recorded speech and, in the case of a text-to-speech system, the associated labels and/or input text.

Lyra is a lossy audio codec developed by Google that is designed for compressing speech at very low bitrates. Unlike most other audio formats, it compresses data using a machine learning-based algorithm.

Stable Diffusion: Image-generating machine learning model

Stable Diffusion is a deep learning, text-to-image model released in 2022 based on diffusion techniques. The generative artificial intelligence technology is the premier product of Stability AI and is considered to be a part of the ongoing artificial intelligence boom.

Text-to-image model: Machine learning model

A text-to-image model is a machine learning model which takes an input natural language description and produces an image matching that description.

References

  1. Engel, Jesse; Resnick, Cinjon; Roberts, Adam; Dieleman, Sander; Eck, Douglas; Simonyan, Karen; Norouzi, Mohammad (2017). "Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders". arXiv: 1704.01279 [cs.LG].
  2. Engel, Jesse; Resnick, Cinjon; Roberts, Adam; Dieleman, Sander; Eck, Douglas; Simonyan, Karen; Norouzi, Mohammad (2017). "Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders". research.google. arXiv: 1704.01279 .
  3. Aaron van den Oord; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray (2016). "WaveNet: A Generative Model for Raw Audio". arXiv: 1609.03499 [cs.SD].
  4. "Google's open-source neural synth is creating totally new sounds". Wired UK.
  5. "73 | Grimes (c) on Music, Creativity, and Digital Personae – Sean Carroll". www.preposterousuniverse.com.
  6. Mattise, Nathan (2019-08-31). "How YACHT fed their old music to the machine and got a killer new album". Ars Technica. Retrieved 2022-11-08.
  7. "NSynth: Neural Audio Synthesis". Magenta. 6 April 2017.
  8. "NSynth Dataset". Machine Learning Datasets. Retrieved 2022-11-08.
  9. Ramires, António; Serra, Xavier (2019). "Data Augmentation for Instrument Classification Robust to Audio Effects". arXiv: 1907.08520 [cs.SD].
  10. "The NSynth Dataset". tensorflow.org. 5 April 2017.
  11. "NSynth: Neural Audio Synthesis". GitHub.
  12. Engel, Jesse; Resnick, Cinjon; Roberts, Adam; Dieleman, Sander; Eck, Douglas; Simonyan, Karen; Norouzi, Mohammad (2017). "Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders". arXiv: 1704.01279 [cs.LG].
  13. "NSynth Super is an AI-backed touchscreen synth". The Verge. 13 March 2018.
  14. "Google built a musical instrument that uses AI and released the plans so you can make your own". CNBC. 13 March 2018.
  15. "googlecreativelab/open-nsynth-super". April 1, 2021 – via GitHub.
  16. "Open NSynth Super". hackaday.io. Retrieved 2022-11-08.
  17. "NSYNTH SUPER Hardware". GitHub.
  18. Mattise, Nathan. "How YACHT Used Machine Learning to Create Their New Album". Wired. ISSN 1059-1028. Retrieved 2023-01-19.
  19. "Cover Story: Grimes is ready to play the villain". Crack Magazine. Retrieved 2023-01-19.
  20. "What Machine-Learning Taught the Band YACHT About Themselves". Los Angeleno. 2019-09-18. Retrieved 2023-01-19.
  21. Music and Machine Learning (Google I/O'19), retrieved 2023-01-19.
  22. "NSynth Super | Google Creative Lab | Google | D&AD Awards 2018 Pencil Winner | Interactive Design for Products | D&AD". www.dandad.org. Retrieved 2023-01-19.
