Riffusion

Riffusion
Developer(s)
  • Seth Forsgren
  • Hayk Martiros
Initial release December 15, 2022
Repository github.com/hmartiro/riffusion-inference
Written in Python
Type Text-to-image model
License MIT License
Website riffusion.com
Generated spectrogram from the prompt "bossa nova with electric guitar" (top), and the resulting audio after conversion (bottom)

Riffusion is a neural network, designed by Seth Forsgren and Hayk Martiros, that generates music using images of sound rather than audio. [1] It was created by fine-tuning Stable Diffusion, an existing open-source model for generating images from text prompts, on spectrograms. [1] The resulting model turns text prompts into spectrogram images, which can be put through an inverse Fourier transform and converted into audio files. [2] While these clips are only several seconds long, the model can also interpolate between outputs in latent space to blend different clips together. [1] [3] This is accomplished using a functionality of the Stable Diffusion model known as img2img. [4]
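The conversion and blending steps can be illustrated with a short sketch. The code below is not Riffusion's implementation: it assumes a greyscale, mel-scaled spectrogram image with an 80 dB dynamic range, uses Griffin-Lim phase reconstruction from the librosa library to recover a waveform, and adds a generic spherical interpolation ("slerp") of the kind used to blend two latent vectors. All function names and parameters here are illustrative assumptions.

    import numpy as np
    from PIL import Image
    import librosa
    import soundfile as sf

    def spectrogram_image_to_audio(path, sr=44100, n_fft=2048, hop_length=512):
        # Load the generated image as a greyscale magnitude spectrogram
        # (rows = frequency bins, columns = time frames).
        img = np.array(Image.open(path).convert("L"), dtype=np.float32)
        img = img[::-1]  # spectrogram images usually draw low frequencies at the bottom
        # Map pixel values (0-255) back to decibels, assuming an 80 dB dynamic
        # range, then to power.
        power = librosa.db_to_power(img / 255.0 * 80.0 - 80.0)
        # Griffin-Lim estimates the phase information the image does not carry.
        audio = librosa.feature.inverse.mel_to_audio(
            power, sr=sr, n_fft=n_fft, hop_length=hop_length
        )
        sf.write("clip.wav", audio, sr)
        return audio

    def slerp(z0, z1, t):
        # Spherical interpolation between two latent vectors;
        # t=0 returns z0, t=1 returns z1. Assumes z0 and z1 are not (anti-)parallel.
        cos_omega = np.dot(z0.ravel(), z1.ravel()) / (np.linalg.norm(z0) * np.linalg.norm(z1))
        omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
        return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

Per the sources above, the actual clip-to-clip blending in Riffusion relies on Stable Diffusion's img2img functionality applied to the interpolated latents rather than a standalone routine like this one.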

The resulting music has been described as "de otro mundo" (otherworldly), [5] although unlikely to replace man-made music. [5] The model was made available on December 15, 2022, with the code also freely available on GitHub. [2] It is one of many models derived from Stable Diffusion. [4]

Riffusion is one of a number of AI text-to-music generators. In December 2022, Mubert [6] similarly used Stable Diffusion to turn descriptive text into music loops. In January 2023, Google published a paper on their own text-to-music generator called MusicLM. [7] [8]

Related Research Articles

<span class="mw-page-title-main">Music and artificial intelligence</span> Common subject in the International Computer Music Conference

Artificial intelligence and music (AIM) is a common subject in the International Computer Music Conference, the Computing Society Conference and the International Joint Conference on Artificial Intelligence. The first International Computer Music Conference (ICMC) was held in 1974 at Michigan State University. Current research includes the application of AI in music composition, performance, theory and digital sound processing.

<span class="mw-page-title-main">Artificial intelligence art</span> Machine application of knowledge of human aesthetic expressions

Artificial intelligence art is any visual artwork created through the use of artificial intelligence (AI) programs.

Synthetic media is a catch-all term for the artificial production, manipulation, and modification of data and media by automated means, especially through the use of artificial intelligence algorithms, such as for the purpose of misleading people or changing an original meaning. Synthetic media as a field has grown rapidly since the creation of generative adversarial networks, primarily through the rise of deepfakes as well as music synthesis, text generation, human image synthesis, speech synthesis, and more. Though experts use the term "synthetic media," individual methods such as deepfakes and text synthesis are sometimes not referred to as such by the media, but instead by their respective terminology. Significant attention arose towards the field of synthetic media starting in 2017 when Motherboard reported on the emergence of AI-altered pornographic videos that insert the faces of famous actresses. Potential hazards of synthetic media include the spread of misinformation, further loss of trust in institutions such as media and government, the mass automation of creative and journalistic jobs, and a retreat into AI-generated fantasy worlds. Synthetic media is an applied form of artificial imagination.

<span class="mw-page-title-main">15.ai</span> Real-time text-to-speech tool using artificial intelligence

15.ai is a non-commercial freeware artificial intelligence web application that generates natural emotive high-fidelity text-to-speech voices from an assortment of fictional characters from a variety of media sources. Developed by a pseudonymous MIT researcher under the name 15, the project uses a combination of audio synthesis algorithms, speech synthesis deep neural networks, and sentiment analysis models to generate and serve emotive character voices faster than real-time, particularly those with a very small amount of trainable data.

<span class="mw-page-title-main">DALL-E</span> Image-generating deep-learning model

DALL·E, DALL·E 2, and DALL·E 3 are text-to-image models developed by OpenAI using deep learning methodologies to generate digital images from natural language descriptions, called "prompts". The original DALL·E was revealed by OpenAI in a blog post on January 5, 2021, and uses a version of GPT-3 modified to generate images. On April 6, 2022, OpenAI announced DALL·E 2, a successor designed to generate more realistic images at higher resolutions that "can combine concepts, attributes, and styles". In September 2023, OpenAI announced their latest image model, DALL·E 3, capable of understanding "significantly more nuance and detail" than previous iterations.

Prompt engineering is the process of structuring text that can be interpreted and understood by a generative AI model. A prompt is natural language text describing the task that an AI should perform.

<span class="mw-page-title-main">You.com</span> Search engine

You.com is a search engine that allows users to personalize their results by upvoting, downvoting, or blocking apps. It provides additional products, including a chatbot called YouChat, a writing tool called YouWrite, and an AI image generator called YouImagine, which utilizes models such as Stable Diffusion and Open Journey.

<span class="mw-page-title-main">Midjourney</span> Image-generating machine learning model

Midjourney is a generative artificial intelligence program and service created and hosted by San Francisco–based independent research lab Midjourney, Inc. Midjourney generates images from natural language descriptions, called prompts, similar to OpenAI's DALL-E and Stability AI's Stable Diffusion.

<span class="mw-page-title-main">Stable Diffusion</span> Image-generating machine learning model

Stable Diffusion is a deep learning, text-to-image model released in 2022 based on diffusion techniques. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt. It was developed by researchers from the CompVis Group at Ludwig Maximilian University of Munich and Runway with a compute donation by Stability AI and training data from non-profit organizations.

<span class="mw-page-title-main">Loab</span> Image found using AI text-to-image software

Loab is a fictional character that artist and writer Steph Maj Swanson has claimed to have discovered with an unspecified text-to-image AI model in April 2022. In a viral Twitter thread, Swanson described it as an unexpectedly emergent property of the software, saying they discovered it when asking the model to produce something "as different from the prompt as possible".

<span class="mw-page-title-main">Text-to-image model</span> Machine learning model

A text-to-image model is a machine learning model which takes an input natural language description and produces an image matching that description. Such models began to be developed in the mid-2010s as a result of advances in deep neural networks. In 2022, the output of state-of-the-art text-to-image models, such as OpenAI's DALL-E 2, Google Brain's Imagen, Stability AI's Stable Diffusion, and Midjourney, began to approach the quality of real photographs and human-drawn art.

<span class="mw-page-title-main">LAION</span> Non-profit German artificial intelligence organization

LAION is a German non-profit which makes open-source artificial intelligence models and datasets. It is best known for releasing a number of large datasets of images and captions scraped from the web, which have been used to train high-profile text-to-image models including Stable Diffusion and Imagen.

<span class="mw-page-title-main">NovelAI</span> Online service for AI media creation

NovelAI is an online cloud-based, SaaS model, paid subscription service for AI-assisted storywriting and text-to-image synthesis, originally launched in beta on June 15, 2021, with the image generation feature later implemented on October 3, 2022. NovelAI is owned and operated by Anlatan, which is headquartered in Wilmington, Delaware.

A text-to-video model is a machine learning model which takes as input a natural language description and produces a video matching that description.

<span class="mw-page-title-main">DreamBooth</span> Deep learning generation model

DreamBooth is a deep learning generation model used to personalize existing text-to-image models by fine-tuning. It was developed by researchers from Google Research and Boston University in 2022. Originally developed using Google's own Imagen text-to-image model, DreamBooth implementations can be applied to other text-to-image models, allowing the model to generate more fine-tuned and personalized outputs after training on three to five images of a subject.

<span class="mw-page-title-main">Generative artificial intelligence</span> AI system capable of generating content in response to prompts

Generative artificial intelligence is artificial intelligence capable of generating text, images, or other media, using generative models. Generative AI models learn the patterns and structure of their input training data and then generate new data that has similar characteristics.

<span class="mw-page-title-main">AI boom</span> Rapid progress in generative AI since mid-2010s

The AI boom refers to an ongoing period of rapid and unprecedented development in the field of artificial intelligence, with the generative AI race being a key component of this boom, which began in earnest following the founding of OpenAI in 2015. OpenAI's generative AI systems, such as its various GPT models and DALL-E (2021), have played a significant role in driving this development.

In the 2020s, the rapid increase in the capabilities of deep learning-based generative artificial intelligence models, including text-to-image models such as Stable Diffusion and large language models such as ChatGPT, raised questions about how copyright law applies to the training and use of such models. Because there is limited existing case law, experts consider this area to be fraught with uncertainty.

Text-to-image personalization is a task in deep learning for computer graphics that augments pre-trained text-to-image generative models. In this task, a generative model that was trained on large-scale data is adapted such that it can generate images of novel, user-provided concepts. These concepts are typically unseen during training, and may represent specific objects or more abstract categories.

Runway AI, Inc. is an American company headquartered in New York City that specializes in generative artificial intelligence research and technologies. The company is primarily focused on creating products and models for generating videos, images, and various multimedia content. It is most notable for developing the first commercial text-to-video generative AI models Gen-1 and Gen-2 and co-creating the research for the popular image generation AI system Stable Diffusion.

References

  1. Coldewey, Devin (December 15, 2022). "Try 'Riffusion,' an AI model that composes music by visualizing it".
  2. Nasi, Michele (December 15, 2022). "Riffusion: creare tracce audio con l'intelligenza artificiale" [Riffusion: creating audio tracks with artificial intelligence]. IlSoftware.it.
  3. "Essayez "Riffusion", un modèle d'IA qui compose de la musique en la visualisant" [Try "Riffusion", an AI model that composes music by visualizing it]. December 15, 2022.
  4. "文章に沿った楽曲を自動生成してくれるAI「Riffusion」登場、画像生成AI「Stable Diffusion」ベースで誰でも自由に利用可能" [Riffusion, an AI that automatically generates music to match text, arrives; based on the image-generation AI Stable Diffusion and free for anyone to use]. GIGAZINE.
  5. Llano, Eutropio (December 15, 2022). "El generador de imágenes AI también puede producir música (con resultados de otro mundo)" [The AI image generator can also produce music (with otherworldly results)].
  6. "Mubert launches Text-to-Music interface – a completely new way to generate music from a single text prompt". December 21, 2022.
  7. "MusicLM: Generating Music From Text". January 26, 2023.
  8. "5 Reasons Google's MusicLM AI Text-to-Music App is Different". January 27, 2023.