| Developer(s) | Udio |
| --- | --- |
| Initial release | April 10, 2024 |
| Stable release | v1.5 / July 23, 2024 |
| Type | Generative artificial intelligence |
| Website | udio.com |
Udio is a generative artificial intelligence model that produces music from simple text prompts. It can generate both vocals and instrumentation. Its free beta version was released publicly on April 10, 2024. Users can pay for a monthly or annual subscription to unlock additional capabilities such as audio inpainting.
Udio was developed by Uncharted Labs, a company founded in December 2023 by a team of former Google DeepMind researchers headed by Udio's CEO, David Ding. The program received financial backing from the venture capital firm Andreessen Horowitz and from musicians will.i.am and Common, among others. Critics praised its ability to create realistic-sounding vocals, while others raised concerns that its training data may have included copyrighted music.
Udio was created in December 2023 by a team of four former Google DeepMind researchers, Udio's CEO David Ding, Conor Durkan, Charlie Nash, and Yaroslav Ganin, together with Andrew Sanchez, [1][2] under the name Uncharted Labs. [3] The venture capital firm Andreessen Horowitz; the music distributor UnitedMasters; musicians will.i.am, Tay Keith, and Common; investor Kevin Wall; Instagram cofounder Mike Krieger; and DeepMind researcher Oriol Vinyals all provided financial backing for Udio, which raised $10 million in seed funding in addition to the $8.5 million it had raised previously. [3][4] The program spent several months in a closed beta before its public beta release on the Udio website on April 10, 2024. [5] As of April 2024, it allows users to generate 600 songs per month for free. [6] Sanchez described it as "enabl[ing musicians] to create great music and ... to make money off of that music in the future". [1] Udio's release followed those of other text-to-music generators such as Suno AI and Stable Audio. [7]
Udio was used to create "BBL Drizzy" by Willonius Hatcher, a parody song that went viral during the Drake–Kendrick Lamar feud, amassing over 23 million views on Twitter and 3.3 million streams on SoundCloud in its first week. [8]
In August 2024, "Verknallt in einen Talahon" ("In Love with a Talahon"), a song generated with Udio by Austrian producer Butterbro, became the first AI-generated song to enter the German Top 50. [9]
Udio bases the songs it creates on text prompts, which can specify genre (barbershop quartet, country, classical, hip hop, German pop, and hard rock, among others), lyrics, story direction, and artists whose sound should be emulated. Its lyrics are created with a large language model (LLM), while the process used to generate the music itself, as of April 2024, has not been disclosed. [10] The program generates two songs per prompt, and users can "remix" their songs with further text prompts. [11] Songs are first generated as roughly 30-second pieces and can be extended in additional 30-second increments. [6] Paying subscribers can access advanced functionality such as audio inpainting. [12][13]
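Udio has not published a public API, and its generation pipeline is undisclosed, so the workflow described above can only be illustrated schematically. The following is a minimal, purely hypothetical Python sketch of that user-facing flow (prompt with genre and lyrics, two candidate clips per prompt, extension in roughly 30-second increments); the names `Prompt`, `Clip`, `generate`, and `extend` are invented for illustration and do not correspond to any real Udio interface.

```python
from dataclasses import dataclass, field

CLIP_SECONDS = 30  # songs are generated and extended in roughly 30-second pieces


@dataclass
class Prompt:
    """Hypothetical model of a Udio-style text prompt."""
    genre: str                       # e.g. "country", "barbershop quartet"
    lyrics: str = ""                 # lyrics are reportedly handled by an LLM
    style_references: list[str] = field(default_factory=list)


@dataclass
class Clip:
    """A generated song candidate of a given length."""
    prompt: Prompt
    length_seconds: int = CLIP_SECONDS


def generate(prompt: Prompt, variations: int = 2) -> list[Clip]:
    # Mirrors the described behavior: each prompt yields two ~30-second candidates.
    return [Clip(prompt) for _ in range(variations)]


def extend(clip: Clip) -> Clip:
    # Extending a clip adds another ~30-second increment.
    return Clip(clip.prompt, clip.length_seconds + CLIP_SECONDS)


# Example: generate two candidates for a country prompt, then extend one once.
candidates = generate(Prompt(genre="country", lyrics="Dust on the dashboard..."))
longer = extend(candidates[0])
print(longer.length_seconds)  # 60
```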
Mark Hachman, the senior editor of PC World, compared Udio to AI art generators and praised its ability to turn "a few rather poor lyrics" into a "rather catchy" song, calling the vocals it generated "incredibly realistic and even emotional". [6] Sabrina Ortiz of ZDNET described the songs it generated as "impressive" and sounding "as though they were produced professionally"; she also called them "fuller and richer" than those of other text-to-music generators and said Udio offered "more personalization options" than its rivals. [5] Tom's Guide's Ryan Morrison wrote that Udio had "an uncanny ability to capture emotion in synthetic vocals" and was the only AI music generator "to have captured the passion, pain and spirit of a vocal performance". [14] He added that the program was geared toward "people with no or minimal musical ability". [2] Brian Hiatt of Rolling Stone wrote that Udio was "more customizable but also perhaps less intuitive to use" than Suno AI and added that "some early users have suggested that on average, Udio's output may sound crisper than Suno's". [1]
For Ars Technica, Benj Edwards wrote that Udio's generation capability was imperfect and "less impressive" than Suno AI's, noting that its songs were substantially shorter than Suno AI's; he also called the songs it produced "half-baked and almost nightmarish". [10] In response to the company's announcement of Udio's beta release on Twitter, Telefon Tel Aviv member Joshua Eustis tweeted that Udio was "an app to replace musicians" and questioned the data it used. Udio has also been criticized online as "soulless" and for having the potential to create audio deepfakes. [11][7] Lucas Ropek of Gizmodo stated that Udio was "full of acoustical nonsense" and that its songs were "extraordinarily bad". [15]
Critics of Udio have questioned what data was used to train it and whether that data included copyrighted music. Rolling Stone wrote that there was "substantial reason to believe" that both Udio and Suno AI were trained on copyrighted music, while Benj Edwards of Ars Technica wrote that Udio's training data was "likely filled with copyrighted material". [10][11] Udio does not directly recreate copyrighted songs when prompted to do so. [6] Ding has stated that Udio has "extensive automated copyright filters" and that the company is "continually refining [its] safeguards". [7] Stability AI took a different approach with Stable Audio 2.0, training it on a dataset of music explicitly licensed from AudioSparx. [16]
In June 2024, a lawsuit led by the Recording Industry Association of America was filed against Udio and Suno, alleging widespread infringement of copyrighted sound recordings. The lawsuit sought to bar the companies from training on copyrighted music, as well as damages of up to $150,000 per work for infringements that had already taken place. [17][18]