LAION

LAION
Company type Non-profit
Industry Artificial intelligence
Founder
  • Christoph Schuhmann
  • Jenia Jitsev
  • Richard Vencu
  • Robert Kaczmarczyk
  • Theo Coombes
  • Mehdi Cherti
  • Aarush Katta
  • Jan Ebert
Website laion.ai

LAION (acronym for Large-scale Artificial Intelligence Open Network) is a German non-profit that makes open-source artificial intelligence models and datasets. [1] It is best known for releasing several large datasets of images and captions scraped from the web, which have been used to train high-profile text-to-image models including Stable Diffusion and Imagen. [2] [3]


In February 2023, LAION was named as a non-party in Getty Images' lawsuit against Stability AI over Stable Diffusion. [4] In April 2023, LAION was sued directly by a German photographer who wanted his images removed from the training set. [5]

On April 15, 2023, LAION and contributors publicly released OpenAssistant, an open-source AI assistant chatbot.

Image datasets

LAION has publicly released a number of large datasets of image-caption pairs which have been widely used by AI researchers. The data is derived from the Common Crawl, a dataset of scraped web pages. The developers searched the crawled HTML for <img> tags and treated their alt attributes as captions. They used CLIP to identify and discard images whose content did not appear to match their captions. [6] LAION does not host the scraped images themselves; rather, its datasets contain URLs pointing to the images, which researchers must download on their own. [7]
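
The filtering pipeline described above can be illustrated with a short sketch. The Python code below (using BeautifulSoup to parse HTML and the Hugging Face transformers implementation of CLIP) is only an approximation under stated assumptions: the checkpoint name and the similarity threshold are illustrative, not the settings LAION actually used.

# A minimal sketch of the alt-text extraction and CLIP filtering described above.
# The CLIP checkpoint and the 0.3 threshold are illustrative assumptions.
import requests
from io import BytesIO

import torch
from bs4 import BeautifulSoup
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def extract_pairs(html: str) -> list[tuple[str, str]]:
    # Collect (image URL, alt text) candidate pairs from a crawled HTML page.
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for tag in soup.find_all("img"):
        url, alt = tag.get("src"), tag.get("alt")
        if url and alt and alt.strip():
            pairs.append((url, alt.strip()))
    return pairs

@torch.no_grad()
def clip_similarity(image: Image.Image, caption: str) -> float:
    # Cosine similarity between the CLIP embeddings of an image and its caption.
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return float((image_emb * text_emb).sum())

def filter_pairs(html: str, threshold: float = 0.3) -> list[tuple[str, str]]:
    # Keep only pairs whose image content appears to match the alt text.
    kept = []
    for url, alt in extract_pairs(html):
        try:
            image = Image.open(BytesIO(requests.get(url, timeout=10).content)).convert("RGB")
        except Exception:
            continue  # skip unreachable or non-image URLs
        if clip_similarity(image, alt) >= threshold:
            kept.append((url, alt))
    return kept

At LAION's scale this step runs as a massively parallel batch job rather than page by page, but the per-pair logic (embed, compare, threshold) is the same.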

The first such dataset, LAION-400M, was released in August 2021 and consisted of 400 million image-caption pairs. The pairs were extracted from a random subset of webpages scraped by Common Crawl between 2014 and 2021. [8] It was an attempt to recreate the process OpenAI had used to collect the 400 million image-caption pairs on which it trained the CLIP model; the company had chosen to open-source the model's code and weights, but not its training dataset. [6] Imagen, a text-to-image model announced by Google Brain in 2022, was trained on LAION-400M in combination with private internal datasets. [9]

A successor of more than 5 billion pairs, LAION-5B, was released in March 2022. [10] As of its release, it was the largest freely available dataset of image-caption pairs in existence. [6] Its creation was funded by Doodlebot, Hugging Face, and Stability AI, the AI company that funded the development of the Stable Diffusion text-to-image model, which was trained on LAION-5B. [11]
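
Because the datasets are distributed as metadata rather than image files, using them begins with a bulk download step. The sketch below assumes a parquet metadata file with "URL" and "TEXT" columns (the file name and column names are assumptions here and vary between releases); at full scale, researchers rely on parallel downloaders such as img2dataset rather than a simple loop like this.

# A minimal sketch of turning a LAION-style URL/caption metadata file into
# image files on disk. The file name and column names are assumptions.
import os
from io import BytesIO

import pandas as pd
import requests
from PIL import Image

def download_subset(metadata_path: str, out_dir: str, limit: int = 100) -> None:
    os.makedirs(out_dir, exist_ok=True)
    df = pd.read_parquet(metadata_path)
    captions = {}
    for i, row in df.head(limit).iterrows():
        try:
            resp = requests.get(row["URL"], timeout=10)
            resp.raise_for_status()
            image = Image.open(BytesIO(resp.content)).convert("RGB")
        except Exception:
            continue  # dead links and non-image responses are common
        name = f"{i:08d}"
        image.save(os.path.join(out_dir, name + ".jpg"))
        captions[name] = row["TEXT"]
    pd.Series(captions, name="caption").to_csv(os.path.join(out_dir, "captions.csv"))

# Example (hypothetical file name):
# download_subset("laion-metadata-part-00000.parquet", "laion_sample", limit=50)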

Criticism

Several studies have shown that LAION-5B contains problematic image-text pairs depicting rape and pornography, as well as malign stereotypes, racist and ethnic slurs, and other extremely problematic content. [12] [13]

An investigation by Bayerischer Rundfunk showed that LAION's datasets, hosted on Hugging Face, contain large amounts of private and sensitive data. [14]

In December 2023, the Stanford Internet Observatory released a report on LAION-5B that found 3,226 suspected instances of links to child sexual abuse material, of which 1,008 were externally validated. In response, LAION temporarily removed LAION-5B and LAION-400M, citing its "zero tolerance policy for illegal content" and "an abundance of caution". [15]

OpenAssistant

OpenAssistant
Developer(s) LAION and contributors
Initial release 15 April 2023
Type Chatbot
License Apache License 2.0
Website open-assistant.io

OpenAssistant is an open-source, chat-based artificial intelligence (AI) assistant that understands tasks, can interact with third-party systems, and can retrieve information dynamically to do so. The project is developed by a group of volunteers in collaboration with LAION. One of its development goals is free access to large language models that can be run locally on consumer hardware. [16] [17] The project is backed by a worldwide crowdsourcing effort involving over 13,500 volunteers, who have created 600,000 human-generated data points. [17] [18]
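
As a concrete illustration of the local-use goal, a released OpenAssistant checkpoint can be loaded with standard Hugging Face transformers tooling. The sketch below is not official project code: the checkpoint name and the <|prompter|>/<|assistant|> prompt tokens are assumptions based on one published model, and the exact conventions should be taken from the corresponding model card.

# A minimal sketch of running an OpenAssistant checkpoint locally.
# The model name and prompt format below are assumptions; see the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision to fit consumer GPUs
    device_map="auto",          # place layers on whatever hardware is available
)

prompt = "<|prompter|>What is LAION?<|endoftext|><|assistant|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Smaller checkpoints in the same family follow the same pattern and are the ones most likely to fit on consumer hardware.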


References

  1. "About". LAION.ai. Retrieved 26 September 2022.
  2. Edwards, Benj (15 September 2022). "Have AI image generators assimilated your art? New tool lets you check". Ars Technica.
  3. Newman, Marissa; Cantrill, Aggi (24 April 2023). "The Future of AI Relies on a High School Teacher's Free Database". Bloomberg News. Retrieved 24 April 2023.
  4. "Getty Images (US), Inc. v. Stability AI, Inc., 1:23-cv-00135". CourtListener. Retrieved 2023-02-08.
  5. "A Photographer Tried to Get His Photos Removed from an AI Dataset. He Got an Invoice Instead". Vice. 28 April 2023. Retrieved 2023-05-04.
  6. Alford, Anthony (17 May 2022). "LAION Releases Five Billion Image-Text Pair Dataset LAION-5B". InfoQ.
  7. Edwards, Benj (21 September 2022). "Artist finds private medical record photos in popular AI training data set". Ars Technica.
  8. Schuhmann, Christoph (8 August 2021). "LAION-400-Million Open Dataset". LAION blog. Retrieved 26 September 2022.
  9. Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily; Ghasemipour, Seyed Kamyar Seyed; Karagol Ayan, Burcu; Mahdavi, S. Sara; Gontijo Lopes, Rapha; Salimans, Tim; Ho, Jonathan; Fleet, David J.; Norouzi, Mohammad (23 May 2022). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding". arXiv:2205.11487 [cs.CV].
  10. Beaumont, Romain (3 March 2022). "LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets". LAION blog.
  11. Wiggers, Kyle (12 August 2022). "This startup is setting a DALL-E 2-like AI free, consequences be damned". TechCrunch.
  12. Birhane, Abeba; Prabhu, Vinay Uday; Kahembwe, Emmanuel (2021). "Multimodal datasets: misogyny, pornography, and malignant stereotypes". arXiv:2110.01963.
  13. Birhane, Abeba; Prabhu, Vinay; Han, Sang; Boddeti, Vishnu Naresh; Luccioni, Alexandra Sasha (6 November 2023). "Into the LAIONs Den: Investigating Hate in Multimodal Datasets". arXiv:2311.03449.
  14. Brunner, Katharina; Harlan, Elisa. "We Are All Raw Material for AI". Bayerischer Rundfunk.
  15. Cole, Samantha (20 December 2023). "Largest Dataset Powering AI Images Removed After Discovery of Child Sexual Abuse Material". 404 Media. Retrieved 22 December 2023.
  16. "Open-Assistant". LAION AI. 9 March 2023. Retrieved 9 March 2023.
  17. Köpf, Andreas; Kilcher, Yannic; von Rütte, Dimitri; Anagnostidis, Sotiris; Tam, Zhi-Rui; Stevens, Keith; Barhoum, Abdullah; Duc, Nguyen Minh; Stanley, Oliver; Nagyfi, Richárd; ES, Shahul; Suri, Sameer; Glushkov, David; Dantuluri, Arnav; Maguire, Andrew (14 April 2023). "OpenAssistant Conversations – Democratizing Large Language Model Alignment". arXiv:2304.07327 [cs.CL].
  18. "Open Assistant: Explore the Possibilities of Open and Collaborative Chatbot Development". KDnuggets. Retrieved 5 May 2023.