BLOOM (language model)

BigScience Large Open-science Open-access Multilingual Language Model (BLOOM)[1][2] is a 176-billion-parameter transformer-based autoregressive large language model (LLM). The model, as well as the code base and the data used to train it, are distributed under free licences.[3] BLOOM was trained on approximately 366 billion tokens (1.6 TB of data) from March to July 2022.[4][5]
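
As an illustration, here is a minimal sketch of querying one of the released checkpoints through the Hugging Face transformers library. It assumes the "bigscience/bloom-560m" checkpoint id, one of the smaller released variants, so the example runs on ordinary hardware; it is a usage sketch, not the training setup.

    # Load a small BLOOM variant and generate a continuation of a prompt.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
    model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

    # BLOOM is autoregressive: it extends the prompt one token at a time.
    inputs = tokenizer("BLOOM is a multilingual language model that",
                       return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))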

BLOOM is the main outcome of the BigScience collaborative initiative,[6] a one-year-long research workshop that took place between May 2021 and May 2022. BigScience was led by Hugging Face and involved several hundred researchers and engineers from France and abroad, representing both academia and the private sector. The initiative was supported by a large-scale public compute grant on the French public supercomputer Jean Zay, managed by GENCI and IDRIS (CNRS), on which BLOOM was trained.

BLOOM's training corpus, named ROOTS, combines data extracted from the then-latest version of the web-based OSCAR corpus (38% of ROOTS) with newly collected data extracted from a manually selected and documented list of language data sources. It encompasses 46 natural languages (in amounts ranging from 30% of the whole dataset for English to 0.00002% for Chi Tumbuka) and 13 programming languages.[7]

Related Research Articles

Corpus linguistics is the study of a language as that language is expressed in its text corpus, its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field—the natural context ("realia") of that language—with minimal experimental interference. Large collections of text allow linguists to run quantitative analyses on linguistic concepts that might otherwise be hard to quantify.

In linguistics and natural language processing, a corpus or text corpus is a dataset consisting of natively digital and older, digitized language resources, either annotated or unannotated.

Word-sense disambiguation (WSD) is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious and automatic, but it can come to conscious attention when ambiguity impairs the clarity of communication, given the pervasive polysemy in natural language. In computational linguistics, it is an open problem that affects other natural-language tasks, such as discourse analysis, improving the relevance of search engines, anaphora resolution, coherence, and inference.
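
A minimal sketch of one classical approach, the simplified Lesk algorithm as implemented in the NLTK library (an illustrative choice, not the only WSD method; it requires the WordNet and tokenizer data, available via nltk.download):

    # Disambiguate "bank" by overlap between its WordNet glosses and the context.
    from nltk.tokenize import word_tokenize
    from nltk.wsd import lesk

    context = word_tokenize("I went to the bank to deposit my money")
    sense = lesk(context, "bank")        # WordNet sense with the best gloss overlap
    print(sense, "-", sense.definition())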

In bilingual education, students are taught in two languages. It is distinct from learning a second language as a subject because both languages are used for instruction in different content areas like math, science, and history. The time spent in each language depends on the model. For example, some models focus on providing education in both languages throughout a student's entire education while others gradually transition to education in only one language. The ultimate goal of bilingual education is fluency and literacy in both languages through a variety of strategies such as translanguaging and recasting.

A language model is a probabilistic model of a natural language. In 1980, the first significant statistical language model was proposed, and during the decade IBM performed "Shannon-style" experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.
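
For instance, here is a minimal sketch of a statistical bigram model, which estimates the probability of a word from the word preceding it; the toy corpus is invented for illustration:

    # Maximum-likelihood bigram model: P(word | prev) from corpus counts.
    from collections import Counter

    corpus = "the cat sat on the mat the cat ran".split()
    bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs
    unigrams = Counter(corpus[:-1])              # counts of each left context

    def bigram_prob(prev, word):
        return bigrams[(prev, word)] / unigrams[prev]

    print(bigram_prob("the", "cat"))  # 2/3: "the" precedes "cat" twice, "mat" once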

Plant ontology (PO) is a collection of ontologies developed by the Plant Ontology Consortium. These ontologies describe anatomical structures and growth and developmental stages across Viridiplantae. The PO is intended for multiple applications, including genetics, genomics, phenomics, and development, taxonomy and systematics, semantic applications and education.

A non-native speech database is a speech database of non-native pronunciations of English. Such databases are used in the development of: multilingual automatic speech recognition systems, text to speech systems, pronunciation trainers, and second language learning systems.

Internet linguistics is a domain of linguistics advocated by the English linguist David Crystal. It studies new language styles and forms that have arisen under the influence of the Internet and of other new media, such as Short Message Service (SMS) text messaging. Since the beginning of human–computer interaction (HCI), leading to computer-mediated communication (CMC) and Internet-mediated communication (IMC), experts such as Gretchen McCulloch have acknowledged that linguistics has a contributing role in these fields, in terms of web interface and usability. Studying the language emerging on the Internet can help improve conceptual organization, translation, and web usability, benefiting linguists and web users alike.

Referring expression generation (REG) is the subtask of natural language generation (NLG) that has received the most scholarly attention. While NLG is concerned with the conversion of non-linguistic information into natural language, REG focuses only on the creation of referring expressions that identify specific entities, called targets.

MedSLT is a medium-ranged open source spoken language translator developed by the University of Geneva. It is funded by the Swiss National Science Foundation. The system has been designed for the medical domain. It currently covers doctor–patient diagnosis dialogues for the domains of headache, chest and abdominal pain in English, French, Japanese, Spanish, Catalan and Arabic. The vocabulary used ranges from 350 to 1000 words depending on the domain and language pair.

The following outline is provided as an overview of and topical guide to natural-language processing.

Sir Jeremy James Farrar is a British medical researcher who has served as Chief Scientist at the World Health Organization since 2023. He was previously the director of The Wellcome Trust from 2013 to 2023 and a professor of tropical medicine at the University of Oxford.

The BABEL speech corpus is a corpus of recorded speech materials from five Central and Eastern European languages. Intended for use in speech technology applications, it was funded by a grant from the European Union and completed in 1998. It is distributed by the European Language Resources Association.

Word2vec is a technique in natural language processing (NLP) for obtaining vector representations of words. These vectors capture information about the meaning of a word and its usage in context. The word2vec algorithm estimates these representations by modeling text in a large corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen carefully such that they capture the semantic and syntactic qualities of words; as such, a simple mathematical function can indicate the level of semantic similarity between the words represented by those vectors.
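
A minimal sketch using the gensim library's Word2Vec implementation; the three-sentence corpus is invented for illustration, and meaningful similarities would require far more text:

    # Train word vectors on a toy corpus and compare two of them.
    from gensim.models import Word2Vec

    sentences = [
        ["the", "king", "rules", "the", "kingdom"],
        ["the", "queen", "rules", "the", "kingdom"],
        ["the", "cat", "sat", "on", "the", "mat"],
    ]
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=1)

    vector = model.wv["king"]                    # the 50-dimensional vector for "king"
    print(model.wv.similarity("king", "queen"))  # cosine similarity of two vectors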

A transformer is a deep learning architecture based on the multi-head attention mechanism, proposed in the 2017 paper "Attention Is All You Need". It has no recurrent units and thus requires less training time than earlier recurrent neural architectures such as long short-term memory (LSTM); its later variants have been widely adopted for training large language models on large datasets, such as the Wikipedia corpus and Common Crawl. Input text is split into tokens, and each token is converted into a vector by looking it up in a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished. The transformer builds on the softmax-based attention mechanism proposed by Bahdanau et al. in 2014 for machine translation; the Fast Weight Controller, an architecture similar to the transformer, was proposed in 1992.
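
A minimal NumPy sketch of the scaled dot-product attention at the core of the multi-head mechanism (a single head, no masking, with toy dimensions invented for illustration):

    # Each token's output is a softmax-weighted sum of value vectors, with
    # weights from the dot products of its query against all keys.
    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                   # query-key similarities
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
        return weights @ V                                # weighted sum of values

    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))  # 4 tokens, dimension 8
    print(scaled_dot_product_attention(Q, K, V).shape)     # (4, 8)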

Hugging Face, Inc. is a French-American company based in New York City that develops computer tools for building applications using machine learning. It is most notable for its transformers library built for natural language processing applications and its platform that allows users to share machine learning models and datasets and showcase their work.

Patricia Ana Matrai is a marine scientist known for her work on the cycling of sulfur. She is a senior research scientist at Bigelow Laboratory for Ocean Sciences.

EleutherAI is a grass-roots non-profit artificial intelligence (AI) research group. The group, considered an open-source version of OpenAI, was formed in a Discord server in July 2020 to organize a replication of GPT-3. In early 2023, it formally incorporated as the EleutherAI Foundation, a non-profit research institute.

A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation and understanding. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process. LLMs are artificial neural networks, the largest and most capable of which are built with a transformer-based architecture. Some recent implementations are based on other architectures, such as recurrent neural network variants and Mamba.
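
A minimal sketch of the self-supervised next-token objective behind that training process; the uniform "model" here is a stand-in invented for illustration, which a real LLM replaces with a transformer's predicted distribution:

    # Average cross-entropy of predicting each token from the tokens before it.
    import numpy as np

    vocab_size = 8
    tokens = [3, 1, 4, 1, 5]                       # a toy token-id sequence

    def next_token_loss(token_ids, probs):
        return float(np.mean([-np.log(probs[t]) for t in token_ids[1:]]))

    uniform = np.full(vocab_size, 1.0 / vocab_size)
    print(next_token_loss(tokens, uniform))        # log(8), about 2.08 nats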

References

  1. "BigScience Large Open-science Open-access Multilingual Language Model" . Retrieved 2022-10-01.
  2. Le Scao T, Fan A, Akiki C, Pavlick E, Ilić S, Hesslow D, Castagné R, Luccioni A, Yvon F, Gallé M, Tow J, Rush AM, Biderman S, Webson A, Sasanka Ammanamanchi P, Wang T, Sagot B, Muennighoff N, Villanova del Moral A, Ruwase O, Bawden R, Bekman S, McMillan-Major A, Beltagy I, Nguyen H, Saulnier L, Tan S, Ortiz Suarez P, Sanh V, Laurençon H, Jernite Y, Launay J, Mitchell M, Raffel C, et al. (2022). "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model". arXiv: 2211.05100 .
  3. "The BigScience RAIL license" . Retrieved 2024-01-10.
  4. Heikkilä, Melissa (2022-07-12). "BLOOM: Inside the radical new project to democratize AI". MIT Technology Review . Retrieved 2023-12-26.
  5. "Release of largest trained open-science multilingual language model ever". French National Centre for Scientific Research . 2022-07-12. Retrieved 2023-12-26.
  6. "BigScience" . Retrieved 2024-01-10.
  7. Laurençon H, Saulnier L, Wang T, Akiki C, Villanova del Moral A, Le Scao T, Von Werra L, Mou C, González Ponferrada C, Nguyen H, Frohberg J, Šaško M, Lhoest Q, McMillan-Major A, Dupont G, Biderman S, Rogers A, Ben allal L, De Toni F, Pistilli G, Nguyen O, Nikpoor S, Masoud M, Colombo P, de la Rosa J, Villegas P, Thrush T, Longpre S, Nagel S, Weber L, Muñoz M, Zhu J, Van Strien D, Alyafeai Z, Almubarak K, Vu MC, Gonzalez-Dios I, Soroa A, Lo K, Dey M, Ortiz Suarez P, Gokaslan A, Bose S, Adelani D, Phan L, Tran H, Yu I, Pai S, Chim J, Lepercq V, Ilic S, Mitchell M, Luccioni S, Jernite Y (2022). "The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset". arXiv: 2303.03915 .