Jamba (language model)

Jamba
Developer(s): AI21 Labs
Initial release: 29 March 2024
Type: Large language model
License: Apache 2.0

Jamba is an open-weights large language model (LLM) developed by AI21 Labs.[1][2][3] It uses a novel hybrid architecture that combines the Mamba state space model (SSM) with the transformer architecture.[4][1][5] It is a 52-billion-parameter model trained using a mixture-of-experts (MoE) technique, with 12 billion parameters active for any given token.[2][1] Jamba can fit up to 256K tokens in its context window and is the largest Mamba-variant LLM created.[2][4]
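The released weights can be loaded with standard open-source tooling. The following is a minimal sketch, assuming the checkpoint is published on Hugging Face under a repository id such as ai21labs/Jamba-v0.1; the id, prompt, and generation settings are illustrative and not taken from AI21's documentation:

```python
# Minimal sketch of loading Jamba's open weights with Hugging Face Transformers.
# The repository id below is an assumption for illustration; consult AI21's
# official release for the exact identifier and hardware requirements.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-v0.1"  # assumed repo id for the open-weights checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Jamba is a hybrid SSM-transformer model that", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because only about 12 of the 52 billion parameters are active per token, inference cost scales with the active-parameter count rather than the full model size.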


Jamba performs well on key measures such as throughput and efficiency, matching or outperforming other state-of-the-art models in its class on a wide range of performance benchmarks, while offering significantly greater context limits that enable use cases requiring longer context.[1][2] The model is released with open weights under the Apache 2.0 license.[6][5]

The company plans to release an instruction-tuned version in beta on the AI21 Platform in the near future.[7]

Characteristics

See also

Related Research Articles

A language model is a probabilistic model of a natural language. In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.

Multimodal learning, in the context of machine learning, is a type of deep learning that uses a combination of various modalities of data, often arising in real-world applications. An example of multimodal data is data that combines text with imaging data consisting of pixel intensities and annotation tags. Because these modalities have fundamentally different statistical properties, combining them is non-trivial, which is why specialized modelling strategies and algorithms are required. The model is then trained to be able to understand and work with multiple forms of data.

Neural machine translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model.

Mixture of experts (MoE) is a machine learning technique where multiple expert networks (learners) are used to divide a problem space into homogeneous regions. It differs from ensemble techniques in that for MoE, typically only one or a few expert models are run for each input, whereas in ensemble techniques, all models are run on every input.
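The routing idea can be illustrated with a short sketch, assuming a learned softmax gate that sends each input to its top-k experts; the layer sizes, expert count, and top-k value here are arbitrary illustrative choices, not taken from any particular model:

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy mixture-of-experts layer: a gate picks the top-k experts per token."""
    def __init__(self, d_model=16, n_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.gate(x)                      # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)          # normalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):                # only the selected experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask][:, k:k + 1] * expert(x[mask])
        return out

print(TinyMoE()(torch.randn(8, 16)).shape)         # torch.Size([8, 16])
```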

Transformer (deep learning architecture) – machine learning architecture used for natural-language processing

A transformer is a deep learning architecture developed by Google and based on the multi-head attention mechanism, proposed in the 2017 paper "Attention Is All You Need". It has no recurrent units and thus requires less training time than earlier recurrent neural architectures such as long short-term memory (LSTM), and its later variants have been widely adopted for training large language models (LLMs) on large (language) datasets, such as the Wikipedia corpus and Common Crawl. Text is converted into numerical representations called tokens, and each token is converted into a vector by looking it up in a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished. The transformer paper, published in 2017, builds on the softmax-based attention mechanism proposed by Bahdanau et al. in 2014 for machine translation, and on the Fast Weight Controller, similar to a transformer, proposed in 1992.
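The attention step described above can be sketched as scaled dot-product attention over a set of token vectors; this single-head NumPy version is a simplified illustration, not the full multi-head mechanism:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: weight each token's value by query-key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ V                                   # weighted sum of value vectors

tokens = np.random.rand(5, 8)                            # 5 tokens, 8-dimensional embeddings
print(scaled_dot_product_attention(tokens, tokens, tokens).shape)  # (5, 8)
```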

Generative Pre-trained Transformer 3 (GPT-3) is a large language model released by OpenAI in 2020. Like its predecessor, GPT-2, it is a decoder-only transformer-based deep neural network that replaces recurrence- and convolution-based architectures with a technique known as "attention". This attention mechanism allows the model to selectively focus on the segments of input text it predicts to be most relevant. It uses a 2,048-token context, float16 (16-bit) precision, and a then-unprecedented 175 billion parameters, requiring 350 GB of storage space as each parameter takes 2 bytes, and has demonstrated strong "zero-shot" and "few-shot" learning abilities on many tasks.
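The 350 GB figure follows directly from the parameter count and the 2 bytes used per float16 parameter, as the short calculation below shows:

```python
# Reproducing the storage estimate cited for GPT-3: 175 billion parameters
# at 2 bytes each (float16) is roughly 350 GB of raw weight storage.
params = 175_000_000_000
bytes_per_param = 2                              # float16
print(params * bytes_per_param / 1e9, "GB")      # 350.0 GB
```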

Wu Dao is a multimodal artificial intelligence developed by the Beijing Academy of Artificial Intelligence (BAAI). Wu Dao 1.0 was first announced on January 11, 2021; an improved version, Wu Dao 2.0, was announced on May 31. It has been compared to GPT-3 and is built on a similar architecture; in comparison, GPT-3 has 175 billion parameters (variables and inputs within the machine learning model) while Wu Dao has 1.75 trillion parameters. Wu Dao was trained on 4.9 terabytes of images and texts, while GPT-3 was trained on 45 terabytes of text data. Still, a growing body of work highlights the importance of increasing both data and parameters. The chairman of BAAI said that Wu Dao was an attempt to "create the biggest, most powerful AI model possible", although parameter count alone does not directly correlate with quality. Wu Dao 2.0 was called "the biggest language A.I. system yet" and was interpreted by commenters as an attempt to "compete with the United States". Notably, the architecture used for Wu Dao 2.0 is a mixture-of-experts (MoE) model, unlike GPT-3, which is a "dense" model: while MoE models require much less computational power to train than dense models with the same number of parameters, trillion-parameter MoE models have shown performance comparable to models that are hundreds of times smaller.

Prompt engineering is the process of structuring text that can be interpreted and understood by a generative AI model. A prompt is natural language text describing the task that an AI should perform.

AI21 Labs – Tel Aviv-based company

AI21 Labs is an Israeli company specializing in Natural Language Processing (NLP), which develops AI systems that can understand and generate natural language.

GPT-J – open-source artificial intelligence text-generating language model developed by EleutherAI

GPT-J or GPT-J-6B is an open-source large language model (LLM) developed by EleutherAI in 2021. As the name suggests, it is a generative pre-trained transformer model designed to produce human-like text that continues from a prompt. The optional "6B" in the name refers to the fact that it has 6 billion parameters.

A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation and other natural language processing tasks such as classification. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process. LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.
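The generation loop described above can be sketched as repeated next-token prediction; here next_token_logits is a hypothetical stand-in for a real model's forward pass, used only to show the loop structure:

```python
import numpy as np

def next_token_logits(token_ids, vocab_size=50):
    """Hypothetical stand-in for an LLM forward pass: one score per vocabulary item."""
    rng = np.random.default_rng(sum(token_ids))
    return rng.normal(size=vocab_size)

def generate(prompt_ids, max_new_tokens=10):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)
        ids.append(int(np.argmax(logits)))   # greedy decoding: pick the most likely next token
    return ids

print(generate([3, 14, 15], max_new_tokens=5))
```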

The Pile is an 886.03 GB diverse, open-source dataset of English text created as a training dataset for large language models (LLMs). It was constructed by EleutherAI in 2020 and publicly released on December 31 of that year. It is composed of 22 smaller datasets, including 14 new ones.

LLaMA is a family of autoregressive large language models (LLMs), released by Meta AI starting in February 2023.

PaLM – large language model developed by Google

PaLM is a 540 billion parameter transformer-based large language model developed by Google AI. Researchers also trained smaller versions of PaLM, 8 and 62 billion parameter models, to test the effects of model scale.

Wordtune is an AI-powered reading and writing companion capable of fixing grammatical errors, understanding context and meaning, suggesting paraphrases or alternative writing tones, and generating written text based on context. It is developed by the Israeli AI company AI21 Labs.

Gemini (language model) – large language model developed by Google

Gemini is a family of multimodal large language models developed by Google DeepMind, serving as the successor to LaMDA and PaLM 2. Comprising Gemini Ultra, Gemini Pro, and Gemini Nano, it was announced on December 6, 2023, positioned as a competitor to OpenAI's GPT-4. It powers the generative artificial intelligence chatbot of the same name.

Mistral AI is a French company selling artificial intelligence (AI) products. It was founded in April 2023 by previous employees of Meta Platforms and Google DeepMind. The company raised €385 million in October 2023 and in December 2023 it was valued at more than $2 billion.

Mamba is a deep learning architecture focused on sequence modeling. It was developed by researchers from Carnegie Mellon University and Princeton University to address some limitations of transformer models, especially in processing long sequences. It is based on the Structured State Space sequence (S4) model.
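At the core of S4-style models is a linear state space recurrence, h_t = A h_{t-1} + B x_t with output y_t = C h_t. The sketch below uses small fixed matrices for illustration; in Mamba the corresponding parameters are input-dependent (selective), which this simplified version does not model:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear state space recurrence: h_t = A h_{t-1} + B x_t,  y_t = C h_t."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                     # sequential scan over the input sequence
        h = A @ h + B * x_t
        ys.append(C @ h)
    return np.array(ys)

A = np.array([[0.9, 0.0], [0.1, 0.8]])   # state transition (illustrative values)
B = np.array([1.0, 0.5])                 # input projection
C = np.array([0.3, 0.7])                 # output projection
print(ssm_scan(np.sin(np.linspace(0, 3, 10)), A, B, C))
```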

Claude is a family of large language models developed by Anthropic. The first model was released in March 2023. Claude 3, released in March 2024, can also analyze images.

DBRX is an open-source large language model (LLM) developed by the Mosaic ML team at Databricks. It was trained on 3,072 Nvidia H100s connected by 3.2 terabytes per second of bandwidth (InfiniBand). The LLM has 132 billion parameters and was trained using a mixture-of-experts approach with 36 billion "active" parameters, meaning only a subset of the network is active for each token.

References

  1. 1 2 3 4 "Introducing Jamba: AI21's Groundbreaking SSM-Transformer Model". www.ai21.com. Retrieved 2024-03-29.
  2. 1 2 3 4 Kerner, Sean Michael (2024-03-28). "AI21 Labs juices up gen AI transformers with Jamba". VentureBeat. Retrieved 2024-03-29.
  3. Mawira, Benson (March 28, 2024). "Next-Generation AI System Promises Unprecedented Scalability". Cryptopolitan.
  4. 1 2 "AI21 Labs' Jamba infuses Mamba to bring more context to transformer-based LLMs". SiliconANGLE. 2024-03-28. Retrieved 2024-03-29.
  5. 1 2 "MLTimes - Time To Learn AI". mltimes.se. Retrieved 2024-03-29.
  6. AI21. "Unveiling Jamba: AI21's Groundbreaking Hybrid SSM-Transformer Open-Source Model". www.prnewswire.com. Retrieved 2024-03-29.{{cite web}}: CS1 maint: numeric names: authors list (link)
  7. 1 2 3 4 "AI21 Labs enhances the capabilities of gen AI transformers through Jamba integration". Global Village Space | Technology. 2024-03-28. Retrieved 2024-03-29.