T5 (language model)

Text-to-Text Transfer Transformer (T5)
Original author(s): Google AI
Initial release: 23 October 2019
Repository: https://github.com/google-research/text-to-text-transfer-transformer
License: Apache-2.0
Website: blog.research.google/2020/02/exploring-transfer-learning-with-t5.html

T5 (Text-to-Text Transfer Transformer) is a series of large language models developed by Google AI and introduced in 2019. [1] [2] Like the original Transformer model, [3] T5 models are encoder-decoder Transformers, where the encoder processes the input text and the decoder generates the output text.


T5 models are usually pretrained on a massive dataset of text and code, after which they can perform text-based tasks similar to those they were pretrained on. They can also be fine-tuned to perform other tasks.

T5 models have been employed in various applications, including chatbots, machine translation systems, text summarization tools, code generation, and robotics. [4]

Training

The original T5 models are pre-trained on the Colossal Clean Crawled Corpus (C4), containing text and code scraped from the internet. This pre-training process enables the models to learn general language understanding and generation abilities. T5 models can then be fine-tuned on specific downstream tasks, adapting their knowledge to perform well in various applications.

The T5 models were pretrained on many tasks, all in the format of <input text> -> <output text>.

How a T5 can be fine-tuned for a summarization task.

Some examples are:

- Restoring corrupted text: "Thank you <X> me to your party <Y> week." -> "<X> for inviting <Y> last <Z>", where <X>, <Y>, and <Z> are sentinel tokens marking the corrupted spans.
- Translation: "translate English to German: That is good." -> "Das ist gut."
- Judging the grammatical acceptability of a sentence (CoLA): "cola sentence: The course is jumping well." -> "not acceptable"
- Summarization: "summarize: <article text>" -> "<summary>"
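The first of these formats can be reproduced with the Hugging Face transformers library, which exposes the sentinel tokens as <extra_id_0>, <extra_id_1>, and so on. The following is a minimal sketch; the google-t5/t5-small checkpoint is chosen only for illustration:

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")

    # Corrupted input: spans of the original text are replaced by sentinel tokens ...
    input_ids = tokenizer(
        "Thank you <extra_id_0> me to your party <extra_id_1> week.",
        return_tensors="pt",
    ).input_ids
    # ... and the target spells out what each sentinel token stood for.
    labels = tokenizer(
        "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>",
        return_tensors="pt",
    ).input_ids

    # Pretraining minimizes the usual sequence-to-sequence cross-entropy loss.
    loss = model(input_ids=input_ids, labels=labels).loss
    print(float(loss))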

Architecture

T5 encoder-decoder structure, showing the attention structure. In the encoder self-attention (lower square), all input tokens attend to each other; in the encoder–decoder cross-attention (upper rectangle), each target token attends to all input tokens; in the decoder self-attention (upper triangle), each target token attends to present and past target tokens only (causal).

The T5 series encompasses several models with varying sizes and capabilities, all encoder-decoder Transformers, where the encoder processes the input text, and the decoder generates the output text.
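To make the attention patterns in the figure above concrete, the following is a small illustrative sketch (plain PyTorch, not taken from the T5 codebase) of the boolean attention masks for an input of length m and a target of length n:

    import torch

    m, n = 5, 4  # input (encoder) length and target (decoder) length

    # Encoder self-attention: every input token may attend to every input token.
    encoder_self_mask = torch.ones(m, m).bool()

    # Encoder-decoder cross-attention: every target token may attend to every input token.
    cross_mask = torch.ones(n, m).bool()

    # Decoder self-attention: causal, so each target token may attend only to
    # itself and to earlier target tokens.
    decoder_self_mask = torch.tril(torch.ones(n, n)).bool()

    print(decoder_self_mask)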

These models are often distinguished by their parameter count, which indicates the complexity and potential capacity of the model. The original paper [1] reported the following 5 models:

T5 properties [note 1]
Name  | Total parameters | Encoder parameters | Decoder parameters | n_layer | d_model | d_ff  | d_kv | n_head
Small | 76,956,160       | 35,330,816         | 41,625,344         | 6       | 512     | 2048  | 64   | 8
Base  | 247,577,856      | 109,628,544        | 137,949,312        | 12      | 768     | 3072  | 64   | 12
Large | 770,567,168      | 334,939,648        | 435,627,520        | 24      | 1024    | 4096  | 64   | 16
3B    | 2,884,497,408    | 1,240,909,824      | 1,643,587,584      | 24      | 1024    | 16384 | 128  | 32
11B   | 11,340,220,416   | 4,864,791,552      | 6,475,428,864      | 24      | 1024    | 65536 | 128  | 128

*The encoder and the decoder have the same shape. So for example, the T5-small has 6 layers in the encoder and 6 layers in the decoder.

In the above table,

- n_layer is the number of Transformer blocks in the encoder (the decoder has the same number);
- d_model is the dimension of the token embedding vectors;
- d_ff is the inner dimension of the feedforward network in each block;
- d_kv is the dimension of the key and value vectors in each attention head;
- n_head is the number of attention heads.

Note that unlike typical Transformers, the 3B and 11B models do not satisfy d_model = d_kv × n_head. [6]
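This can be checked against the released configuration files, for example with the Hugging Face transformers library (a minimal sketch; the google-t5 model names are those used on the Hugging Face Hub):

    from transformers import AutoConfig

    # Compare d_model with d_kv * n_head for two published checkpoints.
    for name in ["google-t5/t5-small", "google-t5/t5-3b"]:
        cfg = AutoConfig.from_pretrained(name)
        print(name, cfg.d_model, cfg.d_kv * cfg.num_heads)
    # t5-small follows the usual convention (512 = 64 * 8),
    # while t5-3b does not (1024 vs. 128 * 32 = 4096).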

Compared to the original Transformer, T5 uses a few minor modifications: layer normalization with no additive bias, placement of the layer normalization outside the residual path, and relative positional embeddings. [7]
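As an illustration of the first two modifications, the following is a minimal PyTorch sketch (not the reference implementation; T5's released layer normalization also omits the mean-subtraction step, so it is RMS-style):

    import torch
    import torch.nn as nn

    class T5StyleLayerNorm(nn.Module):
        """Scale-only normalization: no additive bias (and no mean subtraction)."""
        def __init__(self, d_model, eps=1e-6):
            super().__init__()
            self.weight = nn.Parameter(torch.ones(d_model))
            self.eps = eps

        def forward(self, x):
            variance = x.pow(2).mean(-1, keepdim=True)
            return self.weight * x * torch.rsqrt(variance + self.eps)

    def sublayer(x, block, norm):
        # The normalization sits outside the residual path: the residual branch
        # adds the unnormalized input back onto the sublayer output.
        return x + block(norm(x))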

For all experiments, they used a SentencePiece tokenizer (producing WordPiece tokens) with a vocabulary size of 32,000, shared across both the input and output of each model. It was trained on a mixture of English, German, French, and Romanian data from the C4 dataset, at a ratio of 10:1:1:1.
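A brief sketch of the shared tokenizer in use, assuming the Hugging Face transformers library and the google-t5/t5-small checkpoint:

    from transformers import AutoTokenizer

    # The same tokenizer is applied to model inputs and model outputs.
    tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")

    source = "translate English to German: That is good."
    target = "Das ist gut."

    print(tokenizer.tokenize(source))  # subword pieces of the English input
    print(tokenizer.tokenize(target))  # the German target uses the same vocabulary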

Variants

Several subsequent models used the T5 architecture, with non-standardized naming conventions used to differentiate them. This section attempts to collect the main ones. An exhaustive list of the variants released by Google Brain is on the GitHub repo for T5X. [8]

Some models are trained from scratch, while others are initialized from a previously trained model. Unless otherwise noted, each model is trained from scratch.

T5 v1.1 properties [note 2]
Name  | Total parameters | Encoder parameters | Decoder parameters | n_layer | d_model | d_ff  | d_kv | n_head
Small | 76,961,152       | 35,332,800         | 41,628,352         | 8       | 512     | 1024  | 64   | 6
Base  | 247,577,856      | 109,628,544        | 137,949,312        | 12      | 768     | 2048  | 64   | 12
Large | 783,150,080      | 341,231,104        | 441,918,976        | 24      | 1024    | 2816  | 64   | 16
XL    | 2,849,757,184    | 1,223,527,424      | 1,626,229,760      | 24      | 2048    | 5120  | 64   | 32
XXL   | 11,135,332,352   | 4,762,310,656      | 6,373,021,696      | 24      | 4096    | 10240 | 64   | 64

Applications

The T5 model itself is an encoder-decoder model, allowing it to be used for instruction following. The encoder encodes the instruction, and the decoder autoregressively generates the reply.
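A minimal sketch of this usage with the Hugging Face transformers library (the google-t5/t5-small checkpoint and the translation prefix are chosen only for illustration; instruction-tuned variants such as Flan-T5 are used the same way):

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")

    # The encoder reads the prefixed instruction ...
    inputs = tokenizer("translate English to German: The house is wonderful.",
                       return_tensors="pt")

    # ... and the decoder autoregressively generates the reply.
    output_ids = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))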

The T5 encoder can be used as a text encoder, much like BERT. It encodes a text into a sequence of real-number vectors, which can be used for downstream applications. For example, Google Imagen [26] uses T5-XXL as its text encoder, and the encoded text vectors are used to condition a diffusion model. As another example, the AuraFlow diffusion model [27] uses Pile-T5-XL.
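A minimal sketch of encoder-only use with the Hugging Face transformers library (Imagen and AuraFlow rely on much larger checkpoints; google-t5/t5-small is used here only to keep the example small):

    import torch
    from transformers import AutoTokenizer, T5EncoderModel

    tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
    encoder = T5EncoderModel.from_pretrained("google-t5/t5-small")

    inputs = tokenizer("A photograph of a corgi riding a bicycle", return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)

    # One real-valued vector per input token; a downstream model (for example a
    # diffusion model) can consume this sequence as conditioning.
    print(outputs.last_hidden_state.shape)  # (batch, sequence_length, d_model)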

Related Research Articles

Multimodal learning is a type of deep learning that integrates and processes multiple types of data, referred to as modalities, such as text, audio, images, or video. This integration allows for a more holistic understanding of complex data, improving model performance in tasks like visual question answering, cross-modal retrieval, text-to-image generation, aesthetic ranking, and image captioning.

Neural machine translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model.

Google Neural Machine Translation (GNMT) was a neural machine translation (NMT) system developed by Google and introduced in November 2016 that used an artificial neural network to increase fluency and accuracy in Google Translate. The neural network consisted of two main blocks, an encoder and a decoder, both of LSTM architecture with 8 1024-wide layers each and a simple 1-layer 1024-wide feedforward attention mechanism connecting them. The total number of parameters has been variously described as over 160 million, approximately 210 million, 278 million or 380 million. It used a WordPiece tokenizer and a beam search decoding strategy. It ran on Tensor Processing Units.

Paraphrase or paraphrasing in computational linguistics is the natural language processing task of detecting and generating paraphrases. Applications of paraphrasing are varied, including information retrieval, question answering, text summarization, and plagiarism detection. Paraphrasing is also useful in the evaluation of machine translation, as well as semantic parsing and generation of new samples to expand existing corpora.

<span class="mw-page-title-main">Transformer (deep learning architecture)</span> Deep learning architecture for modelling sequential data

A transformer is a deep learning architecture developed by researchers at Google and based on the multi-head attention mechanism, proposed in the 2017 paper "Attention Is All You Need". Text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished.

Bidirectional encoder representations from transformers (BERT) is a language model introduced in October 2018 by researchers at Google. It learns to represent text as a sequence of vectors using self-supervised learning. It uses the encoder-only transformer architecture. It is notable for its dramatic improvement over previous state-of-the-art models, and as an early example of a large language model. As of 2020, BERT is a ubiquitous baseline in natural language processing (NLP) experiments.

<span class="mw-page-title-main">Seq2seq</span> Family of machine learning approaches

Seq2seq is a family of machine learning approaches used for natural language processing. Applications include language translation, image captioning, conversational models, and text summarization. Seq2seq uses sequence transformation: it turns one sequence into another sequence.

<span class="mw-page-title-main">ELMo</span> Word embedding system

ELMo is a word embedding method for representing a sequence of words as a corresponding sequence of vectors. It was created by researchers at the Allen Institute for Artificial Intelligence and the University of Washington and first released in February 2018. It is a bidirectional LSTM that takes character-level tokens as input and produces word-level embeddings, trained on a corpus of about 30 million sentences and 1 billion words.

<span class="mw-page-title-main">Attention (machine learning)</span> Machine learning technique

Attention is a machine learning method that determines the relative importance of each component in a sequence relative to the other components in that sequence. In natural language processing, importance is represented by "soft" weights assigned to each word in a sentence. More generally, attention encodes vectors called token embeddings across a fixed-width sequence that can range from tens to millions of tokens in size.

<span class="mw-page-title-main">Contrastive Language-Image Pre-training</span> Technique in neural networks for learning joint representations of text and images

Contrastive Language-Image Pre-training (CLIP) is a technique for training a pair of neural network models, one for image understanding and one for text understanding, using a contrastive objective. This method has enabled broad applications across multiple domains, including cross-modal retrieval, text-to-image generation, aesthetic ranking, and image captioning.

<span class="mw-page-title-main">Vision transformer</span> Machine learning model for vision processing

A vision transformer (ViT) is a transformer designed for computer vision. A ViT decomposes an input image into a series of patches, serializes each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.

Hugging Face, Inc. is an American company incorporated under the Delaware General Corporation Law and based in New York City that develops computation tools for building applications using machine learning. It is most notable for its transformers library built for natural language processing applications and its platform that allows users to share machine learning models and datasets and showcase their work.

<span class="mw-page-title-main">Stable Diffusion</span> Image-generating machine learning model

Stable Diffusion is a deep learning, text-to-image model released in 2022 based on diffusion techniques. The generative artificial intelligence technology is the premier product of Stability AI and is considered to be a part of the ongoing artificial intelligence boom.

A large language model (LLM) is a type of computational model designed for natural language processing tasks such as language generation. As language models, LLMs acquire their abilities by learning statistical relationships from vast amounts of text during a self-supervised and semi-supervised training process.

In deep learning, fine-tuning is an approach to transfer learning in which the parameters of a pre-trained neural network model are trained on new data. Fine-tuning can be done on the entire neural network, or on only a subset of its layers, in which case the layers that are not being fine-tuned are "frozen". A model may also be augmented with "adapters" that consist of far fewer parameters than the original model, and fine-tuned in a parameter-efficient way by tuning the weights of the adapters and leaving the rest of the model's weights frozen.

<span class="mw-page-title-main">Neural scaling law</span> Law in machine learning

In machine learning, a neural scaling law is an empirical scaling law that describes how neural network performance changes as key factors are scaled up or down. These factors typically include the number of parameters, training dataset size, and training cost.

Whisper is a machine learning model for speech recognition and transcription, created by OpenAI and first released as open-source software in September 2022.

<span class="mw-page-title-main">Attention Is All You Need</span> 2017 research paper by Google

"Attention Is All You Need" is a 2017 landmark research paper in machine learning authored by eight scientists working at Google. The paper introduced a new deep learning architecture known as the transformer, based on the attention mechanism proposed in 2014 by Bahdanau et al. It is considered a foundational paper in modern artificial intelligence, as the transformer approach has become the main architecture of large language models like those based on GPT. At the time, the focus of the research was on improving Seq2seq techniques for machine translation, but the authors go further in the paper, foreseeing the technique's potential for other tasks like question answering and what is now known as multimodal Generative AI.

XLNet was an autoregressive Transformer designed as an improvement over BERT, with 340M parameters and trained on 33 billion words. It was released on 19 June 2019 under the Apache 2.0 license. It achieved state-of-the-art results on a variety of natural language processing tasks, including language modeling, question answering, and natural language inference.

The Latent Diffusion Model (LDM) is a diffusion model architecture developed by the CompVis group at LMU Munich.

References

  1. Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". Journal of Machine Learning Research. 21 (140): 1–67. arXiv:1910.10683. ISSN 1533-7928.
  2. google-research/text-to-text-transfer-transformer, Google Research, 2024-08-21, retrieved 2024-08-21
  3. Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need". Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.
  4. Jiang, Yunfan; Gupta, Agrim; Zhang, Zichen; Wang, Guanzhi; Dou, Yongqiang; Chen, Yanjun; Fei-Fei, Li; Anandkumar, Anima; Zhu, Yuke (2022-10-06). "VIMA: General Robot Manipulation with Multimodal Prompts". arXiv: 2210.03094 [cs.RO].
  5. Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "11.9. Large-Scale Pretraining with Transformers". Dive into deep learning. Cambridge New York Port Melbourne New Delhi Singapore: Cambridge University Press. ISBN 978-1-009-38943-3.
  6. "config.json · google-t5/t5-11b at main". huggingface.co. 2020-04-24. Retrieved 2024-09-17.
  7. Shaw, Peter; Uszkoreit, Jakob; Vaswani, Ashish (2018-04-12), Self-Attention with Relative Position Representations, arXiv: 1803.02155
  8. 1 2 "t5x/docs/models.md at main · google-research/t5x". GitHub. Retrieved 2024-08-05.
  9. Shazeer, Noam (2020-02-12), GLU Variants Improve Transformer, arXiv: 2002.05202 , retrieved 2024-10-16
  10. "config.json · google/t5-v1_1-xl at main". huggingface.co. 2020-11-19. Retrieved 2024-09-17.
  11. "config.json · google/t5-v1_1-xxl at main". huggingface.co. 2020-11-19. Retrieved 2024-09-17.
  12. Lester, Brian; Al-Rfou, Rami; Constant, Noah (2021-09-02), The Power of Scale for Parameter-Efficient Prompt Tuning, arXiv: 2104.08691
  13. Fedus, William; Zoph, Barret; Shazeer, Noam (2022-06-16), Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, arXiv: 2101.03961
  14. "SwitchTransformers". huggingface.co. Retrieved 2024-08-05.
  15. Sanh, Victor; Webson, Albert; Raffel, Colin; Bach, Stephen H.; Sutawika, Lintang; Alyafeai, Zaid; Chaffin, Antoine; Stiegler, Arnaud; Scao, Teven Le (2022-03-17), Multitask Prompted Training Enables Zero-Shot Task Generalization, arXiv: 2110.08207
  16. "bigscience/T0 · Hugging Face". huggingface.co. 2024-03-04. Retrieved 2024-08-21.
  17. Xue, Linting; Barua, Aditya; Constant, Noah; Al-Rfou, Rami; Narang, Sharan; Kale, Mihir; Roberts, Adam; Raffel, Colin (2022-03-25). "ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models". Transactions of the Association for Computational Linguistics. 10: 291–306. arXiv: 2105.13626 . doi:10.1162/tacl_a_00461. ISSN   2307-387X.
  18. Chung, Hyung Won; Hou, Le; Longpre, Shayne; Zoph, Barret; Tay, Yi; Fedus, William; Li, Yunxuan; Wang, Xuezhi; Dehghani, Mostafa; Brahma, Siddhartha; Webson, Albert; Gu, Shixiang Shane; Dai, Zhuyun; Suzgun, Mirac; Chen, Xinyun (2024). "Scaling Instruction-Finetuned Language Models". Journal of Machine Learning Research. 25 (70): 1–53. arXiv: 2210.11416 . ISSN   1533-7928.
  19. Longpre, Shayne; Hou, Le; Vu, Tu; Webson, Albert; Chung, Hyung Won; Tay, Yi; Zhou, Denny; Le, Quoc V.; Zoph, Barret; Wei, Jason; Roberts, Adam (2023-07-03). "The Flan Collection: Designing Data and Methods for Effective Instruction Tuning". Proceedings of the 40th International Conference on Machine Learning. PMLR: 22631–22648. arXiv: 2301.13688 .
  20. google-research/FLAN, Google Research, 2024-08-03, retrieved 2024-08-05
  21. "google/flan-t5-xl · Hugging Face". huggingface.co. 2024-01-04. Retrieved 2024-08-05.
  22. Roberts, Adam; Chung, Hyung Won; Mishra, Gaurav; Levskaya, Anselm; Bradbury, James; Andor, Daniel; Narang, Sharan; Lester, Brian; Gaffney, Colin; Mohiuddin, Afroz; Hawthorne, Curtis; Lewkowycz, Aitor; Salcianu, Alex; Zee, Marc van; Austin, Jacob (2023). "Scaling Up Models and Data with t5x and seqio". Journal of Machine Learning Research. 24 (377): 1–8. ISSN   1533-7928.
  23. Tay, Yi; Dehghani, Mostafa; Tran, Vinh Q.; Garcia, Xavier; Wei, Jason; Wang, Xuezhi; Chung, Hyung Won; Shakeri, Siamak; Bahri, Dara (2023-02-28), UL2: Unifying Language Learning Paradigms, arXiv:2205.05131
  24. "Training great LLMs entirely from ground up in the wilderness as a startup". Yi Tay. Retrieved 2024-10-18.
  25. Sutawika, Lintang; Komatsuzaki, Aran; Raffel, Colin (2024-04-15). "Pile-T5". EleutherAI Blog. Retrieved 2024-05-05.
  26. "Imagen: Text-to-Image Diffusion Models". imagen.research.google. Retrieved 2024-08-23.
  27. "AuraFlow". huggingface.co. Retrieved 2024-08-23.

Notes

  1. # Script used to count the parameters of the original T5 checkpoints.
     import torch
     from transformers import AutoConfig, AutoModelForSeq2SeqLM

     def count_parameters(model):
         # Count encoder and decoder parameters separately, then sum them.
         enc = sum(p.numel() for p in model.encoder.parameters())
         dec = sum(p.numel() for p in model.decoder.parameters())
         total = enc + dec
         return total, enc, dec

     for name in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
         print(f"Model: {name}")
         config = AutoConfig.from_pretrained(f"google-t5/{name}")
         torch_dtype = torch.float16
         model = AutoModelForSeq2SeqLM.from_config(config, torch_dtype=torch_dtype)
         total, enc, dec = count_parameters(model)
         print(f"Total number of parameters in {name}: {total}")
         print(f"Total number of parameters in encoder: {enc}")
         print(f"Total number of parameters in decoder: {dec}")
         del model
  2. # Same script for the T5 v1.1 checkpoints.
     import torch
     from transformers import AutoConfig, AutoModelForSeq2SeqLM

     def count_parameters(model):
         enc = sum(p.numel() for p in model.encoder.parameters())
         dec = sum(p.numel() for p in model.decoder.parameters())
         total = enc + dec
         return total, enc, dec

     for name in ["small", "base", "large", "xl", "xxl"]:
         print(f"Model: {name}")
         config = AutoConfig.from_pretrained(f"google/t5-v1_1-{name}")
         torch_dtype = torch.float16
         model = AutoModelForSeq2SeqLM.from_config(config, torch_dtype=torch_dtype)
         total, enc, dec = count_parameters(model)
         print(f"Total number of parameters in {name}: {total}")
         print(f"Total number of parameters in encoder: {enc}")
         print(f"Total number of parameters in decoder: {dec}")
         del model