T5 (language model)

Text-to-Text Transfer Transformer (T5)
Original author(s): Google AI
Initial release: 23 October 2019
Repository: https://github.com/google-research/text-to-text-transfer-transformer
License: Apache-2.0
Website: blog.research.google/2020/02/exploring-transfer-learning-with-t5.html

T5 (Text-to-Text Transfer Transformer) is a series of large language models developed by Google AI and introduced in 2019. [1] [2] Like the original Transformer model, [3] T5 models are encoder-decoder Transformers, where the encoder processes the input text and the decoder generates the output text.


T5 models are usually pretrained on a massive dataset of text and code, after which they can perform text-based tasks similar to those they were pretrained on. They can also be fine-tuned to perform other tasks.

T5 models have been employed in various applications, including chatbots, machine translation systems, text summarization tools, code generation, and robotics. [4]

Training

The original T5 models are pre-trained on the Colossal Clean Crawled Corpus (C4), containing text and code scraped from the internet. This pre-training process enables the models to learn general language understanding and generation abilities. T5 models can then be fine-tuned on specific downstream tasks, adapting their knowledge to perform well in various applications.

The T5 models were pretrained on many tasks, all in the format of <input text> -> <output text>.

How a T5 can be fine-tuned for a summarization task.

Some examples are:

- Restoring corrupted text: "Thank you <X> me to your party <Y> week." -> "<X> for inviting <Y> last <Z>", where <X>, <Y>, and <Z> are sentinel tokens marking the corrupted spans.
- Translation: "translate English to German: That is good." -> "Das ist gut."
- Judging the grammatical acceptability of a sentence (CoLA): "cola sentence: The course is jumping well." -> "not acceptable"
- Summarization: "summarize: <article text>" -> "<summary>"
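The first of these formats can be reproduced with the Hugging Face transformers library, which exposes the sentinel tokens as <extra_id_0>, <extra_id_1>, and so on. The following is a minimal sketch; the google-t5/t5-small checkpoint is chosen only for illustration:

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")

    # Corrupted input: spans of the original text are replaced by sentinel tokens ...
    input_ids = tokenizer(
        "Thank you <extra_id_0> me to your party <extra_id_1> week.",
        return_tensors="pt",
    ).input_ids
    # ... and the target spells out what each sentinel token stood for.
    labels = tokenizer(
        "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>",
        return_tensors="pt",
    ).input_ids

    # Pretraining minimizes the usual sequence-to-sequence cross-entropy loss.
    loss = model(input_ids=input_ids, labels=labels).loss
    print(float(loss))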

Architecture

T5 encoder-decoder structure, showing the attention structure. In the encoder self-attention (lower square), all input tokens attend to each other; in the encoder–decoder cross-attention (upper rectangle), each target token attends to all input tokens; in the decoder self-attention (upper triangle), each target token attends to present and past target tokens only (causal).

The T5 series encompasses several models with varying sizes and capabilities, all encoder-decoder Transformers, where the encoder processes the input text, and the decoder generates the output text.
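To make the attention patterns in the figure above concrete, the following is a small illustrative sketch (plain PyTorch, not taken from the T5 codebase) of the boolean attention masks for an input of length m and a target of length n:

    import torch

    m, n = 5, 4  # input (encoder) length and target (decoder) length

    # Encoder self-attention: every input token may attend to every input token.
    encoder_self_mask = torch.ones(m, m).bool()

    # Encoder-decoder cross-attention: every target token may attend to every input token.
    cross_mask = torch.ones(n, m).bool()

    # Decoder self-attention: causal, so each target token may attend only to
    # itself and to earlier target tokens.
    decoder_self_mask = torch.tril(torch.ones(n, n)).bool()

    print(decoder_self_mask)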

These models are often distinguished by their parameter count, which indicates the complexity and potential capacity of the model. The original paper [1] reported the following 5 models:

T5 properties [note 1]
Name  | Total parameters | Encoder parameters | Decoder parameters | n_layer | d_model | d_ff  | d_kv | n_head
Small | 76,956,160       | 35,330,816         | 41,625,344         | 6       | 512     | 2048  | 64   | 8
Base  | 247,577,856      | 109,628,544        | 137,949,312        | 12      | 768     | 3072  | 64   | 12
Large | 770,567,168      | 334,939,648        | 435,627,520        | 24      | 1024    | 4096  | 64   | 16
3B    | 2,884,497,408    | 1,240,909,824      | 1,643,587,584      | 24      | 1024    | 16384 | 128  | 32
11B   | 11,340,220,416   | 4,864,791,552      | 6,475,428,864      | 24      | 1024    | 65536 | 128  | 128

*The encoder and the decoder have the same shape. So for example, the T5-small has 6 layers in the encoder and 6 layers in the decoder.

In the above table,

- n_layer is the number of Transformer blocks in the encoder (the decoder has the same number);
- d_model is the dimension of the token embedding vectors;
- d_ff is the inner dimension of the feedforward network in each block;
- d_kv is the dimension of the key and value vectors in each attention head;
- n_head is the number of attention heads.

Note that unlike typical Transformers, the 3B and 11B models do not satisfy d_model = d_kv × n_head. [6]
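This can be checked against the released configuration files, for example with the Hugging Face transformers library (a minimal sketch; the google-t5 model names are those used on the Hugging Face Hub):

    from transformers import AutoConfig

    # Compare d_model with d_kv * n_head for two published checkpoints.
    for name in ["google-t5/t5-small", "google-t5/t5-3b"]:
        cfg = AutoConfig.from_pretrained(name)
        print(name, cfg.d_model, cfg.d_kv * cfg.num_heads)
    # t5-small follows the usual convention (512 = 64 * 8),
    # while t5-3b does not (1024 vs. 128 * 32 = 4096).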

Compared to the original Transformer, T5 uses a few minor modifications: layer normalization with no additive bias, placement of the layer normalization outside the residual path, and relative positional embeddings. [7]
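As an illustration of the first two modifications, the following is a minimal PyTorch sketch (not the reference implementation; T5's released layer normalization also omits the mean-subtraction step, so it is RMS-style):

    import torch
    import torch.nn as nn

    class T5StyleLayerNorm(nn.Module):
        """Scale-only normalization: no additive bias (and no mean subtraction)."""
        def __init__(self, d_model, eps=1e-6):
            super().__init__()
            self.weight = nn.Parameter(torch.ones(d_model))
            self.eps = eps

        def forward(self, x):
            variance = x.pow(2).mean(-1, keepdim=True)
            return self.weight * x * torch.rsqrt(variance + self.eps)

    def sublayer(x, block, norm):
        # The normalization sits outside the residual path: the residual branch
        # adds the unnormalized input back onto the sublayer output.
        return x + block(norm(x))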

For all experiments, they used a SentencePiece tokenizer (producing WordPiece tokens) with a vocabulary size of 32,000, shared across both the input and output of each model. It was trained on a mixture of English, German, French, and Romanian data from the C4 dataset, at a ratio of 10:1:1:1.
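A brief sketch of the shared tokenizer in use, assuming the Hugging Face transformers library and the google-t5/t5-small checkpoint:

    from transformers import AutoTokenizer

    # The same tokenizer is applied to model inputs and model outputs.
    tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")

    source = "translate English to German: That is good."
    target = "Das ist gut."

    print(tokenizer.tokenize(source))  # subword pieces of the English input
    print(tokenizer.tokenize(target))  # the German target uses the same vocabulary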

Variants

Several subsequent models used the T5 architecture, with non-standardized naming conventions used to differentiate them. This section attempts to collect the main ones. An exhaustive list of the variants released by Google Brain is on the GitHub repo for T5X. [8]

Some models are trained from scratch, while others are initialized from a previously trained model. Unless otherwise noted, each model is trained from scratch.

T5 v1.1 properties [note 2]
Name  | Total parameters | Encoder parameters | Decoder parameters | n_layer | d_model | d_ff  | d_kv | n_head
Small | 76,961,152       | 35,332,800         | 41,628,352         | 8       | 512     | 1024  | 64   | 6
Base  | 247,577,856      | 109,628,544        | 137,949,312        | 12      | 768     | 2048  | 64   | 12
Large | 783,150,080      | 341,231,104        | 441,918,976        | 24      | 1024    | 2816  | 64   | 16
XL    | 2,849,757,184    | 1,223,527,424      | 1,626,229,760      | 24      | 2048    | 5120  | 64   | 32
XXL   | 11,135,332,352   | 4,762,310,656      | 6,373,021,696      | 24      | 4096    | 10240 | 64   | 64

Applications

The T5 model itself is an encoder-decoder model, allowing it to be used for instruction following. The encoder encodes the instruction, and the decoder autoregressively generates the reply.
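A minimal sketch of this usage with the Hugging Face transformers library (the google-t5/t5-small checkpoint and the translation prefix are chosen only for illustration; instruction-tuned variants such as Flan-T5 are used the same way):

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")

    # The encoder reads the prefixed instruction ...
    inputs = tokenizer("translate English to German: The house is wonderful.",
                       return_tensors="pt")

    # ... and the decoder autoregressively generates the reply.
    output_ids = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))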

The T5 encoder can be used as a text encoder, much like BERT. It encodes a text into a sequence of real-number vectors, which can be used for downstream applications. For example, Google Imagen [26] uses T5-XXL as its text encoder, and the encoded text vectors are used to condition a diffusion model. As another example, the AuraFlow diffusion model [27] uses Pile-T5-XL.
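A minimal sketch of encoder-only use with the Hugging Face transformers library (Imagen and AuraFlow rely on much larger checkpoints; google-t5/t5-small is used here only to keep the example small):

    import torch
    from transformers import AutoTokenizer, T5EncoderModel

    tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
    encoder = T5EncoderModel.from_pretrained("google-t5/t5-small")

    inputs = tokenizer("A photograph of a corgi riding a bicycle", return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)

    # One real-valued vector per input token; a downstream model (for example a
    # diffusion model) can consume this sequence as conditioning.
    print(outputs.last_hidden_state.shape)  # (batch, sequence_length, d_model)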

Related Research Articles

Multimodal learning is a type of deep learning that integrates and processes multiple types of data, referred to as modalities, such as text, audio, images, or video. This integration allows for a more holistic understanding of complex data, improving model performance in tasks like visual question answering, cross-modal retrieval, text-to-image generation, aesthetic ranking, and image captioning.

Neural machine translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model.

Google Neural Machine Translation (GNMT) was a neural machine translation (NMT) system developed by Google and introduced in November 2016 that used an artificial neural network to increase fluency and accuracy in Google Translate. The neural network consisted of two main blocks, an encoder and a decoder, both of LSTM architecture with 8 1024-wide layers each and a simple 1-layer 1024-wide feedforward attention mechanism connecting them. The total number of parameters has been variously described as over 160 million, approximately 210 million, 278 million or 380 million. It used a WordPiece tokenizer and a beam search decoding strategy. It ran on Tensor Processing Units.

Paraphrase or paraphrasing in computational linguistics is the natural language processing task of detecting and generating paraphrases. Applications of paraphrasing are varied, including information retrieval, question answering, text summarization, and plagiarism detection. Paraphrasing is also useful in the evaluation of machine translation, as well as semantic parsing and generation of new samples to expand existing corpora.

<span class="mw-page-title-main">Transformer (deep learning architecture)</span> Deep learning architecture for modelling sequential data

A transformer is a deep learning architecture developed by researchers at Google and based on the multi-head attention mechanism, proposed in the 2017 paper "Attention Is All You Need". Text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished.

Bidirectional encoder representations from transformers (BERT) is a language model introduced in October 2018 by researchers at Google. It learns to represent text as a sequence of vectors using self-supervised learning. It uses the encoder-only transformer architecture. It is notable for its dramatic improvement over previous state-of-the-art models, and as an early example of a large language model. As of 2020, BERT is a ubiquitous baseline in natural language processing (NLP) experiments.

<span class="mw-page-title-main">Seq2seq</span> Family of machine learning approaches

Seq2seq is a family of machine learning approaches used for natural language processing. Applications include language translation, image captioning, conversational models, and text summarization. Seq2seq uses sequence transformation: it turns one sequence into another sequence.

<span class="mw-page-title-main">ELMo</span> Word embedding system

ELMo is a word embedding method for representing a sequence of words as a corresponding sequence of vectors. It was created by researchers at the Allen Institute for Artificial Intelligence and the University of Washington and first released in February 2018. It is a bidirectional LSTM that takes character-level tokens as input and produces word-level embeddings, trained on a corpus of about 30 million sentences and 1 billion words.

<span class="mw-page-title-main">Attention (machine learning)</span> Machine learning technique

Attention is a machine learning method that determines the relative importance of each component in a sequence relative to the other components in that sequence. In natural language processing, importance is represented by "soft" weights assigned to each word in a sentence. More generally, attention encodes vectors called token embeddings across a fixed-width sequence that can range from tens to millions of tokens in size.

<span class="mw-page-title-main">Contrastive Language-Image Pre-training</span> Technique in neural networks for learning joint representations of text and images

Contrastive Language-Image Pre-training (CLIP) is a technique for training a pair of neural network models, one for image understanding and one for text understanding, using a contrastive objective. This method has enabled broad applications across multiple domains, including cross-modal retrieval, text-to-image generation, aesthetic ranking, and image captioning.

<span class="mw-page-title-main">Vision transformer</span> Machine learning model for vision processing

A vision transformer (ViT) is a transformer designed for computer vision. A ViT decomposes an input image into a series of patches, serializes each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.

Hugging Face, Inc. is an American company incorporated under the Delaware General Corporation Law and based in New York City that develops computation tools for building applications using machine learning. It is most notable for its transformers library built for natural language processing applications and its platform that allows users to share machine learning models and datasets and showcase their work.

<span class="mw-page-title-main">Stable Diffusion</span> Image-generating machine learning model

Stable Diffusion is a deep learning, text-to-image model released in 2022 based on diffusion techniques. The generative artificial intelligence technology is the premier product of Stability AI and is considered to be a part of the ongoing artificial intelligence boom.

A large language model (LLM) is a type of computational model designed for natural language processing tasks such as language generation. As language models, LLMs acquire their abilities by learning statistical relationships from vast amounts of text during a self-supervised and semi-supervised training process.

In deep learning, fine-tuning is an approach to transfer learning in which the parameters of a pre-trained neural network model are trained on new data. Fine-tuning can be done on the entire neural network, or on only a subset of its layers, in which case the layers that are not being fine-tuned are "frozen". A model may also be augmented with "adapters" that consist of far fewer parameters than the original model, and fine-tuned in a parameter-efficient way by tuning the weights of the adapters and leaving the rest of the model's weights frozen.

<span class="mw-page-title-main">Neural scaling law</span> Law in machine learning

In machine learning, a neural scaling law is an empirical scaling law that describes how neural network performance changes as key factors are scaled up or down. These factors typically include the number of parameters, training dataset size, and training cost.

Whisper is a machine learning model for speech recognition and transcription, created by OpenAI and first released as open-source software in September 2022.

<span class="mw-page-title-main">Attention Is All You Need</span> 2017 research paper by Google

"Attention Is All You Need" is a 2017 landmark research paper in machine learning authored by eight scientists working at Google. The paper introduced a new deep learning architecture known as the transformer, based on the attention mechanism proposed in 2014 by Bahdanau et al. It is considered a foundational paper in modern artificial intelligence, as the transformer approach has become the main architecture of large language models like those based on GPT. At the time, the focus of the research was on improving Seq2seq techniques for machine translation, but the authors go further in the paper, foreseeing the technique's potential for other tasks like question answering and what is now known as multimodal Generative AI.

XLNet was an autoregressive Transformer designed as an improvement over BERT, with 340M parameters and trained on 33 billion words. It was released on 19 June 2019 under the Apache 2.0 license. It achieved state-of-the-art results on a variety of natural language processing tasks, including language modeling, question answering, and natural language inference.

The Latent Diffusion Model (LDM) is a diffusion model architecture developed by the CompVis group at LMU Munich.

References

  1. Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". Journal of Machine Learning Research. 21 (140): 1–67. arXiv:1910.10683. ISSN 1533-7928.
  2. google-research/text-to-text-transfer-transformer, Google Research, 2024-08-21, retrieved 2024-08-21
  3. Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need". Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.
  4. Jiang, Yunfan; Gupta, Agrim; Zhang, Zichen; Wang, Guanzhi; Dou, Yongqiang; Chen, Yanjun; Fei-Fei, Li; Anandkumar, Anima; Zhu, Yuke (2022-10-06). "VIMA: General Robot Manipulation with Multimodal Prompts". arXiv: 2210.03094 [cs.RO].
  5. Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "11.9. Large-Scale Pretraining with Transformers". Dive into deep learning. Cambridge New York Port Melbourne New Delhi Singapore: Cambridge University Press. ISBN 978-1-009-38943-3.
  6. "config.json · google-t5/t5-11b at main". huggingface.co. 2020-04-24. Retrieved 2024-09-17.
  7. Shaw, Peter; Uszkoreit, Jakob; Vaswani, Ashish (2018-04-12), Self-Attention with Relative Position Representations, arXiv: 1803.02155
  8. 1 2 "t5x/docs/models.md at main · google-research/t5x". GitHub. Retrieved 2024-08-05.
  9. Shazeer, Noam (2020-02-12), GLU Variants Improve Transformer, arXiv: 2002.05202 , retrieved 2024-10-16
  10. "config.json · google/t5-v1_1-xl at main". huggingface.co. 2020-11-19. Retrieved 2024-09-17.
  11. "config.json · google/t5-v1_1-xxl at main". huggingface.co. 2020-11-19. Retrieved 2024-09-17.
  12. Lester, Brian; Al-Rfou, Rami; Constant, Noah (2021-09-02), The Power of Scale for Parameter-Efficient Prompt Tuning, arXiv: 2104.08691
  13. Fedus, William; Zoph, Barret; Shazeer, Noam (2022-06-16), Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, arXiv: 2101.03961
  14. "SwitchTransformers". huggingface.co. Retrieved 2024-08-05.
  15. Sanh, Victor; Webson, Albert; Raffel, Colin; Bach, Stephen H.; Sutawika, Lintang; Alyafeai, Zaid; Chaffin, Antoine; Stiegler, Arnaud; Scao, Teven Le (2022-03-17), Multitask Prompted Training Enables Zero-Shot Task Generalization, arXiv: 2110.08207
  16. "bigscience/T0 · Hugging Face". huggingface.co. 2024-03-04. Retrieved 2024-08-21.
  17. Xue, Linting; Barua, Aditya; Constant, Noah; Al-Rfou, Rami; Narang, Sharan; Kale, Mihir; Roberts, Adam; Raffel, Colin (2022-03-25). "ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models". Transactions of the Association for Computational Linguistics. 10: 291–306. arXiv: 2105.13626 . doi:10.1162/tacl_a_00461. ISSN   2307-387X.
  18. Chung, Hyung Won; Hou, Le; Longpre, Shayne; Zoph, Barret; Tay, Yi; Fedus, William; Li, Yunxuan; Wang, Xuezhi; Dehghani, Mostafa; Brahma, Siddhartha; Webson, Albert; Gu, Shixiang Shane; Dai, Zhuyun; Suzgun, Mirac; Chen, Xinyun (2024). "Scaling Instruction-Finetuned Language Models". Journal of Machine Learning Research. 25 (70): 1–53. arXiv: 2210.11416 . ISSN   1533-7928.
  19. Longpre, Shayne; Hou, Le; Vu, Tu; Webson, Albert; Chung, Hyung Won; Tay, Yi; Zhou, Denny; Le, Quoc V.; Zoph, Barret; Wei, Jason; Roberts, Adam (2023-07-03). "The Flan Collection: Designing Data and Methods for Effective Instruction Tuning". Proceedings of the 40th International Conference on Machine Learning. PMLR: 22631–22648. arXiv: 2301.13688 .
  20. google-research/FLAN, Google Research, 2024-08-03, retrieved 2024-08-05
  21. "google/flan-t5-xl · Hugging Face". huggingface.co. 2024-01-04. Retrieved 2024-08-05.
  22. Roberts, Adam; Chung, Hyung Won; Mishra, Gaurav; Levskaya, Anselm; Bradbury, James; Andor, Daniel; Narang, Sharan; Lester, Brian; Gaffney, Colin; Mohiuddin, Afroz; Hawthorne, Curtis; Lewkowycz, Aitor; Salcianu, Alex; Zee, Marc van; Austin, Jacob (2023). "Scaling Up Models and Data with t5x and seqio". Journal of Machine Learning Research. 24 (377): 1–8. ISSN   1533-7928.
  23. Tay, Yi; Dehghani, Mostafa; Tran, Vinh Q.; Garcia, Xavier; Wei, Jason; Wang, Xuezhi; Chung, Hyung Won; Shakeri, Siamak; Bahri, Dara (2023-02-28), UL2: Unifying Language Learning Paradigms, arXiv:2205.05131
  24. "Training great LLMs entirely from ground up in the wilderness as a startup". Yi Tay. Retrieved 2024-10-18.
  25. Sutawika, Lintang; Komatsuzaki, Aran; Raffel, Colin (2024-04-15). "Pile-T5". EleutherAI Blog. Retrieved 2024-05-05.
  26. "Imagen: Text-to-Image Diffusion Models". imagen.research.google. Retrieved 2024-08-23.
  27. "AuraFlow". huggingface.co. Retrieved 2024-08-23.

Notes

  1. # Script used to count the parameters of the original T5 checkpoints.
     import torch
     from transformers import AutoConfig, AutoModelForSeq2SeqLM

     def count_parameters(model):
         # Count encoder and decoder parameters separately, then sum them.
         enc = sum(p.numel() for p in model.encoder.parameters())
         dec = sum(p.numel() for p in model.decoder.parameters())
         total = enc + dec
         return total, enc, dec

     for name in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
         print(f"Model: {name}")
         config = AutoConfig.from_pretrained(f"google-t5/{name}")
         torch_dtype = torch.float16
         model = AutoModelForSeq2SeqLM.from_config(config, torch_dtype=torch_dtype)
         total, enc, dec = count_parameters(model)
         print(f"Total number of parameters in {name}: {total}")
         print(f"Total number of parameters in encoder: {enc}")
         print(f"Total number of parameters in decoder: {dec}")
         del model
  2. # Same script for the T5 v1.1 checkpoints.
     import torch
     from transformers import AutoConfig, AutoModelForSeq2SeqLM

     def count_parameters(model):
         enc = sum(p.numel() for p in model.encoder.parameters())
         dec = sum(p.numel() for p in model.decoder.parameters())
         total = enc + dec
         return total, enc, dec

     for name in ["small", "base", "large", "xl", "xxl"]:
         print(f"Model: {name}")
         config = AutoConfig.from_pretrained(f"google/t5-v1_1-{name}")
         torch_dtype = torch.float16
         model = AutoModelForSeq2SeqLM.from_config(config, torch_dtype=torch_dtype)
         total, enc, dec = count_parameters(model)
         print(f"Total number of parameters in {name}: {total}")
         print(f"Total number of parameters in encoder: {enc}")
         print(f"Total number of parameters in decoder: {dec}")
         del model