The Pile (dataset)

The Pile
Size: 886.03 GB
Type: Open-source
Language: English
Creator(s): EleutherAI
Date of release: December 31, 2020
Main application(s): Training large language models

The Pile is an 886 GB open-source dataset of diverse English text created as a training dataset for large language models (LLMs). It was constructed by EleutherAI in 2020 and publicly released on December 31 of that year. [1] [2] It is composed of 22 sub-datasets. [3]

Creation

Training LLMs requires such vast amounts of data that, before the introduction of the Pile, most data used for training LLMs was taken from the Common Crawl. [4] However, LLMs trained on more diverse datasets are better able to handle a wider range of situations after training. [5] The creation of the Pile was motivated by the need for a dataset large enough to encompass a wide variety of sources and styles of writing. [1] Compared to other datasets as of 2022, the Pile's main distinguishing features were that it was a curated selection of data, chosen by researchers at EleutherAI to contain information they thought language models should learn, and that it was the only such dataset thoroughly documented by the researchers who developed it. [6]

Contents and filtering

Machine learning algorithms do not learn all they can from data on a single pass, so it is common practice to train a model on the same data more than once, with each pass through the entire dataset referred to as an "epoch". [7] To reflect differences in the perceived quality of the 22 sub-datasets that make up the Pile, each was assigned its own number of epochs, which determines the relative frequency at which samples are drawn from that sub-dataset. [1] The table below shows the relative size of each of the 22 sub-datasets before and after being multiplied by the number of epochs, and a minimal sampling sketch follows the table. Sizes have been converted to GB, and asterisks mark the newly introduced datasets.

Sub-datasets of the Pile [1] [8]
Component | Original size (GB) | Epochs | Effective size (GB)
Pile-CC | 243.87 | 1 | 243.87
PubMed Central* | 96.93 | 2 | 193.86
Books3 | 108.40 | 1.5 | 162.61
OpenWebText2* | 67.40 | 2 | 134.80
arXiv* | 60.36 | 2 | 120.71
GitHub* | 102.18 | 1 | 102.18
Free Law* | 54.92 | 1.5 | 82.39
Stack Exchange* | 34.57 | 2 | 69.14
USPTO Backgrounds* | 24.59 | 2 | 49.19
PubMed Abstracts* | 20.68 | 2 | 41.37
Gutenberg (PG-19) | 11.68 | 2.5 | 29.20
OpenSubtitles | 13.94 | 1.5 | 20.91
Wikipedia | 6.85 | 3 | 20.54
DeepMind Mathematics | 8.32 | 2 | 16.63
Ubuntu Freenode IRC logs* | 5.93 | 2 | 11.84
BookCorpus2* | 6.76 | 1.5 | 10.15
EuroParl | 4.93 | 2 | 9.85
Hacker News* | 4.19 | 2 | 8.38
YouTube Subtitles* | 4.01 | 2 | 8.02
PhilPapers* | 2.56 | 2 | 5.11
NIH ExPorter* | 2.03 | 2 | 4.07
Enron Emails | 0.95 | 2 | 1.89
Total | 886.03 | | 1,346.69
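
The sketch below shows, in Python, how epoch weighting translates into sampling frequency: each training sample is drawn from a sub-dataset with probability proportional to its effective size. The values come from the table above (only a few entries are shown), the function and variable names are illustrative, and this is not EleutherAI's actual training pipeline.

    import random

    # Effective size (GB) = original size x epochs; values from the
    # table above (a few entries only, for illustration).
    effective_sizes = {
        "Pile-CC": 243.87,
        "PubMed Central": 193.86,
        "Books3": 162.61,
        "OpenWebText2": 134.80,
        "Wikipedia": 20.54,
    }

    def sample_source(rng: random.Random) -> str:
        """Pick the sub-dataset a training sample is drawn from, with
        probability proportional to its effective size."""
        names = list(effective_sizes)
        weights = [effective_sizes[n] for n in names]
        return rng.choices(names, weights=weights, k=1)[0]

    rng = random.Random(0)
    print([sample_source(rng) for _ in range(5)])

Under this weighting, a sub-dataset assigned 2 epochs is sampled twice as often, per gigabyte of original text, as one assigned 1 epoch.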

EleutherAI chose the datasets to try to cover a wide range of topics and styles of writing, including academic writing, which models trained on other datasets were found to struggle with. [1]

All data used in the Pile was taken from publicly accessible sources. EleutherAI then filtered the dataset as a whole to remove duplicates. Some sub-datasets were also filtered for quality control. Most notably, the Pile-CC is a modified version of the Common Crawl in which the data was filtered to remove parts that are not text, such as HTML formatting and links. [1]
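
As a simplified illustration of those two steps, the sketch below strips HTML tags with Python's standard-library parser and removes exact duplicates by hashing document text. EleutherAI's actual pipeline was more sophisticated (for example, using fuzzy deduplication); the class and function names here are illustrative.

    import hashlib
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect only the text nodes of an HTML page, skipping the
        contents of <script> and <style> elements."""
        def __init__(self):
            super().__init__()
            self._skip = 0
            self.parts = []

        def handle_starttag(self, tag, attrs):
            if tag in ("script", "style"):
                self._skip += 1

        def handle_endtag(self, tag):
            if tag in ("script", "style") and self._skip:
                self._skip -= 1

        def handle_data(self, data):
            if not self._skip and data.strip():
                self.parts.append(data.strip())

    def extract_text(html: str) -> str:
        parser = TextExtractor()
        parser.feed(html)
        return "\n".join(parser.parts)

    def deduplicate(docs: list[str]) -> list[str]:
        """Keep the first copy of each document, keyed by a hash of its text."""
        seen, unique = set(), []
        for doc in docs:
            digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(doc)
        return unique

    # The HTML page reduces to the same text as the plain document,
    # so deduplication keeps only one copy.
    docs = [extract_text("<p>Hello <b>world</b></p>"), "Hello\nworld"]
    print(deduplicate(docs))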

Some potential sub-datasets were excluded for various reasons; the US Congressional Record, for example, was left out because of its racist content. [1]

Within the sub-datasets that were included, individual documents were not filtered to remove non-English, biased, or profane text, nor was the data filtered on the basis of consent, meaning that, for example, the Pile-CC has all of the same ethical issues as the Common Crawl itself. However, EleutherAI has documented the amount of bias (on the basis of gender, religion, and race) and profanity, as well as the level of consent given, for each of the sub-datasets, allowing an ethics-conscious researcher to use only those parts of the Pile that meet their own standards. [1]

Use

The Pile was originally developed to train EleutherAI's GPT-Neo models [9] [10] [11] [better source needed] but has become widely used to train other models, including Microsoft's Megatron-Turing Natural Language Generation, [12] [13] Meta AI's Open Pre-trained Transformers, [14] LLaMA, [15] and Galactica, [16] Stanford University's BioMedLM 2.7B, [17] the Beijing Academy of Artificial Intelligence's Chinese-Transformer-XL, [18] Yandex's YaLM 100B, [19] and Apple's OpenELM. [20]

In addition to being used as a training dataset, the Pile can also be used as a benchmark to test models and score how well they perform on a variety of writing styles. [2] [21] [22]
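
A common form of such benchmarking is measuring perplexity on held-out Pile text: the lower the perplexity, the better the model predicts that style of writing. Below is a minimal sketch using the Hugging Face transformers library; the model name points at one of EleutherAI's GPT-Neo checkpoints, and the single sample passage is a placeholder for a real held-out test split.

    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative model and text; a real evaluation would iterate over
    # a held-out Pile test split rather than a single passage.
    model_name = "EleutherAI/gpt-neo-125m"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    text = "A held-out passage from one of the Pile's sub-datasets."
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        # With labels supplied, the model returns the mean token-level
        # cross-entropy loss; exponentiating it gives perplexity.
        loss = model(**inputs, labels=inputs["input_ids"]).loss

    print(f"perplexity: {math.exp(loss.item()):.2f}")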

Training on copyrighted works or derivatives

The Books3 component of the dataset contains copyrighted material compiled from Bibliotik, a pirate website. [23] [24] In July 2023, the Danish anti-piracy group Rights Alliance took copies of the Pile down through DMCA notices. [3] Although Books3 was subsequently removed from the Pile, copies of the original dataset remained available on the web, and in 2024 three authors filed a class action lawsuit seeking damages. [24]

OpenSubtitles, another sub-dataset of the Pile, created similar controversy over the use of copyrighted works, in this case subtitles from documentaries, movies, television, and online videos. [25]

Tens of thousands of YouTube videos had their subtitles scraped directly from YouTube and included in the Pile, which YouTube argued is against its terms of service. [26]

Common Pile v0.1

In June 2025, EleutherAI, in partnership with Poolside, Hugging Face, the US Library of Congress, and over two dozen researchers at 14 institutions, including the University of Toronto, MIT, CMU, the Vector Institute, and the Allen Institute for AI, released Common Pile v0.1, a training dataset containing only works whose licenses permit their use for training AI models. [27] [28] [29] The intent is to show that it is possible to train AI systems ethically while respecting copyrighted works. [29] The researchers found that gathering the data could not be fully automated and was at times painstaking, with humans verifying and annotating every entry, and that the resulting models achieved impressive results even though they were still not comparable with frontier models. [29]

References

  1. Gao, Leo; Biderman, Stella; Black, Sid; Golding, Laurence; Hoppe, Travis; Foster, Charles; Phang, Jason; He, Horace; Thite, Anish; Nabeshima, Noa; Presser, Shawn; Leahy, Connor (31 December 2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling". arXiv: 2101.00027 [cs.CL].
  2. "The Pile: An 800GB Dataset of Diverse Text for Language Modeling". EleutherAI Website. EleutherAI. 13 February 2020. Archived from the original on 28 February 2023. Retrieved 4 June 2023.
  3. Barr, Kyle (2023-08-18). "Anti-Piracy Group Takes Massive AI Training Dataset 'Books3' Offline". Gizmodo. Retrieved 2025-12-10.
  4. "Language Models are Few-Shot Learners" (PDF). NeurIPS. 2020.
  5. Rosset, Corby (13 February 2020). "Turing-NLG: A 17-billion-parameter language model by Microsoft". Microsoft Blog. Microsoft. Retrieved 31 December 2020.
  6. Khan, Mehtab; Hanna, Alex (13 September 2022). "The Subjects and Stages of AI Dataset Development: A Framework for Dataset Accountability". SSRN 4217148. p. 20.
  7. Brownlee, Jason (10 August 2022). "Difference Between a Batch and an Epoch in a Neural Network". Archived from the original on 20 June 2019. Retrieved 2 June 2023 via machinelearningmastery.com.
  8. "The Pile Replication Code". github.com. 15 June 2021. Retrieved 29 October 2024.
  9. "GPT-Neo 125M". huggingface.co. 8 December 2022. Retrieved 7 June 2023.
  10. "GPT-Neo 1.3B". huggingface.co. 8 December 2022. Retrieved 7 June 2023.
  11. "GPT-Neo 2.7B". huggingface.co. 8 December 2022. Retrieved 7 June 2023.
  12. Wiggers, Kyle (11 October 2021). "Microsoft and Nvidia team up to train one of the world's largest language models". VentureBeat. Archived from the original on 27 March 2023. Retrieved 8 March 2023.
  13. "AI: Megatron the Transformer, and its related language models". 24 September 2021. Archived from the original on 4 March 2023. Retrieved 8 March 2023.
  14. Zhang, Susan; Roller, Stephen; Goyal, Naman; Artetxe, Mikel; Chen, Moya; Chen, Shuohui; Dewan, Christopher; Diab, Mona; Li, Xian; Lin, Xi Victoria; Mihaylov, Todor; Ott, Myle; Shleifer, Sam; Shuster, Kurt; Simig, Daniel; Koura, Punit Singh; Sridhar, Anjali; Wang, Tianlu; Zettlemoyer, Luke (21 June 2022). "OPT: Open Pre-trained Transformer Language Models". arXiv: 2205.01068 [cs.CL].
  15. Touvron, Hugo; Lavril, Thibaut; Izacard, Gautier; Grave, Edouard; Lample, Guillaume; et al. (27 February 2023). "LLaMA: Open and Efficient Foundation Language Models". arXiv: 2302.13971 [cs.CL].
  16. Taylor, Ross; Kardas, Marcin; Cucurull, Guillem; Scialom, Thomas; Hartshorn, Anthony; Saravia, Elvis; Poulton, Andrew; Kerkez, Viktor; Stojnic, Robert (16 November 2022). "Galactica: A Large Language Model for Science". arXiv: 2211.09085 [cs.CL].
  17. "Model Card for BioMedLM 2.7B". huggingface.co. Archived from the original on 5 June 2023. Retrieved 5 June 2023.
  18. Yuan, Sha; Zhao, Hanyu; Du, Zhengxiao; Ding, Ming; Liu, Xiao; Cen, Yukuo; Zou, Xu; Yang, Zhilin; Tang, Jie (2021). "WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models". AI Open. 2: 65–68. doi:10.1016/j.aiopen.2021.06.001.
  19. Grabovskiy, Ilya (2022). "Yandex publishes YaLM 100B, the largest GPT-like neural network in open source" (Press release). Yandex. Retrieved 5 June 2023.
  20. Mehta, Sachin; Sekhavat, Mohammad Hossein; Cao, Qingqing; Horton, Maxwell; Jin, Yanzi; Sun, Chenfan; Mirzadeh, Iman; Najibi, Mahyar; Belenko, Dmitry (2024-05-01). "OpenELM: An Efficient Language Model Family with Open Training and Inference Framework". arXiv: 2404.14619 [cs.CL].
  21. Rae, Jack W; Borgeaud, Sebastian; Cai, Trevor; Millican, Katie; Hoffmann, Jordan; Song, Francis; Aslanides, John; Henderson, Sarah; Ring, Roman; Young, Susannah; et al. (21 January 2022). "Scaling Language Models: Methods, Analysis & Insights from Training Gopher". arXiv: 2112.11446 [cs.CL].
  22. Lieber, Opher; Sharir, Or; Lenz, Barak; Shoham, Yoav (1 August 2021). "Jurassic-1: Technical Details and Evaluation" (PDF). AI21 Labs. Retrieved 5 June 2023.
  23. Knibbs, Kate (September 4, 2023). "The Battle Over Books3 Could Change AI Forever". WIRED. Retrieved 13 October 2023.
  24. Roth, Emma (2024-08-20). "Authors sue Anthropic for training AI using pirated books". The Verge. Retrieved 2025-12-10.
  25. Deck, Andrew (January 7, 2025). "Thousands of documentaries are fueling AI models built by Apple, Meta, and Nvidia". Nieman Lab. Archived from the original on 2025-07-01. Retrieved 2025-12-10.
  26. Sato, Mia (2024-07-16). "Apple, Anthropic, and other companies used YouTube videos to train AI". The Verge. Retrieved 2025-12-10.
  27. Wiggers, Kyle (2025-06-06). "EleutherAI releases massive AI training dataset of licensed and open domain text". TechCrunch. Retrieved 2025-12-10.
  28. Eriksson, Viktor (June 9, 2025). "Eleuther AI releases 8TB collection of licensed and open training data". Computerworld. Retrieved 2025-12-10.
  29. Tiku, Nitasha; Jiménez, Andrea; Oremus, Will (June 5, 2025). "Analysis: AI firms say they can't respect copyright. These researchers tried". Washington Post.