The Pile (dataset)

The Pile
Size: 886.03 GB
Type: Open-source dataset
Language: English
Creator(s): EleutherAI
Date of release: December 31, 2020
Main application(s): Training large language models

The Pile is an 886 GB diverse, open-source dataset of English text created as a training dataset for large language models (LLMs). It was constructed by EleutherAI in 2020 and publicly released on December 31 of that year. It is composed of 22 component sub-datasets.[1] As of 2024, The Pile and Common Crawl were the two main training datasets used to train AI models.[2][3]
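The Pile was distributed as compressed JSON Lines files, and community mirrors remain available on the Hugging Face Hub. The following is a minimal sketch of inspecting a few records with the Hugging Face datasets library; the "monology/pile-uncopyrighted" mirror id and the "text"/"meta" record fields are assumptions for illustration, not part of the original release:

    from datasets import load_dataset  # Hugging Face "datasets" library

    # Stream records rather than downloading all ~886 GB up front.
    # Repository id below is an assumed community mirror, not an official source.
    pile = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)

    for record in pile.take(3):
        # Each record is assumed to carry the raw text plus a "pile_set_name"
        # entry identifying which of the 22 sub-datasets it came from.
        print(record["meta"]["pile_set_name"], record["text"][:80])

Streaming mode lets a reader sample the corpus without storing the full dataset locally.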


Copyright disputes centering on the use of The Pile escalated in 2023, prompting EleutherAI to begin removing some sub-datasets. EleutherAI partnered with various organizations to release Common Pile v0.1 in 2025, providing a large curated training dataset free of the copyright issues.

Training on copyrighted works or derivatives

The Books3 component of the dataset contains copyrighted material compiled from Bibliotik, a pirate website.[4][5] In July 2023, the Danish anti-piracy group Rights Alliance had Books3 taken down through DMCA notices.[1] Books3 was removed from the Pile before a class action lawsuit was filed in 2024 by three authors seeking damages, as copies of the original dataset remained available on the web.[5] By 2024, The Pile itself had also been taken down from its original site, though it remained accessible through other file-sharing services.[2]

OpenSubtitles, another sub-dataset used in the Pile, created controversy over the use of copyrighted works, in this case from documentaries, movies, television, and online videos.[6]

Subtitles from tens of thousands of YouTube videos were scraped directly from YouTube and included in the Pile, which YouTube argued was against its terms of service.[7][2]

Sub-datasets of The Pile[8][9][better source needed] (* denotes works with copyright disputes)

    Component                     Original size (GB)
    Pile-CC                                   243.87
    PubMed Central*                            96.93
    Books3                                    108.40
    OpenWebText2*                              67.40
    arXiv*                                     60.36
    GitHub*                                   102.18
    Free Law*                                  54.92
    Stack Exchange*                            34.57
    USPTO Backgrounds*                         24.59
    PubMed Abstracts*                          20.68
    Gutenberg (PG-19)                          11.68
    OpenSubtitles                              13.94
    Wikipedia[2]                                6.85
    DeepMind Mathematics                        8.32
    Ubuntu Freenode IRC logs*                   5.93
    BookCorpus2*                                6.76
    EuroParl[2]                                 4.93
    Hacker News*                                4.19
    YouTube Subtitles*[2]                       4.01
    PhilPapers*                                 2.56
    NIH ExPorter*                               2.03
    Enron Emails[2]                             0.95
    Total                                     886.03
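As a cross-check on the table's arithmetic, the figures can be tabulated directly. Note that the per-component values are rounded independently, so they sum to 886.05 GB rather than the quoted 886.03 GB total, a discrepancy attributable to rounding. A short sketch using only the numbers above:

    # Sizes in GB, copied verbatim from the table above.
    sizes = {
        "Pile-CC": 243.87, "PubMed Central": 96.93, "Books3": 108.40,
        "OpenWebText2": 67.40, "arXiv": 60.36, "GitHub": 102.18,
        "Free Law": 54.92, "Stack Exchange": 34.57, "USPTO Backgrounds": 24.59,
        "PubMed Abstracts": 20.68, "Gutenberg (PG-19)": 11.68,
        "OpenSubtitles": 13.94, "Wikipedia": 6.85, "DeepMind Mathematics": 8.32,
        "Ubuntu Freenode IRC logs": 5.93, "BookCorpus2": 6.76, "EuroParl": 4.93,
        "Hacker News": 4.19, "YouTube Subtitles": 4.01, "PhilPapers": 2.56,
        "NIH ExPorter": 2.03, "Enron Emails": 0.95,
    }

    total = sum(sizes.values())  # 886.05 from rounded rows; the article quotes 886.03
    for name, gb in sorted(sizes.items(), key=lambda kv: -kv[1]):
        print(f"{name:<26} {gb:>7.2f} GB  {100 * gb / total:5.2f}%")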

Common Pile v0.1

In June 2025, EleutherAI, in partnership with Poolside, Hugging Face, the US Library of Congress, and over two dozen researchers at 14 institutions, including the University of Toronto, MIT, CMU, the Vector Institute, and the Allen Institute for AI, released Common Pile v0.1, a training dataset containing only works whose licenses permit their use for training AI models.[10][11][12] The intent is to show what is possible when AI systems are trained ethically, with respect for copyrighted works.[12] The researchers found that gathering the data could not be fully automated and was at times painstaking, with humans verifying and annotating every entry, and that the resulting models achieved impressive results, though still not comparable with frontier models.[12]
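Common Pile v0.1 is likewise distributed through the Hugging Face Hub. A minimal sketch of streaming one component follows; the repository id and the "text" field name below are hypothetical placeholders for illustration, not confirmed identifiers from the release:

    from datasets import load_dataset

    # Hypothetical repository id; Common Pile components are published
    # on the Hugging Face Hub, but the exact dataset names may differ.
    ds = load_dataset("common-pile/example_subset", split="train", streaming=True)

    for record in ds.take(2):
        print(record.get("text", "")[:100])  # "text" field name is an assumption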

References

  1. Barr, Kyle (2023-08-18). "Anti-Piracy Group Takes Massive AI Training Dataset 'Books3' Offline". Gizmodo. Retrieved 2025-12-10.
  2. Gilbertson, Annie. "Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI". Wired. ISSN 1059-1028. Retrieved 2026-01-29.
  3. Anthony, Aubra; Sharma, Lakshmee; Noor, Elina (2024-04-30). "Advancing a More Global Agenda for Trustworthy Artificial Intelligence". Carnegie Endowment for International Peace. Retrieved 2026-01-29.
  4. Knibbs, Kate (2023-09-04). "The Battle Over Books3 Could Change AI Forever". Wired. Retrieved 2023-10-13.
  5. Roth, Emma (2024-08-20). "Authors sue Anthropic for training AI using pirated books". The Verge. Retrieved 2025-12-10.
  6. Deck, Andrew (2025-01-07). "Thousands of documentaries are fueling AI models built by Apple, Meta, and Nvidia". Nieman Lab. Archived from the original on 2025-07-01. Retrieved 2025-12-10.
  7. Sato, Mia (2024-07-16). "Apple, Anthropic, and other companies used YouTube videos to train AI". The Verge. Retrieved 2025-12-10.
  8. Gao, Leo; Biderman, Stella; Black, Sid; Golding, Laurence; Hoppe, Travis; Foster, Charles; Phang, Jason; He, Horace; Thite, Anish; Nabeshima, Noa; Presser, Shawn; Leahy, Connor (2020-12-31). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling". arXiv:2101.00027 [cs.CL].
  9. "The Pile Replication Code". github.com. 2021-06-15. Retrieved 2024-10-29.
  10. Wiggers, Kyle (2025-06-06). "EleutherAI releases massive AI training dataset of licensed and open domain text". TechCrunch. Retrieved 2025-12-10.
  11. Eriksson, Viktor (2025-06-09). "Eleuther AI releases 8TB collection of licensed and open training data". Computerworld. Retrieved 2025-12-10.
  12. Tiku, Nitasha; Jiménez, Andrea; Oremus, Will (2025-06-05). "Analysis: AI firms say they can't respect copyright. These researchers tried". Washington Post.