The Pile (dataset)

The Pile is an 886.03 GB diverse, open-source dataset of English text created as a training dataset for large language models (LLMs). It was constructed by EleutherAI in 2020 and publicly released on December 31 of that year. [1] [2] It is composed of 22 smaller datasets, including 14 new ones. [1]

Creation

Training LLMs requires such vast amounts of data that, before the introduction of the Pile, most of the data used to train them was taken from the Common Crawl. [3] However, LLMs trained on more diverse datasets are better able to handle a wider range of situations after training. [4] The Pile was created to meet the need for a dataset large enough for training that also drew from a wide variety of sources and styles of writing. [1] [5] Compared to other datasets, the Pile's main distinguishing features are that it is a curated selection of data, chosen by researchers at EleutherAI to contain information they thought language models should learn, and that it is the only such dataset to be thoroughly documented by the researchers who developed it. [6]

Contents and filtering

Artificial intelligence models do not learn all they can from data on the first pass, so it is common practice to train a model on the same data more than once, with each pass through the entire dataset referred to as an "epoch". [7] Each of the 22 sub-datasets that make up the Pile was assigned a different number of epochs according to the perceived quality of its data. [1] The table below shows the relative size of each of the 22 sub-datasets before and after being multiplied by the number of epochs; a short sketch of this arithmetic follows the table. Numbers have been converted to GB, and asterisks indicate the newly introduced datasets.

Sub-datasets of the Pile [1] [5]
Component | Original Size | Epochs | Effective Size
Pile-CC | 243.87 GB | 1 | 243.87 GB
PubMed Central* | 96.93 GB | 2 | 193.86 GB
Books3 | 108.40 GB | 1.5 | 162.61 GB
OpenWebText2* | 67.40 GB | 2 | 134.80 GB
arXiv* | 60.36 GB | 2 | 120.71 GB
GitHub* | 102.18 GB | 1 | 102.18 GB
Free Law* | 54.92 GB | 1.5 | 82.39 GB
Stack Exchange* | 34.57 GB | 2 | 69.14 GB
USPTO Backgrounds* | 24.59 GB | 2 | 49.19 GB
PubMed Abstracts* | 20.68 GB | 2 | 41.37 GB
Gutenberg (PG-19) | 11.68 GB | 2.5 | 29.20 GB
OpenSubtitles | 13.94 GB | 1.5 | 20.91 GB
Wikipedia | 6.85 GB | 3 | 20.54 GB
DeepMind Mathematics | 8.32 GB | 2 | 16.63 GB
Ubuntu Freenode IRC logs* | 5.93 GB | 2 | 11.84 GB
BookCorpus2* | 6.76 GB | 1.5 | 10.15 GB
EuroParl | 4.93 GB | 2 | 9.85 GB
Hacker News* | 4.19 GB | 2 | 8.38 GB
YouTube Subtitles* | 4.01 GB | 2 | 8.02 GB
PhilPapers* | 2.56 GB | 2 | 5.11 GB
NIH ExPorter* | 2.03 GB | 2 | 4.07 GB
Enron Emails | 0.95 GB | 2 | 1.89 GB
Total | 886.03 GB | | 1346.69 GB
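
The effective size of each component is simply its original size multiplied by its epoch count, and summing the effective sizes gives the 1346.69 GB total. The following minimal Python sketch reproduces that arithmetic for a few rows using values copied from the table; it is purely illustrative and is not part of EleutherAI's dataset-construction code, and small differences from the tabulated figures are due to rounding to GB.

    # Effective size = original size (GB) x number of epochs.
    # Values copied from the table above; illustrative only.
    components = {
        "Pile-CC": (243.87, 1),
        "PubMed Central": (96.93, 2),
        "Books3": (108.40, 1.5),
        "Wikipedia": (6.85, 3),
        "Enron Emails": (0.95, 2),
    }

    total_effective = 0.0
    for name, (original_gb, epochs) in components.items():
        effective_gb = original_gb * epochs
        total_effective += effective_gb
        print(f"{name}: {original_gb} GB x {epochs} epoch(s) = {effective_gb:.2f} GB")
    print(f"Effective total for these components: {total_effective:.2f} GB")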

EleutherAI chose the datasets to try to cover a wide range of topics and styles of writing, including academic writing, which models trained on other datasets were found to struggle with. [1]

All data used in the Pile was taken from publicly accessible sources. EleutherAI then filtered the dataset as a whole to remove duplicates. Some sub-datasets were also filtered for quality control. Most notably, the Pile-CC is a modified version of the Common Crawl in which the data was filtered to remove parts that are not text, such as HTML formatting and links. [1]
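
As a rough illustration of the deduplication step only (this is not EleutherAI's actual processing pipeline, which is documented in the Pile paper and its replication code [1] [5]), document-level exact deduplication can be sketched as hashing a normalized copy of each document's text and keeping the first occurrence of each hash:

    import hashlib

    def normalize(text: str) -> str:
        """Crude normalization: lowercase and collapse whitespace."""
        return " ".join(text.lower().split())

    def deduplicate(documents):
        """Yield documents whose normalized text has not been seen before.

        Illustrative exact-match deduplication by hash; the Pile's own
        large-scale pipeline is documented in its paper and replication code.
        """
        seen = set()
        for doc in documents:
            digest = hashlib.sha256(normalize(doc["text"]).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                yield doc

    docs = [
        {"text": "The Pile is a large dataset."},
        {"text": "the  Pile is a LARGE dataset."},   # duplicate after normalization
        {"text": "A different document."},
    ]
    print([d["text"] for d in deduplicate(docs)])    # keeps 2 of the 3 documents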

Some potential sub-datasets were excluded for various reasons; the US Congressional Record, for example, was left out because of its racist content. [1]

Within the sub-datasets that were included, individual documents were not filtered to remove non-English, biased, or profane text. Nor was the data filtered on the basis of consent, meaning that, for example, the Pile-CC has all of the same ethical issues as the Common Crawl itself. However, EleutherAI has documented the amount of bias (on the basis of gender, religion, and race) and profanity, as well as the level of consent given, for each of the sub-datasets, allowing an ethics-concerned researcher to use only those parts of the Pile that meet their own standards. [1]
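
Because every document in the Pile's distribution format carries metadata naming its source component, this kind of selective use can be done with a simple filter. The sketch below is illustrative only: it assumes the JSONL-style record layout with a "text" field and a "meta" field containing "pile_set_name", and it streams the community-maintained Hugging Face copy cited in the DMCA section below; the component labels in the allow-list are examples and should be checked against whichever copy of the data is actually used.

    # Minimal sketch: stream a public copy of the Pile and keep only
    # documents from chosen components. Assumptions (not confirmed by this
    # article): the record layout {"text": ..., "meta": {"pile_set_name": ...}}
    # and the exact component label strings.
    from datasets import load_dataset

    ALLOWED = {"PubMed Abstracts", "Wikipedia (en)", "FreeLaw"}  # example allow-list

    pile = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)

    def in_allowed_components(example):
        return example["meta"]["pile_set_name"] in ALLOWED

    for example in pile.filter(in_allowed_components).take(3):
        print(example["meta"]["pile_set_name"], example["text"][:80])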

Use

The Pile was originally developed to train EleutherAI's GPT-Neo models [8] [9] [10] but has become widely used to train other models, including Microsoft's Megatron-Turing Natural Language Generation, [11] [12] Meta AI's Open Pre-trained Transformers, [13] LLaMA, [14] and Galactica, [15] Stanford University's BioMedLM 2.7B, [16] the Beijing Academy of Artificial Intelligence's Chinese-Transformer-XL, [17] and Yandex's YaLM 100B. [18]

In addition to being used as a training dataset, the Pile can also be used as a benchmark to test models and score how well they perform on a variety of writing styles. [2] [19] [20]
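
As a rough illustration of this kind of benchmark use (it is not the evaluation harness used in the cited papers), the following sketch scores a small causal language model on a couple of text snippets by computing perplexity with the Hugging Face transformers library; the model checkpoint and the snippets are placeholders.

    # Minimal sketch: score a causal language model on a few snippets by perplexity.
    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder; any causal LM checkpoint works
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    snippets = [
        "The court held that the statute did not apply retroactively.",
        "We trained the model on a curated corpus of scientific abstracts.",
    ]

    for text in snippets:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            # With labels equal to the inputs, the model returns the mean
            # next-token cross-entropy loss; exp(loss) is the perplexity.
            loss = model(**inputs, labels=inputs["input_ids"]).loss
        print(f"perplexity {math.exp(loss.item()):.1f} :: {text[:50]}")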

DMCA takedown

The Books3 component of the dataset contains copyrighted material compiled from Bibliotik, a pirate website. [21] In July 2023, the Rights Alliance took copies of the Pile down through DMCA notices. [22] [23] Users responded by creating copies of the Pile with the offending content removed. [24]

See also

  * Language model
  * Multimodal learning
  * Neural machine translation
  * Transformer (deep learning architecture)
  * GPT-3
  * GPT-2
  * DALL-E
  * GPT-1
  * Prompt engineering
  * Foundation model
  * Stable Diffusion
  * GPT-4
  * Generative pre-trained transformer
  * GPT-J
  * EleutherAI
  * Large language model
  * LLaMA
  * PaLM
  * Stochastic parrot
  * Open-source artificial intelligence

References

  1. Gao, Leo; Biderman, Stella; Black, Sid; Golding, Laurence; Hoppe, Travis; Foster, Charles; Phang, Jason; He, Horace; Thite, Anish; Nabeshima, Noa; Presser, Shawn; Leahy, Connor (31 December 2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling". arXiv: 2101.00027 [cs.CL].
  2. "The Pile: An 800GB Dataset of Diverse Text for Language Modeling". EleutherAI Website. EleutherAI. 13 February 2020. Retrieved 4 June 2023.
  3. Brown, Tom B; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; et al. (22 Jul 2020). "Language Models are Few-Shot Learners". arXiv: 2005.14165 [cs.CL].
  4. Rosset, Corby (13 February 2020). "Turing-NLG: A 17-billion-parameter language model by Microsoft". Microsoft Blog. Microsoft. Retrieved 31 December 2020.
  5. Gao, Leo; Biderman, Stella; Hoppe, Travis; Grankin, Mikhail; researcher2; trisongz; sdtblck (15 June 2021). "The Pile Replication Code". github.com. Retrieved 6 June 2023.
  6. Khan, Mehtab; Hanna, Alex (13 September 2022). "The Subjects and Stages of AI Dataset Development: A Framework for Dataset Accountability". Retrieved 8 March 2023 via papers.ssrn.com.
  7. Brownlee, Jason (10 August 2022). "Difference Between a Batch and an Epoch in a Neural Network". Retrieved 2 June 2023 via machinelearningmastery.com.
  8. "GPT-Neo 125M". huggingface.co. 8 December 2022. Retrieved 7 June 2023.
  9. "GPT-Neo 1.3B". huggingface.co. 8 December 2022. Retrieved 7 June 2023.
  10. "GPT-Neo 2.7B". huggingface.co. 8 December 2022. Retrieved 7 June 2023.
  11. "Microsoft and Nvidia team up to train one of the world's largest language models". 11 October 2021. Retrieved 8 March 2023.
  12. "AI: Megatron the Transformer, and its related language models". 24 September 2021. Retrieved 8 March 2023.
  13. Zhang, Susan; Roller, Stephen; Goyal, Naman; Artetxe, Mikel; Chen, Moya; Chen, Shuohui; Dewan, Christopher; Diab, Mona; Li, Xian; Lin, Xi Victoria; Mihaylov, Todor; Ott, Myle; Shleifer, Sam; Shuster, Kurt; Simig, Daniel; Koura, Punit Singh; Sridhar, Anjali; Wang, Tianlu; Zettlemoyer, Luke (21 June 2022). "OPT: Open Pre-trained Transformer Language Models". arXiv: 2205.01068 [cs.CL].
  14. Touvron, Hugo; Lavril, Thibaut; Izacard, Gautier; Grave, Edouard; Lample, Guillaume; et al. (27 February 2023). "LLaMA: Open and Efficient Foundation Language Models". arXiv: 2302.13971 [cs.CL].
  15. Taylor, Ross; Kardas, Marcin; Cucurull, Guillem; Scialom, Thomas; Hartshorn, Anthony; Saravia, Elvis; Poulton, Andrew; Kerkez, Viktor; Stojnic, Robert (16 November 2022). "Galactica: A Large Language Model for Science". arXiv: 2211.09085 [cs.CL].
  16. "Model Card for BioMedLM 2.7B". huggingface.co. Retrieved 5 June 2023.
  17. Yuan, Sha; Zhao, Hanyu; Du, Zhengxiao; Ding, Ming; Liu, Xiao; Cen, Yukuo; Zou, Xu; Yang, Zhilin; Tang, Jie (1 January 2021). "WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models". AI Open. 2: 65–68. doi:10.1016/j.aiopen.2021.06.001 . Retrieved 8 March 2023 via ScienceDirect.
  18. Grabovskiy, Ilya (2022). "Yandex publishes YaLM 100B, the largest GPT-like neural network in open source" (Press release). Yandex. Retrieved 5 June 2023.
  19. Rae, Jack W; Borgeaud, Sebastian; Cai, Trevor; Millican, Katie; Hoffmann, Jordan; Song, Francis; Aslanides, John; Henderson, Sarah; Ring, Roman; Young, Susannah; et al. (21 Jan 2022). "Scaling Language Models: Methods, Analysis & Insights from Training Gopher". arXiv: 2112.11446 [cs.CL].
  20. Lieber, Opher; Sharir, Or; Lenz, Barak; Shoham, Yoav (1 August 2021). "Jurassic-1: Technical Details and Evaluation" (PDF). AI21 Labs. Retrieved 5 June 2023.
  21. "The Battle Over Books3 Could Change AI Forever". wired.com. Retrieved 13 October 2023.
  22. "Rights Alliance removes the illegal Books3 dataset used to train artificial intelligence". Rights Alliance. Retrieved 29 August 2023.
  23. "The Pile An 800GB Dataset of Diverse Text for Language Modeling". academictorrents.com. Retrieved 29 August 2023.
  24. "monology/pile-uncopyrighted - Dataset at Hugging Face".