The Pile is an 886.03 GB diverse, open-source dataset of English text created as a training dataset for large language models (LLMs). It was constructed by EleutherAI in 2020 and publicly released on December 31 of that year. [1] [2] It is composed of 22 smaller datasets, including 14 new ones. [1]
Training LLMs requires vast amounts of data; before the introduction of the Pile, most data used for this purpose was taken from the Common Crawl. [3] However, LLMs trained on more diverse datasets are better able to handle a wider range of situations after training. [4] The creation of the Pile was motivated by the need for a sufficiently large dataset drawn from a wide variety of sources and styles of writing. [1] [5] Compared to other datasets, the Pile has two main distinguishing features: it is a curated selection of data chosen by researchers at EleutherAI to contain information they thought language models should learn, and it is the only such dataset that is thoroughly documented by the researchers who developed it. [6]
Machine learning models do not learn all they can from data on the first pass, so it is common practice to train a model on the same data more than once; each pass through the entire dataset is referred to as an "epoch". [7] Each of the 22 sub-datasets that make up the Pile was assigned a different number of epochs according to the perceived quality of the data. [1] The table below shows the size of each of the 22 sub-datasets before and after being multiplied by its number of epochs. Numbers have been converted to GB, and asterisks indicate the newly introduced datasets.
Component | Original size | Epochs | Effective size |
---|---|---|---|
Pile-CC | 243.87 GB | 1 | 243.87 GB |
PubMed Central* | 96.93 GB | 2 | 193.86 GB |
Books3 | 108.40 GB | 1.5 | 162.61 GB |
OpenWebText2* | 67.40 GB | 2 | 134.80 GB |
arXiv* | 60.36 GB | 2 | 120.71 GB |
GitHub* | 102.18 GB | 1 | 102.18 GB |
Free Law* | 54.92 GB | 1.5 | 82.39 GB |
Stack Exchange* | 34.57 GB | 2 | 69.14 GB |
USPTO Backgrounds* | 24.59 GB | 2 | 49.19 GB |
PubMed Abstracts* | 20.68 GB | 2 | 41.37 GB |
Gutenberg (PG-19) | 11.68 GB | 2.5 | 29.20 GB |
OpenSubtitles | 13.94 GB | 1.5 | 20.91 GB |
Wikipedia | 6.85 GB | 3 | 20.54 GB |
DeepMind Mathematics | 8.32 GB | 2 | 16.63 GB |
Ubuntu Freenode IRC logs* | 5.93 GB | 2 | 11.84 GB |
BookCorpus2* | 6.76 GB | 1.5 | 10.15 GB |
EuroParl | 4.93 GB | 2 | 9.85 GB |
Hacker News* | 4.19 GB | 2 | 8.38 GB |
YouTube Subtitles* | 4.01 GB | 2 | 8.02 GB |
PhilPapers* | 2.56 GB | 2 | 5.11 GB |
NIH ExPorter* | 2.03 GB | 2 | 4.07 GB |
Enron Emails | 0.95 GB | 2 | 1.89 GB |
Total | 886.03 GB | | 1346.69 GB |
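The effective size in the table is the original size multiplied by the number of epochs. The following is a minimal sketch of that arithmetic for a few rows; the function and variable names are illustrative and not taken from EleutherAI's code. Because the table's figures were rounded after conversion to GB, recomputed values can differ from the table in the last decimal place.

```python
# Original sizes (GB) and epoch counts for a few components, copied from the table.
components = {
    "Pile-CC": (243.87, 1.0),
    "PubMed Central": (96.93, 2.0),
    "Wikipedia": (6.85, 3.0),
    "Enron Emails": (0.95, 2.0),
}


def effective_size(original_gb: float, epochs: float) -> float:
    """Return the original size weighted by the number of epochs."""
    return original_gb * epochs


for name, (size_gb, epochs) in components.items():
    print(f"{name}: {effective_size(size_gb, epochs):.2f} GB")
```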
EleutherAI chose the datasets to try to cover a wide range of topics and styles of writing, including academic writing, which models trained on other datasets were found to struggle with. [1]
All data used in the Pile was taken from publicly accessible sources. EleutherAI then filtered the dataset as a whole to remove duplicates. Some sub-datasets were also filtered for quality control. Most notably, the Pile-CC is a modified version of the Common Crawl in which the data was filtered to remove parts that are not text, such as HTML formatting and links. [1]
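As an illustration of what duplicate removal involves, the sketch below drops documents whose normalized text hashes to a value already seen. This is a simplified stand-in chosen for clarity, not EleutherAI's actual pipeline; the normalization rule and the function names are assumptions made for the example.

```python
import hashlib


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies match."""
    return " ".join(text.split()).lower()


def deduplicate(documents):
    """Return documents in their original order, keeping the first copy of each."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = ["The quick brown fox.", "the  quick brown fox.", "A different document."]
print(deduplicate(docs))
```

Exact hashing only catches copies that are identical after normalization; near-duplicates with small edits would need a fuzzy matching method instead.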
Some potential sub-datasets were excluded for various reasons, such as the US Congressional Record, which was excluded due to its racist content. [1]
Within the sub-datasets that were included, individual documents were not filtered to remove non-English, biased, or profane text. The data was also not filtered on the basis of consent, meaning that, for example, the Pile-CC has all of the same ethical issues as the Common Crawl itself. However, EleutherAI has documented the amount of bias (on the basis of gender, religion, and race) and profanity, as well as the level of consent given, for each of the sub-datasets, allowing an ethics-concerned researcher to use only those parts of the Pile that meet their own standards. [1]
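Selecting only some sub-datasets amounts to filtering records by their component label. The sketch below assumes each record is a JSON line with a `text` field and a `meta.pile_set_name` field naming its component; that layout, the label strings, and the allow-list are assumptions made for illustration and should be checked against the copy of the dataset being used.

```python
import json

# Hypothetical allow-list of components a researcher has decided to keep.
ALLOWED = {"Wikipedia", "PubMed Abstracts"}


def filter_records(lines, allowed):
    """Yield parsed records whose component label is in the allow-list."""
    for line in lines:
        record = json.loads(line)
        # Assumed field layout: {"text": ..., "meta": {"pile_set_name": ...}}
        if record.get("meta", {}).get("pile_set_name") in allowed:
            yield record


sample_lines = [
    json.dumps({"text": "An encyclopedia entry.", "meta": {"pile_set_name": "Wikipedia"}}),
    json.dumps({"text": "A scraped web page.", "meta": {"pile_set_name": "Pile-CC"}}),
]
kept = list(filter_records(sample_lines, ALLOWED))
print([record["meta"]["pile_set_name"] for record in kept])
```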
The Pile was originally developed to train EleutherAI's GPT-Neo models [8] [9] [10] but has become widely used to train other models, including Microsoft's Megatron-Turing Natural Language Generation; [11] [12] Meta AI's Open Pre-trained Transformers, [13] LLaMA, [14] and Galactica; [15] Stanford University's BioMedLM 2.7B; [16] the Beijing Academy of Artificial Intelligence's Chinese-Transformer-XL; [17] Yandex's YaLM 100B; [18] and Apple's OpenELM. [19]
In addition to being used as a training dataset, the Pile can also be used as a benchmark to test models and score how well they perform on a variety of writing styles. [2] [20] [21]
The Books3 component of the dataset contains copyrighted material compiled from Bibliotik, a pirate website. [22] In July 2023, the Rights Alliance had copies of the Pile taken down through DMCA notices. [23] [24] Users responded by creating copies of the Pile with the offending content removed. [25]