The Pile (dataset)

The Pile is an 886.03 GB diverse, open-source dataset of English text created as a training dataset for large language models (LLMs). It was constructed by EleutherAI in 2020 and publicly released on December 31 of that year. It is composed of 22 smaller datasets, including 14 new ones.


Creation

Training LLMs requires sufficiently vast amounts of data that, before the introduction of the Pile, most data used for training LLMs was taken from the Common Crawl. However, LLMs trained on more diverse datasets are better able to handle a wider range of situations after training. The creation of the Pile was motivated by the need for a dataset large enough to contain data from a wide variety of sources and styles of writing. Compared to other datasets, the Pile's main distinguishing features are that it is a curated selection of data, chosen by researchers at EleutherAI to contain information they thought language models should learn, and that it is the only such dataset thoroughly documented by the researchers who developed it.


Contents and filtering

Artificial intelligences do not learn all they can from data on a single pass, so it is common practice to train an AI on the same data more than once, with each pass through the entire dataset referred to as an "epoch". Each of the 22 sub-datasets that make up the Pile was assigned a different number of epochs according to the perceived quality of its data, so that higher-quality sources are seen more often during training. EleutherAI documented the relative size of each sub-dataset, in GB, both before and after multiplication by its epoch count, marking the newly introduced datasets with asterisks. The datasets were chosen to cover a wide range of topics and styles of writing, including academic writing, which models trained on other datasets were found to struggle with.

All data used in the Pile was taken from publicly accessible sources. EleutherAI then filtered the dataset as a whole to remove duplicates, and some sub-datasets were also filtered individually for quality control. Most notably, Pile-CC is a modified version of the Common Crawl in which the data was filtered to remove parts that are not text, such as HTML formatting and links. Some potential sub-datasets were excluded for various reasons; the US Congressional Record, for example, was excluded due to its racist content. Within the sub-datasets that were included, individual documents were not filtered to remove non-English, biased, or profane text, nor were they filtered on the basis of consent, meaning that, for example, Pile-CC has all of the same ethical issues as the Common Crawl itself. However, EleutherAI has documented the amount of bias (on the basis of gender, religion, and race) and profanity in each sub-dataset, as well as the level of consent given for each, allowing an ethics-concerned researcher to use only those parts of the Pile that meet their own standards.
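
To illustrate how per-dataset epoch counts shape the training mixture, the sketch below samples sub-datasets with probability proportional to their epoch-weighted sizes. The subset names are real, but the sizes and epoch counts here are hypothetical placeholders rather than the figures EleutherAI reported, and the sampling routine is an illustration, not EleutherAI's actual pipeline.

    import random

    # A minimal sketch of epoch-weighted sampling over Pile sub-datasets.
    # Subset names are real; these sizes (GB) and epoch counts are
    # hypothetical placeholders, not the figures from the Pile paper.
    SUBSETS = {
        "Pile-CC": (220.0, 1.0),
        "PubMed Central": (90.0, 2.0),
        "Books3": (100.0, 1.5),
    }

    # Effective size = raw size * epochs; it determines how often each
    # subset is drawn when assembling training data.
    WEIGHTS = {name: size * epochs for name, (size, epochs) in SUBSETS.items()}
    TOTAL = sum(WEIGHTS.values())

    def sample_subset(rng: random.Random) -> str:
        """Pick a sub-dataset with probability proportional to its effective size."""
        r = rng.uniform(0.0, TOTAL)
        acc = 0.0
        for name, weight in WEIGHTS.items():
            acc += weight
            if r < acc:
                return name
        return name  # floating-point edge case: fall back to the last subset

    rng = random.Random(0)
    counts = {name: 0 for name in SUBSETS}
    for _ in range(10_000):
        counts[sample_subset(rng)] += 1
    print(counts)  # roughly proportional to 220 : 180 : 150

Under this scheme, a small, high-quality subset assigned several epochs can contribute as much to training as a much larger subset seen only once.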


Use

The Pile was originally developed to train EleutherAI's GPT-Neo models but has become widely used to train other models, including Microsoft's Megatron-Turing Natural Language Generation, Meta AI's Open Pre-trained Transformers, LLaMA, and Galactica, Stanford University's BioMedLM 2.7B, the Beijing Academy of Artificial Intelligence's Chinese-Transformer-XL, Yandex's YaLM 100B, and Apple's OpenELM. In addition to being used as a training dataset, the Pile can also be used as a benchmark to test models and score how well they perform on a variety of writing styles.
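
As a rough illustration of such benchmarking, a model's fit to a text sample can be scored by perplexity, the exponential of its average per-token loss. The sketch below assumes the torch and transformers libraries and uses EleutherAI's publicly released GPT-Neo 125M checkpoint; the sample string is a stand-in, not an actual held-out Pile document.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load a small Pile-trained model and score a text sample by perplexity.
    model_name = "EleutherAI/gpt-neo-125m"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    text = "The quick brown fox jumps over the lazy dog."
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        # Passing the input ids as labels makes the model return the mean
        # cross-entropy of predicting each token from the ones before it.
        outputs = model(**inputs, labels=inputs["input_ids"])

    perplexity = torch.exp(outputs.loss)
    print(f"Perplexity: {perplexity.item():.2f}")

Lower perplexity on a given sub-dataset indicates that the model predicts that style of text more accurately.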


DMCA takedown

The Books3 component of the dataset contains copyrighted material compiled from Bibliotik, a pirate website. In July 2023, the Rights Alliance had copies of the Pile taken down through DMCA notices. Users responded by creating copies of the Pile with the offending content removed.


See also

* List of chatbots

