The Pile (dataset)
   HOME
*





The Pile (dataset)
The Pile is an 886.03 GB diverse, open-source dataset of English text created as a training dataset for large language models (LLMs). It was constructed by EleutherAI in 2020 and publicly released on December 31 of that year. It is composed of 22 smaller datasets, including 14 new ones. Creation Training LLMs requires sufficiently vast amounts of data that, before the introduction of the Pile, most data used for training LLMs was taken from the Common Crawl. However, LLMs trained on more diverse datasets are better able to handle a wider range of situations after training. The creation of the Pile was motivated by the need for a large enough dataset that contained data from a wide variety of sources and styles of writing. Compared to other datasets, the Pile's main distinguishing features are that it is a curated selection of data chosen by researchers at EleutherAI to contain information they thought language models should learn and that it is the only such dataset that is thor ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Large Language Model
A large language model (LLM) is a language model consisting of a neural network with many parameters (typically billions of weights or more), trained on large quantities of unlabelled text using self-supervised learning. LLMs emerged around 2018 and perform well at a wide variety of tasks. This has shifted the focus of natural language processing research away from the previous paradigm of training specialized supervised models for specific tasks. Properties Though the term ''large language model'' has no formal definition, it often refers to deep learning models having a parameter count on the order of billions or more. LLMs are general purpose models which excel at a wide range of tasks, as opposed to being trained for one specific task (such as sentiment analysis, named entity recognition, or mathematical reasoning). The skill with which they accomplish tasks, and the range of tasks at which they are capable, seems to be a function of the amount of resources (data, parameter-siz ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Freenode
Freenode, stylized as freenode and formerly known as Open Projects Network, is an IRC network which was previously used to discuss peer-directed projects. Their servers are accessible from the hostname , which load balances connections by using round-robin DNS. On 19 May 2021, Freenode underwent what some staff described as a "hostile takeover" and at least 14 volunteer staff members resigned. Following the events, various organisations using Freenode – including Arch Linux, CentOS, FreeBSD, Free Software Foundation Europe, Gentoo Linux, KDE, LineageOS, Slackware, Ubuntu, and the Wikimedia Foundation – moved their channels to Libera Chat, a network created by former Freenode staff. Others like Haiku or Alpine Linux moved to the Open and Free Technology Community (OFTC). As of 16 August 2021, over a thousand projects have left Freenode. History Freenode began as a four-person Linux support channel called on EFnet, another IRC network. By 1995, after moving to Unde ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Stanford University
Stanford University, officially Leland Stanford Junior University, is a private research university in Stanford, California. The campus occupies , among the largest in the United States, and enrolls over 17,000 students. Stanford is considered among the most prestigious universities in the world. Stanford was founded in 1885 by Leland and Jane Stanford in memory of their only child, Leland Stanford Jr., who had died of typhoid fever at age 15 the previous year. Leland Stanford was a U.S. senator and former governor of California who made his fortune as a railroad tycoon. The school admitted its first students on October 1, 1891, as a coeducational and non-denominational institution. Stanford University struggled financially after the death of Leland Stanford in 1893 and again after much of the campus was damaged by the 1906 San Francisco earthquake. Following World War II, provost of Stanford Frederick Terman inspired and supported faculty and graduates' entrepreneu ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

LLaMA
The llama (; ) (''Lama glama'') is a domesticated South American camelid, widely used as a List of meat animals, meat and pack animal by Inca empire, Andean cultures since the Pre-Columbian era. Llamas are social animals and live with others as a herd. Their wool is soft and contains only a small amount of lanolin. Llamas can learn simple tasks after a few repetitions. When using a pack, they can carry about 25 to 30% of their body weight for 8 to 13 kilometre, km (5–8 miles). The name ''llama'' (in the past also spelled "lama" or "glama") was adopted by European colonization of the Americas, European settlers from Indigenous people in Peru, native Peruvians. The ancestors of llamas are thought to have originated from the Great Plains of North America about 40 million years ago, and subsequently migrated to South America about three million years ago during the Great American Interchange. By the end of the last Quaternary glaciation, ice age (10,000–12,000 years ago), ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Meta Platforms
Meta Platforms, Inc., (file no. 3835815) doing business as Meta and formerly named Facebook, Inc., and TheFacebook, Inc., is an American multinational technology conglomerate based in Menlo Park, California. The company owns Facebook, Instagram, and WhatsApp, among other products and services. Meta was once one of the world's most valuable companies, but as of 2022 is not one of the top twenty biggest companies in the United States. It is considered one of the Big Five American information technology companies, alongside Alphabet ( Google), Amazon, Apple, and Microsoft. As of 2022, it is the least profitable of the five. Meta's products and services include Facebook, Messenger, Facebook Watch, and Meta Portal. It has also acquired Oculus, Giphy, Mapillary, Kustomer, Presize and has a 9.99% stake in Jio Platforms. In 2021, the company generated 97.5% of its revenue from the sale of advertising. In October 2021, the parent company of Facebook changed its name fro ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Congressional Record
The ''Congressional Record'' is the official record of the proceedings and debates of the United States Congress, published by the United States Government Publishing Office and issued when Congress is in session. The Congressional Record Index is updated daily online and published monthly. At the end of a session of Congress, the daily editions are compiled in bound volumes constituting the permanent editionChapter 9 of Title 44 of the United States Codeauthorizes publication of the ''Congressional Record''. The ''Congressional Record'' consists of four sections: the House section, the Senate section, the Extensions of Remarks, and, since the 1940s, the Daily Digest. At the back of each daily issue is the Daily Digest, which summarizes the day's floor and committee activities and serves as a table of contents for each issue. The House and Senate sections contain proceedings for the separate chambers of Congress. A section of the ''Congressional Record'' titled ''Extensions of ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

HTML
The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScript. Web browsers receive HTML documents from a web server or from local storage and render the documents into multimedia web pages. HTML describes the structure of a web page semantically and originally included cues for the appearance of the document. HTML elements are the building blocks of HTML pages. With HTML constructs, images and other objects such as interactive forms may be embedded into the rendered page. HTML provides a means to create structured documents by denoting structural semantics for text such as headings, paragraphs, lists, links, quotes, and other items. HTML elements are delineated by ''tags'', written using angle brackets. Tags such as and directly introduce content into the page. Other tags such as surround ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Enron Corpus
The Enron Corpus is a database of over 600,000 emails generated by 158 employees of the Enron Corporation in the years leading up to the company's collapse in December 2001. The corpus was generated from Enron email servers by the Federal Energy Regulatory Commission (FERC) during its subsequent investigation. A copy of the email database was subsequently purchased for $10,000 by Andrew McCallum, a computer scientist at the University of Massachusetts Amherst.Markoff, John.Armies of Expensive Lawyers, Replaced by Cheaper Software. ''New York Times'' March 5, 2011. p A1. He released this copy to researchers, providing a trove of data that has been used for studies on social networking and computer-mediated communication. Creation In the legal investigation into Enron's collapse, the discovery process required collecting and preserving vast amounts of data, for which the FERC hired Aspen Systems (now part of Lockheed Martin). The emails were collected at Enron Corporation headquart ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

National Institutes Of Health
The National Institutes of Health, commonly referred to as NIH (with each letter pronounced individually), is the primary agency of the United States government responsible for biomedical and public health research. It was founded in the late 1880s and is now part of the United States Department of Health and Human Services. The majority of NIH facilities are located in Bethesda, Maryland, and other nearby suburbs of the Washington metropolitan area, with other primary facilities in the Research Triangle Park in North Carolina and smaller satellite facilities located around the United States. The NIH conducts its own scientific research through the NIH Intramural Research Program (IRP) and provides major biomedical research funding to non-NIH research facilities through its Extramural Research Program. , the IRP had 1,200 principal investigators and more than 4,000 postdoctoral fellows in basic, translational, and clinical research, being the largest biomedical research instit ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


PhilPapers
PhilPapers is an interactive academic database of journal articles in philosophy. It is maintained by the Centre for Digital Philosophy at the University of Western Ontario, and as of 2022, it has "394,867 registered users, including the majority of professional philosophers and graduate students." The general editors are its founders, David Bourget and David Chalmers. PhilPapers receives financial support from other organizations, including a substantial grant in early 2009 from the Joint Information Systems Committee Jisc is a United Kingdom not-for-profit company that provides network and IT services and digital resources in support of further and higher education institutions and research as well as not-for-profits and the public sector. History T ... in the United Kingdom. The archive is praised for its comprehensiveness and organization, and for its regular updates. In addition to archiving papers, the editors run and publish the most extensive ongoing survey ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

YouTube
YouTube is a global online video platform, online video sharing and social media, social media platform headquartered in San Bruno, California. It was launched on February 14, 2005, by Steve Chen, Chad Hurley, and Jawed Karim. It is owned by Google, and is the List of most visited websites, second most visited website, after Google Search. YouTube has more than 2.5 billion monthly users who collectively watch more than one billion hours of videos each day. , videos were being uploaded at a rate of more than 500 hours of content per minute. In October 2006, YouTube was bought by Google for $1.65 billion. Google's ownership of YouTube expanded the site's business model, expanding from generating revenue from advertisements alone, to offering paid content such as movies and exclusive content produced by YouTube. It also offers YouTube Premium, a paid subscription option for watching content without ads. YouTube also approved creators to participate in Google's Google AdSens ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


Hacker News
Hacker News (sometimes abbreviated as HN) is a social news website focusing on computer science and entrepreneurship. It is run by the investment fund and startup incubator Y Combinator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity." The word ''hacker'' in "Hacker News" is used in its original meaning and refers to the hacker culture which consists of people who enjoy tinkering with technology. History The site was created by Paul Graham in February 2007. Initially called Startup News or occasionally News.YC., it became known by its current name on August 14, 2007. It developed as a project of Graham's company Y Combinator, functioning as a real-world application of the Arc programming language which Graham co-developed. At the end of March 2014, Graham stepped away from his leadership role at Y Combinator, leaving Hacker News administration in the hands of other staff members. The site is currently moderated by D ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]