BookCorpus

BookCorpus (also sometimes referred to as the Toronto Book Corpus) is a dataset consisting of the text of around 7,000 self-published books scraped from the indie ebook distribution website Smashwords. It was the main corpus used to train the initial GPT model by OpenAI, and has been used as training data for other early large language models, including Google's BERT. The dataset consists of around 985 million words, and the books that comprise it span a range of genres, including romance, science fiction, and fantasy.

The corpus was introduced in a 2015 paper by researchers from the University of Toronto and MIT titled "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books". The authors described it as consisting of "free books written by yet unpublished authors", but this description is inaccurate: the books were written by self-published ("indie") authors who had priced them at free, and they were downloaded without the consent or permission of Smashwords or its authors, in violation of the Smashwords Terms of Service. The dataset was initially hosted on a University of Toronto webpage. An official version of the original dataset is no longer publicly available, though at least one substitute, BookCorpusOpen, has been created. Though not documented in the original 2015 paper, the site from which the corpus's books were scraped is now known to be Smashwords.
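
For readers who want to examine the data in practice, the following is a minimal sketch using the BookCorpusOpen substitute, assuming it is still published under the "bookcorpusopen" identifier on the Hugging Face Hub; the identifier and field names are assumptions for illustration, not part of the original dataset release.

<pre>
# Minimal sketch: inspecting a BookCorpus substitute via the Hugging Face
# "datasets" library. The "bookcorpusopen" identifier and the "text" field
# are assumptions; the original BookCorpus has no official public release.
from datasets import load_dataset

# Stream the corpus so the full text is not downloaded at once.
books = load_dataset("bookcorpusopen", split="train", streaming=True)

# Peek at the first few books and tally an approximate word count.
total_words = 0
for i, book in enumerate(books):
    total_words += len(book["text"].split())  # each record holds one book's text
    if i == 4:
        break
print(f"Approximate word count of the first 5 books: {total_words:,}")
</pre>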


References

* "Improving Language Understanding by Generative Pre-Training". OpenAI. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (archived January 26, 2021 at https://web.archive.org/web/20210126024542/https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf).
* Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2 [cs.CL].
* Lea, Richard (28 September 2016). "Google swallows 11,000 novels to improve AI's conversation". The Guardian. https://www.theguardian.com/books/2016/sep/28/google-swallows-11000-novels-to-improve-ais-conversation
* Zhu, Yukun; Kiros, Ryan; Zemel, Rich; Salakhutdinov, Ruslan; Urtasun, Raquel; Torralba, Antonio; Fidler, Sanja (2015). "Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books". Proceedings of the IEEE International Conference on Computer Vision (ICCV). https://www.cv-foundation.org/openaccess/content_iccv_2015/html/Zhu_Aligning_Books_and_ICCV_2015_paper.html
* Bandy, Jack; Vincent, Nicholas (2021). "Addressing 'Documentation Debt' in Machine Learning: A Retrospective Datasheet for BookCorpus". Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks. https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/54229abfcfa5649e7003b83dd4755294-Paper-round1.pdf
Categories: Datasets in machine learning · English corpora