
The ACL Data Collection Initiative (ACL/DCI) was a project established in 1989 by the Association for Computational Linguistics (ACL) to create and distribute large text and speech corpora for computational linguistics research. The initiative aimed to address the growing need for substantial text databases that could support research in areas such as natural language processing, speech recognition, and computational linguistics. By 1993, the initiative's activities had effectively ceased, with its functions and datasets absorbed by the Linguistic Data Consortium (LDC), which was founded in 1992.


Objectives

The ACL/DCI had several key objectives:

* To acquire a large and diverse text corpus from various sources
* To transform the collected texts into a common format based on the Standard Generalized Markup Language (SGML)
* To make the corpus available for scientific research at low cost with minimal restrictions
* To provide a common database that would allow researchers to replicate or extend published results
* To reduce duplication of effort among researchers in obtaining and preparing text data

These objectives were designed to address the growing demand for very large amounts of text arising from applications in recognition and analysis of text and speech. The initiative's core objective was to "oversee the acquisition and preparation of a large text corpus to be made available for scientific research at cost and without royalties".


History

By the late 1980s, researchers in computational linguistics and speech recognition faced a significant problem: the lack of large-scale, accessible text corpora for developing statistical models and testing algorithms. The text databases that were generally available were too small to meet the needs of emerging applications in text and speech recognition. The initiative was formed to meet this need by collecting, standardizing, and distributing large quantities of text data with minimal restrictions for scientific research. As stated by Liberman (1990), "research workers have been severely hampered by the lack of appropriate materials, and specially by the lack of a large enough body of text on which published results can be replicated or extended by others."

The ACL/DCI committee was established in February 1989. The committee included members from academic and industrial research laboratories in the United States and Europe. The initiative was chaired by Mark Liberman of the University of Pennsylvania (formerly of AT&T Bell Laboratories). Other committee members included representatives from organizations such as Bellcore, IBM T.J. Watson Research Center, Cambridge University, Virginia Polytechnic Institute & State University, Northeastern University, University of Pennsylvania, SRI International, MCC, Xerox PARC, ISSCO, and University of Pisa.

The project initially operated without dedicated funding, relying on volunteer (pro bono) efforts from committee members and their affiliated institutions. Key supporters included AT&T Bell Labs, Bellcore, IBM, Xerox, and the University of Pennsylvania, which allowed the use of their computing facilities for ACL/DCI-related work. In 1991, the initiative obtained funding from General Electric and the National Science Foundation (grant IRI-9113530), according to the README file of ACL/DCI CD-ROM 1 (September 1991).


Data

As of 1990, the ACL/DCI had collected hundreds of millions of words of diverse text. The collection included:

* Wall Street Journal articles (25 to 50 million words);
* Canadian Hansard (parliamentary records) in parallel English and French versions: a cleaned-up English Hansard donated by the IBM alignment models group (100 million words), and the original bilingual Hansard (from a different time period) obtained directly (200 million words);
* Collins English Dictionary (1979 edition), both as full text (3 million words) and as various "database" versions, constructed from the "typographers' tape" donated by Collins, that is, the computer tapes containing the structured digital data used to typeset and print the 1979 edition of the dictionary;
* Emails from two newsletters distributed over the ARPANET, the ACM Special Interest Group on Information Retrieval Forum (IRLIST) and the AIList Digest (AILIST) (5 million words), both collected by Edward A. Fox at VPI&SU;
* Articles on networking (2 million words);
* U.S. Department of Agriculture Extension Service Fact Sheets (more than 1 million words);
* 200,000 scientific abstracts of about 1,500 words each from the Department of Energy (25 million words);
* Archives of the Challenger Investigation Commission, including transcripts of depositions and hearings (2.5 million words);
* Books from the Library of America, including works by Mark Twain, Eugene O'Neill, Ralph Waldo Emerson, Herman Melville, W.E.B. DuBois, Willa Cather, and Benjamin Franklin (130 books, 20 million words);
* Public domain books such as the King James Bible, Tristram Shandy, and The Federalist Papers;
* About 5 million words of transcribed radiologists' reports, donated by Francis Ganong at Kurzweil Applied Intelligence Inc.;
* The Child Language Data Exchange corpus of child language acquisition transcripts;
* U.S. Department of Justice Justice Retrieval and Inquiry System (JURIS) materials;
* The Swiss Civil Code in parallel German, French and Italian;
* Economic reports from the Union Bank of Switzerland, in parallel English, German, French and Italian;
* About 12K words of administrative policy manuals and 14K words of administrative memos, contributed by Geoff Pullum of U.C.S.C.;
* Material from various ACM journals and the ACL journal Computational Linguistics;
* The CSLI publications series: 50-100 reports (8K words each) and 5-10 books (80K words each).

The initiative started with North American English text but expanded to include Canadian French and planned to include Japanese, Chinese, and other Asian languages. At least 5 million words from the collection were tagged under the Penn Treebank project, and those tags were distributed by the DCI as well. After the DCI was absorbed by the LDC, the datasets were curated by the LDC.


Format

The ACL/DCI corpus was coded in a standard form based on SGML (Standard Generalized Markup Language, ISO 8879), consistent with the recommendations of the Text Encoding Initiative (TEI), of which the DCI was an affiliated project. The TEI was a joint project of the ACL, the Association for Computers and the Humanities, and the Association for Literary and Linguistic Computing, aiming to provide a common interchange format for literary and linguistic data. Over time, the initiative planned to add annotations reflecting consensually approved linguistic features such as part of speech and various aspects of syntactic and semantic structure.
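To make the idea of a common SGML-based encoding concrete, the following is a minimal Python sketch of wrapping plain text in SGML-style markup. The element names `doc` and `p`, the `id` attribute, and the function name `to_sgml` are hypothetical and chosen for illustration only; they are not the actual ACL/DCI or TEI tag set, which this article does not specify.

```python
from typing import Iterable
from xml.sax.saxutils import escape, quoteattr


def to_sgml(doc_id: str, paragraphs: Iterable[str]) -> str:
    """Wrap paragraphs in hypothetical SGML-style <doc>/<p> elements.

    The tag names are illustrative assumptions, not the DCI's real markup.
    """
    # quoteattr() adds the surrounding quotes and escapes the attribute value.
    lines = [f"<doc id={quoteattr(doc_id)}>"]
    for text in paragraphs:
        # escape() replaces &, < and > so the text cannot be read as markup.
        lines.append(f"<p>{escape(text)}</p>")
    lines.append("</doc>")
    return "\n".join(lines)


print(to_sgml("demo-1", ["AT&T rose.", "x < y"]))
```

The point of a shared encoding like this is that one parser can process every text in the collection, regardless of which publisher or archive it came from.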


Examples

As an example of the use of ACL/DCI data, consider the Wall Street Journal (WSJ) corpus for speech recognition research. The WSJ corpus was used as the basis for the DARPA Spoken Language System (SLS) community's Continuous Speech Recognition (CSR) Corpus. The WSJ corpus became a standard benchmark for evaluating speech recognition systems and has been used in numerous research papers. The WSJ CSR Corpus provided DARPA with its first general-purpose English, large vocabulary, natural language, high perplexity corpus, containing speech (400 hours) and text (47 million words) from 1987–89. The text corpus was 313 MB in size. The text was preprocessed to remove ambiguity in the word sequence that a reader might produce when reading it aloud, ensuring that the unread text used to train language models was representative of the spoken test material. The preprocessing included converting numbers into orthographic (spelled-out) form, expanding abbreviations, resolving apostrophes and quotation marks, and marking punctuation.

As another example, the Yarowsky algorithm used bitext data from the DCI to train a simple word-sense disambiguation model that was competitive with advanced models trained on smaller datasets (Gale, William A.; Church, Kenneth W.; Yarowsky, David, December 1992, "A method for disambiguating word senses in a large corpus", Computers and the Humanities 26 (5-6): 415–439, doi:10.1007/bf00136984).
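The WSJ preprocessing steps described above (spelling out numbers, expanding abbreviations, marking punctuation) can be illustrated with a small Python sketch. This is not the actual WSJ CSR preprocessing code: the abbreviation table, the punctuation token names, the 0 to 999 number range, and the function names are all assumptions made for the example.

```python
import re

# Hypothetical abbreviation table; the real rule set is not given in this article.
ABBREVIATIONS = {"Mr.": "Mister", "Corp.": "Corporation", "Inc.": "Incorporated"}

# Illustrative tokens that make punctuation explicit in the word sequence.
PUNCTUATION_NAMES = {",": ",COMMA", ".": ".PERIOD", "?": "?QUESTION-MARK"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else "-" + ONES[ones])
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words if rest == 0 else words + " " + number_to_words(rest)


def normalize(sentence: str) -> str:
    """Map written text to one unambiguous sequence of spoken-form tokens."""
    tokens = []
    for raw in sentence.split():
        # Expand known abbreviations first, since they end in a period.
        if raw in ABBREVIATIONS:
            tokens.append(ABBREVIATIONS[raw])
            continue
        # Separate the word from any trailing punctuation marks.
        match = re.fullmatch(r"(.*?)([,.?]*)", raw)
        word, trailing = match.group(1), match.group(2)
        if word.isascii() and word.isdigit() and int(word) < 1000:
            tokens.append(number_to_words(int(word)))
        elif word:
            tokens.append(word)
        tokens.extend(PUNCTUATION_NAMES[mark] for mark in trailing)
    return " ".join(tokens)


print(normalize("Mr. Smith paid 42 dollars, then left."))
# Mister Smith paid forty-two dollars ,COMMA then left .PERIOD
```

Normalizing in this way means that a written form such as "42" corresponds to exactly one word sequence, so the text used to train a language model matches what a speaker would actually say.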


Distribution

Materials from the ACL/DCI collection were distributed to research groups on a non-commercial basis. By 1990, about 25 research groups and individual researchers had received tapes containing various portions of the collected material. To obtain the data, researchers had to sign an agreement not to redistribute the data or make direct commercial use of it. However, commercial application of "analytical materials" derived from the text, such as statistical tables or grammar rules, was explicitly permitted.

The initiative first distributed data on 12-inch reels of 9-track tape, each of which could hold 30 million words compressed with Lempel-Ziv algorithms, and later on CD-ROMs. The first CD-ROM distribution was in 1991, funded by Dragon Systems Inc. It contained the Collins English Dictionary, WSJ text, scientific abstracts provided by the U.S. Department of Energy, and the Penn Treebank.
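As a rough illustration of why Lempel-Ziv compression suits text distribution, the sketch below compresses and restores a repetitive sample with Python's standard `zlib` module. This is a stand-in under stated assumptions: `zlib` implements DEFLATE (an LZ77 variant combined with Huffman coding), which is not necessarily the Lempel-Ziv variant used on the ACL/DCI tapes, and the sample text and function names are invented for the example.

```python
import zlib


def compress_text(text: str) -> bytes:
    """Compress text with DEFLATE, an LZ77-based scheme (stand-in for the tapes' format)."""
    return zlib.compress(text.encode("utf-8"), level=9)


def decompress_text(blob: bytes) -> str:
    """Restore the original text exactly; Lempel-Ziv compression is lossless."""
    return zlib.decompress(blob).decode("utf-8")


# Invented sample; real corpus text is far less repetitive than this.
sample = "The committee met on Tuesday to review the corpus. " * 200
packed = compress_text(sample)
ratio = len(sample.encode("utf-8")) / len(packed)

print(f"original: {len(sample)} bytes, compressed: {len(packed)} bytes")
print(f"compression ratio: {ratio:.1f}x")
```

Lempel-Ziv methods replace repeated substrings with short references to earlier occurrences, so natural-language text, which reuses words and phrases constantly, shrinks considerably while remaining exactly recoverable.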


See also

* Linguistic Data Consortium
* Penn Treebank
* Text Encoding Initiative
* Computational linguistics
* Natural language processing
* Speech recognition

