The Enron Corpus is a database of over 600,000 emails generated by 158 employees of the Enron Corporation in the years leading up to the company's collapse in December 2001. The corpus was generated from Enron email servers by the Federal Energy Regulatory Commission (FERC) during its subsequent investigation. A copy of the email database was later purchased for $10,000 by Andrew McCallum, a computer scientist at the University of Massachusetts Amherst (Markoff, John. "Armies of Expensive Lawyers, Replaced by Cheaper Software". New York Times, March 5, 2011, p. A1). He released this copy to researchers, providing a trove of data that has been used for studies on social networking and computer-mediated communication.


Creation

In the legal investigation into Enron's collapse, the discovery process required collecting and preserving vast amounts of data, for which the FERC hired Aspen Systems (now part of Lockheed Martin). The emails were collected at Enron Corporation headquarters in Houston during two weeks in May 2002 by Joe Bartling, a litigation support and data analysis contractor for Aspen. In addition to the Enron employee emails, all of Enron's enterprise database systems, hosted in Oracle databases on Sun Microsystems servers, were captured and preserved, including its online energy trading platform, EnronOnline. Once collected, the Enron emails were processed and hosted in proprietary electronic discovery platforms (first Concordance, then iCONECT) for review by investigators from the FERC, the Commodity Futures Trading Commission, and the Department of Justice. At the conclusion of the investigation, and upon the issuance of the FERC staff report, the emails and information collected were deemed to be in the public domain, to be used for historical research and academic purposes. The email archive was made publicly available and searchable via the web using iCONECT 24/7, but its sheer volume, over 160 GB of email, made it impractical to use. Copies of the collected emails and databases were made available on hard drives. Jitesh Shetty and Jafar Adibi from the University of Southern California processed the data in 2004 and released a MySQL version. In 2010, EDRM.net published a revised and expanded version 2 of the corpus, containing over 1.7 million messages, which has been made available on Amazon S3 for easy access by researchers.


Exploitation

The corpus is valued as one of the few large collections of real emails that are publicly available for study; such collections are typically bound by numerous privacy and legal restrictions, such as non-disclosure agreements and data sanitization requirements, which make them prohibitively difficult to access. Shetty and Adibi, working from their MySQL version, published a link analysis of which user accounts emailed which others. Linguistic comparison with more recent email corpora shows changes in the email register of English. The corpus is also used as test or training data for research in natural language processing and machine learning.
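As an illustration of the kind of link analysis mentioned above, the following is a minimal sketch that counts how often each sender address wrote to each recipient address. It is not Shetty and Adibi's method, which worked from a MySQL database; it assumes the messages are available as raw RFC 822 text, and the function name `count_links` and the `SAMPLE` messages are hypothetical, made up for the example rather than taken from the corpus.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def count_links(raw_messages):
    """Count directed (sender, recipient) pairs across raw RFC 822 messages."""
    links = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        # getaddresses() splits header values into (display name, address) pairs.
        senders = [addr.lower() for _, addr in getaddresses(msg.get_all("From", []))]
        recipients = [
            addr.lower()
            for header in ("To", "Cc", "Bcc")
            for _, addr in getaddresses(msg.get_all(header, []))
        ]
        for sender in senders:
            for recipient in recipients:
                if sender and recipient:  # skip addresses that failed to parse
                    links[(sender, recipient)] += 1
    return links


# Hypothetical sample messages (assumption: not real corpus data).
SAMPLE = [
    "From: alice@example.com\nTo: bob@example.com, carol@example.com\nSubject: a\n\nhi\n",
    "From: alice@example.com\nTo: bob@example.com\nSubject: b\n\nhello\n",
    "From: bob@example.com\nTo: alice@example.com\nSubject: c\n\nre\n",
]

if __name__ == "__main__":
    for (sender, recipient), n in count_links(SAMPLE).most_common():
        print(f"{sender} -> {recipient}: {n}")
```

The resulting counter can be read as a weighted directed graph, with accounts as nodes and message counts as edge weights, which is the usual starting point for social network studies of an email collection.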


External links


* Tutorial on data modeling with the Enron Corpus
* Shetty and Adibi's Enron email dataset download on S3 (178 MB)
* Nathan Heller, "What the Enron E-mails Say About Us", The New Yorker, July 24, 2017
* Searchable Enron Email Database (requires registration)
* Open Test Search: searchable corpus of all email attachments, used to compare different enterprise search engines