Enron Corpus

The Enron Corpus is a database of over 600,000 emails generated by 158 employees [1] of the Enron Corporation in the years leading up to the company's collapse in December 2001. The corpus was compiled from Enron's email servers by the Federal Energy Regulatory Commission (FERC) during its subsequent investigation. [2] A copy of the email database was later purchased for $10,000 by Andrew McCallum, a computer scientist at the University of Massachusetts Amherst. [3] He released this copy to researchers, providing a trove of data that has been used for studies on social networking and computer-mediated communication.

Creation

In the legal investigation into Enron's collapse, the discovery process required collecting and preserving vast amounts of data, for which the FERC hired Aspen Systems (now part of Lockheed Martin). The emails were collected at Enron Corporation headquarters in Houston during two weeks in May 2002 by Joe Bartling, [4] a litigation support and data analysis contractor for Aspen. In addition to the Enron employee emails, all of Enron's enterprise database systems, [5] hosted in Oracle databases on Sun Microsystems servers, were captured and preserved, including its online energy trading platform, EnronOnline.

Once collected, the Enron emails were processed and hosted in proprietary electronic discovery platforms (first Concordance, then iCONECT) for review by investigators from the FERC, the Commodity Futures Trading Commission, and the Department of Justice. At the conclusion of the investigation, and upon the issuance of the FERC staff report, [6] the emails and information collected were deemed to be in the public domain, to be used for historical research and academic purposes. The email archive was made publicly available and searchable via the web using iCONECT 24/7, but the sheer volume of the data, over 160 GB, made it impractical to use. Copies of the collected emails and databases were made available on hard drives.

Jitesh Shetty and Jafar Adibi from the University of Southern California processed the data in 2004 and released a MySQL version. [7] In 2010, EDRM.net published a revised and expanded version 2 of the corpus, [8] containing over 1.7 million messages, which has been made available on Amazon S3 for easy access by researchers.

Exploitation

A visualization of the email network in the Enron Corpus, with coloring representing eight communities

The corpus is valued as one of the few mass collections of real emails that are publicly available for study; such collections are typically bound by numerous privacy and legal restrictions, such as non-disclosure agreements and data sanitization requirements, which render them prohibitively difficult to access. [3] Using their MySQL version, Shetty and Adibi published a link analysis of which user accounts emailed which others. [9] Linguistic comparison with more recent email corpora shows changes in the email register of English. The corpus is also used as test or training data for research in natural language processing and machine learning. [10]
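
The simplest form of such link analysis, tallying who emailed whom, can be sketched in a few lines of Python. The example below is illustrative only and is not the graph-entropy method of Shetty and Adibi: it assumes messages are available as raw RFC 5322 text, and the sample messages, the `example.com` addresses, and the `count_edges` helper are all hypothetical.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses

# Hypothetical messages for illustration only; not taken from the corpus.
SAMPLE_MESSAGES = [
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "Subject: Q3 numbers\n\nDraft attached.",
    "From: bob@example.com\n"
    "To: alice@example.com\n"
    "Cc: carol@example.com\n"
    "Subject: Re: Q3 numbers\n\nLooks fine.",
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Lunch\n\nNoon?",
]


def count_edges(raw_messages):
    """Count (sender, recipient) pairs across raw message strings."""
    edges = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        # getaddresses returns (display name, address) tuples.
        senders = getaddresses(msg.get_all("From", []))
        recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
        for _, sender in senders:
            for _, recipient in recipients:
                if sender and recipient:
                    edges[(sender.lower(), recipient.lower())] += 1
    return edges


edges = count_edges(SAMPLE_MESSAGES)
for (sender, recipient), n in edges.most_common():
    print(f"{sender} -> {recipient}: {n}")
```

The resulting counts form a weighted, directed graph of accounts, which is the kind of structure on which community detection or node-importance measures would then operate.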

References

  1. Klimt, Bryan; Yang, Yiming (2004). "The Enron Corpus: A New Dataset for Email Classification Research". pp. 217–226. CiteSeerX 10.1.1.61.1645.
  2. "The Enron Email Corpus". Archived 2011-03-08 at the Wayback Machine. Retrieved March 5, 2011.
  3. Markoff, John. "Armies of Expensive Lawyers, Replaced by Cheaper Software". New York Times. March 5, 2011. p. A1.
  4. Bartling, Joe (September 3, 2015). "The Enron Data Set - Where Did It Come From?". Bartling Forensic and Advisory. Retrieved September 3, 2015.
  5. "FERC: Industries - Enron's Energy Trading Business Process and Databases". www.ferc.gov. Archived from the original on 2020-01-05. Retrieved 2015-09-02.
  6. FERC Staff Report - Price Manipulation in Western Markets - Findings at a Glance Archived 2006-02-21 at the Wayback Machine (3-26-2003)
  7. "Enron processed database"
  8. Socha, George. "EDRM Enron Email Data Set v2 Now Available". EDRM.net. Archived from the original on 2011-09-04. Retrieved 2012-09-03.
  9. Shetty, Jitesh; Adibi, Jafar (2005). "Discovering important nodes through graph entropy: the case of Enron email database". Proceedings of the 3rd international workshop on Link discovery - LinkKDD '05. pp. 74–81. doi:10.1145/1134271.1134282. ISBN 978-1595932150. S2CID 10122735.
  10. Friginal, Eric; Hardy, Jack (2013). Corpus-Based Sociolinguistics: A Guide for Students. Routledge. p. 167. ISBN 978-1-136-29277-4. Retrieved 29 May 2020.