Editing Enron email dataset
If you have a problem filling in this form, contact lti.catalogue AT gmail.com.
Fields marked with (*) are required.
The email is only for internal purposes.
The submission will not be considered until you confirm that you own this email address.
You will receive a confirmation email after submitting. The confirmation email may land in your junk mail, so check your spam folder. If you do not receive an email, please contact us.
A proof for not being spam
This information will appear in the catalogue.
Natural Language Processing/Computational Linguistics
Information Retrieval, Text Mining and Analytics
Spoken Interfaces and Dialogue Processing
Keywords (comma separated, internal and not shown to public)
Direct Download Link
(If you provide a direct download link, please also provide the IP agreement, and the Required Acknowledgement)
<p>This dataset was collected and prepared by the <a href="http://www.ai.sri.com/project/CALO">CALO Project</a> (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally <a href="http://www.salon.com/news/feature/2003/10/14/enron/index_np.html">made public, and posted to the web</a>, by the <a href="http://www.ferc.gov">Federal Energy Regulatory Commission</a> during its investigation.</p> <p>The email dataset was later purchased by <a href="http://www.ai.mit.edu/people/lpk/lpk.html">Leslie Kaelbling</a> at MIT, and turned out to have a number of integrity problems. A number of folks at SRI, notably <a href="http://www.ai.sri.com/people/gervasio">Melinda Gervasio</a>, worked hard to correct these problems, and it is thanks to them (not me) that the dataset is available. The dataset here does not include attachments, and some messages have been deleted "as part of a redaction effort due to requests from affected employees". Invalid email addresses were converted to something of the form email@example.com whenever possible (i.e., when the recipient is specified in some parse-able format like "Doe, John" or "Mary K. Smith") and to firstname.lastname@example.org when no recipient was specified.</p> <p>I get a number of questions about this corpus each week, which I am unable to answer, mostly because they deal with preparation issues and such that I just don't know about. If you ask me a question and I don't answer, please don't feel slighted.</p> <p>I am distributing this dataset as a resource for researchers who are interested in improving current email tools, or understanding how email is currently used. This data is valuable; to my knowledge it is the only substantial collection of "real" email that is public. Other datasets are not public because of privacy concerns. 
In using this dataset, please be sensitive to the privacy of the people involved (and remember that many of these people were certainly not involved in any of the actions which precipitated the investigation).</p> <ul> <li><span style="text-decoration: line-through;">March 2, 2004 Version of dataset</span> and the <span style="text-decoration: line-through;">August 21, 2009 Version of dataset</span> are <strong>no longer being distributed.</strong> If you are using this dataset for your work, you are requested to replace it with the newer version of the dataset below, or make <a href="http://www.cs.cmu.edu/%7Eenron/DELETIONS.txt">the appropriate changes</a> to your local copy. A total of four messages have been removed since the original version of the dataset.</li> <li><a href="http://www.cs.cmu.edu/%7Eenron/enron_mail_20110402.tgz">April 2, 2011 Version of dataset</a> (about 423 MB, tarred and gzipped).</li> </ul> <p>There are also at least two on-line databases that allow you to search the data, at <a href="http://www.enronemail.com">Enronemail.com</a> and <a href="http://orange.sims.berkeley.edu/%7Eatf/enron/enron.cgi">UCB</a>.</p>
Availability (e.g. source code, binary only, XML file, etc.)
Support Status (e.g. as-is, maintained, etc.)
Prerequisites (e.g. Windows XP, Java 1.6, etc.)
Required Acknowledgement (e.g. paper to cite)
might be helpful)
Contact (e.g. e-mail)
Additional Comments (internal and not shown to public)