Editing Email Datasets
If you have a problem filling in this form, contact lti.catalogue AT gmail.com.
Fields marked with (*) are required.
The email is only for internal purposes.
The submission will not be considered until you confirm that you own this email.
You will receive a confirmation email upon submitting. The confirmation email may land in your junk mail, so check your spam folder. If you do not receive an email, please contact us.
A proof for not being spam
This information will appear in the catalogue.
Natural Language Processing/Computational Linguistics
Information Retrieval, Text Mining and Analytics
Spoken Interfaces and Dialogue Processing
Keywords (comma separated, internal and not shown to public)
Direct Download Link
(If you provide a direct download link, please also provide the IP agreement, and the Required Acknowledgement)
<p>Due to privacy issues, it is very hard to get hold of large and realistic email corpora. Here you can find two email datasets, as well as a dataset of newsgroup text, <span style="text-decoration: underline;">annotated with personal name spans</span>.<br /> <br /> The full description of these datasets, including relevant statistics and references, is available in:<br /> <br /> Einat Minkov, <a href="http://www.rcwang.com/">Richard C. Wang</a> &amp; <a href="http://www-2.cs.cmu.edu/%7Ewcohen/">William W. Cohen</a>,<strong> Extracting Personal Names from Emails: Applying Named Entity Recognition to Informal Text</strong>, <em>in HLT/EMNLP 2005</em> <a href="http://www.cs.cmu.edu/%7Eeinat/email.pdf">(PDF)</a> <br /> <br /> <strong>Some quick details:</strong></p>
<ul>
<li>The email corpora given here were extracted from the Enron corpus, made public by the Federal Energy Regulatory Commission. A version of this data was later purchased by the <a href="http://www.ai.sri.com/project/CALO">CALO</a> project, and made available for research purposes.</li>
<li>The first dataset, <em>'Enron-Meetings'</em>, consists of all messages located in folders named "meetings" or "calendar" (excluding a few very large files). Most of these messages are meeting related. The second subset, <em>'Enron-Random'</em>, was formed by uniformly sampling a user name (out of 158 users) and then randomly sampling an email from that user.</li>
<li>As a second type of informal text, we also annotated a collection of newsgroup postings. The <em>'Newsgroups'</em> dataset was extracted from the 20Newsgroups corpus by <a href="http://www.cs.cmu.edu/%7Evitor/">Vitor R. Carvalho</a>.</li>
<li>These datasets are given here in a <a href="http://minorthird.sourceforge.net/">Minorthird</a> format (plain text, with separate label files), as well as in a 'general' format, where the personal name labels are embedded in the text using XML tags.</li>
<li>The zipped files unpack into a directory tree. The separation into train and test folders corresponds to the data splits described in the paper cited above. Further separation is for convenience purposes.</li>
</ul>
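The two-stage sampling described for 'Enron-Random' (pick a user uniformly, then pick one of that user's emails at random) can be sketched roughly as follows. This is an illustration only, not the authors' script: the function name `sample_enron_random`, the `mailboxes` mapping, the fixed seed, and sampling with replacement are all assumptions made for the example.

```python
import random


def sample_enron_random(mailboxes, n, seed=0):
    """Draw n (user, email) pairs by two-stage sampling.

    mailboxes: hypothetical dict mapping a user name to a list of that
    user's emails. Sampling is done with replacement, which is an
    assumption; the description above does not say either way.
    """
    rng = random.Random(seed)
    # Stage 1 candidates: every user with at least one email.
    users = sorted(user for user, emails in mailboxes.items() if emails)
    sample = []
    for _ in range(n):
        user = rng.choice(users)             # uniform over users
        email = rng.choice(mailboxes[user])  # uniform over that user's emails
        sample.append((user, email))
    return sample


# Toy data standing in for the 158 Enron mailboxes.
toy = {
    "user-a": ["a1.txt", "a2.txt", "a3.txt"],
    "user-b": ["b1.txt"],
    "user-c": [],
}
picked = sample_enron_random(toy, 5)
```

Note that sampling users first gives every user the same chance of appearing regardless of mailbox size, so an individual email from a small mailbox is more likely to be drawn than one from a large mailbox.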
Availability (e.g. source code, binary only, XML file, etc.)
Support Status (e.g. as-is, maintained, etc.)
Prerequisites (e.g. Windows XP, Java 1.6, etc.)
Required Acknowledgement (e.g. paper to cite)
might be helpful)
Contact (e.g. e-mail)
Additional Comments (internal and not shown to public)