This catalogue lists resources developed by faculty and students of the Language Technologies Institute.




Natural Language Processing/Computational Linguistics

Tools

MAP-3.1
a morphological analyser and lexicon system.
HyperGREX
a simple, elegant but feature-filled rule extractor for a variety of syntactic translation formalisms (for use with cdec).
The CMU cross-lingual metaphor detector
a toolkit for identifying instances of figurative language in English and any other language for which a bilingual dictionary is available.
fast_align
a very fast—but pretty effective—unsupervised bilingual word aligner.
creg
a small and fast toolkit for large-scale linear, logistic, and ordinal regression modeling.
SEMAFOR
an open-source statistical frame-semantic parser.
AD3
an approximate MAP decoder.
Language Factory
a set of tools for building language models for many projects.
CMU Statistical Language Modeling Toolkit
language modeling for large amounts of training data.
cdec
a fast, mature decoder, alignment, and modeling toolkit for statistical machine translation and similar structure-prediction problems.
kenlm
KenLM is a fast and low-memory language modeling toolkit that scales to trillions of words.
Language-Aware String Extractor (LA-Strings)
a utility to extract strings of text from arbitrary files and identify their languages.
AMALGrAM
an open-source statistical analyzer for multiword expressions and supersenses in context.

Libraries

NER.pm
an open-source Perl implementation of the SYNERGY NER system.
The KANTOO (kahn'-toe) project
an object-oriented C++ implementation of KANT technology for machine translation.
TurboParser
an open-source, trainable statistical dependency parser.
MSTParserStacked
an open-source, trainable statistical dependency parser based on stacking.
DAGEEM
code for unsupervised dependency grammar induction.
gappy pattern models
code for modeling monolingual and bilingual textual patterns with gaps.
Rampion
a training algorithm for statistical machine translation models.
Twitter NLP
a fast and robust Java-based tokenizer and part-of-speech tagger for Twitter, its training data of manually labeled POS annotated tweets, a web-based annotation tool, and hierarchical word clusters from unlabeled tweets.
Language Factory
a set of tools for building language models for many projects.
kenlm
KenLM is a fast and low-memory language modeling toolkit that scales to trillions of words.
Language-Aware String Extractor (LA-Strings)
a utility to extract strings of text from arbitrary files and identify their languages.

Web Services

Language Factory
a set of tools for building language models for many projects.

Data

universal part-of-speech tagset
set of twelve coarse POS tags that generalizes across several languages.
paper Data sets
for the paper "Crowdsourced Comprehension: Predicting Prerequisite Structure in Wikipedia" with Partha Talukdar from BEA-2012.
Signature and Reply Dataset
617 messages from 20 Newsgroups, annotated for reply bodies and signatures; prepared by Vitor Carvalho.
Email Datasets
two subsets of the Enron data, annotated with person names; prepared by Einat Minkov.
Enron email dataset
(400 MB, once you get there) contains 800,000+ emails from 150+ users, organized into 4,700+ folders.
A collection of various extraction datasets in Minorthird format
including about 1000 Enron emails tagged for person names and temporal expressions.
English adjective supersenses
a 13-class supersense taxonomy of English adjectives developed by Yulia Tsvetkov.
STRAND
parallel text collections from the web.
CURD
the Carnegie Mellon University Recipe Database.
10-K Corpus
company annual reports and stock return volatility data.
political blog corpus
data from five American political blogs during 2007–2008 (released May 29, 2009).
LTI LangID Corpus
a corpus of hundreds of languages for training language-identification programs.

Speech Processing

Tools

Flite
a small, fast run-time speech synthesis engine that provides resource-light, scalable speech synthesis for speech technology applications.
Festvox
documentation, tools and techniques for building synthetic voices for English and other languages.
Ariadne
a spoken dialog system: a domain-independent dialog toolkit for building systems that control applications by speech. Runs under Windows (uses SAPI).
Festival Speech Synthesis System
a general-purpose text-to-speech system offering both a development environment for synthesis techniques and a robust multilingual text-to-speech system.
CMU Sphinx
a speaker-independent large vocabulary continuous speech recognizer. It is also a collection of open source tools and resources that allows researchers and developers to build speech recognition systems.
Language Factory
a set of tools for building language models for many projects.
PDNN
a lightweight deep learning toolkit developed under the Theano environment.
Janus Recognition Toolkit (JRTk)
a general-purpose speech recognition toolkit.
Eesen
Eesen is a toolkit to build speech recognition (ASR) systems in a completely end-to-end fashion.

Libraries

speechlink
an application-layer control protocol for transferring callers between cooperating speech applications, from Scansoft.
Language Factory
a set of tools for building language models for many projects.

Web Services

The CMU Dictionary
a machine-readable pronunciation dictionary for North American English that contains over 125,000 words and their transcriptions.
Language Factory
a set of tools for building language models for many projects.

Data

CMU Chaplain
conversational speech, 4.15 hours, close-talking microphone, 16-bit, 16 kHz. Hand-transcribed. Recorded as role-playing between US Army Chaplains as part of the Tongues Audio Voice Translation project.
Speech Recognition and Synthesis
Speech recognition and synthesis materials.

Others

Interaction in Virtual Worlds VM
Live Speech and Dialog in a Virtual Machine.
TEDLIUM Speech Recognition VM
a VM that contains a complete setup for training and testing a speech recognizer.

Information Retrieval, Text Mining and Analytics

Tools

LightSIDE
Text Mining Toolkit.
TagHelper Tools
automating the Analysis of Conversational Data.
Indri
a search engine that provides state-of-the-art text search and a rich structured query language for text collections of up to 50 million documents (single machine) or 500 million documents (distributed search).

Libraries

Ramnath Balasubramanyan's BlockLDA
regularized latent variable mixed membership modeling.
Ni Lao's PRA method
Relational Retrieval Using a Combination of Path-Constrained Random Walks.
MultiRandomWalk
code for MultiRandomWalk.
Minorthird
an open-source Java package of information extraction and text classification learning tools.
SecondString
an open-source Java package of approximate string matching techniques.

Web Services

Search services for TREC datasets
Search services for TREC datasets.
ClueWeb09 search services
ClueWeb09 search services: interactive, batch, page rendering, and attribute lookup.
ClueWeb12 search services
ClueWeb12 search services: interactive, batch, page rendering, and attribute lookup.

Data

ClueWeb12
A 27 terabyte dataset of about 733 million English web pages, collected between February 10, 2012 and May 10, 2012.

Multimedia

Data

CMU Viral Videos
the largest public viral video dataset collected at Carnegie Mellon University.

Machine Learning

Tools

LightSIDE
Text Mining Toolkit.
PDNN
a lightweight deep learning toolkit developed under the Theano environment.
Cunei Machine Translation Platform
a hybrid machine translation system which puts example-based machine translation on a full statistical-MT footing.
hblr
Bayesian model for large-scale hierarchical classification.
ParallelMLR
Parallel multiclass logistic regression.
TCS
tool for transformation-based probabilistic clustering with supervision.

Libraries

Minorthird
an open-source Java package of information extraction and text classification learning tools.

Data

LTI LangID Corpus
a corpus of hundreds of languages for training language-identification programs.
TEACHER Dataset
educational datasets with course prerequisite links collected from various universities.

Machine Translation

Tools

Meteor
an automatic metric for evaluating and optimizing machine translation systems.
TransCenter
a web-based translation post-editing environment for use with cdec Realtime.
kenlm
KenLM is a fast and low-memory language modeling toolkit that scales to trillions of words.
CMU MEMT
a system to combine the outputs of multiple translation systems into a single improved translation.
CMU-EBMT
a complete end-to-end example-based machine translation system, including all tools to train a translation system from parallel text.
Cunei Machine Translation Platform
a hybrid machine translation system which puts example-based machine translation on a full statistical-MT footing.

Libraries

The KANTOO (kahn'-toe) project
an object-oriented C++ implementation of KANT technology for machine translation.
kenlm
KenLM is a fast and low-memory language modeling toolkit that scales to trillions of words.

Spoken Interfaces and Dialogue Processing

Tools

Olympus 2.5
a dialog toolkit for building Ravenclaw-based dialog systems.

Libraries

Olympus 2.5
a dialog toolkit for building Ravenclaw-based dialog systems.

Data

Let's Go Data
We distribute data totaling over 150,000 dialogs.

Other

Tools

BNT-SM
Bayes Net Toolbox for Student Modeling.
4CAPS
4CAPS Cognitive Neuroarchitecture.
EEG-ML
distribution of EEG-ML toolkit and Reading Tutor EEG data.
Bazaar
Reconfigurable Multi-Party Dialogue Environment.
ZipRec
a tool to recover data from damaged compressed files (ZIP archives, gzip files, and other formats using DEFLATE compression).
filetype-identifier
Statistical learning for file-type identification.

Data

Science 2008 fMRI data
distribution of the 60 Word-Pic fMRI data used in the Science paper.


Send comments and bugs to lti.catalogue AT gmail.com