This is a corpus of training and test data for language identification. The initial release contains data for modeling 781 languages, with samples (some of them very small) for an additional 310 languages.
Raw data, with scripts to unpack and split the data into training and test sets.
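As a rough illustration of what splitting into training and test sets involves, here is a minimal Python sketch. It is not the corpus's own script: the function name `split_lines`, the `test_fraction` and `seed` parameters, and the one-sample-per-line assumption are all hypothetical.

```python
import random


def split_lines(lines, test_fraction=0.1, seed=0):
    """Shuffle lines reproducibly and return (train, test) lists.

    Hypothetical helper for illustration only; the scripts shipped
    with the corpus may split the data differently.
    """
    rng = random.Random(seed)   # fixed seed gives a repeatable split
    shuffled = list(lines)      # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    if len(shuffled) < 2:
        # Too little text to hold anything out: keep it all for training.
        return shuffled, []
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Made-up sample lines standing in for one language's text.
sample = [f"sentence {i}" for i in range(20)]
train, test = split_lines(sample, test_fraction=0.2, seed=42)
print(len(train), len(test))
```

Holding out at least one line whenever two or more are available keeps very small samples usable for testing, while a single-line sample is left entirely in the training set.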
Maintained. Release 2 is in preparation.
Text is licensed under Creative Commons licenses or is in the public domain. The included scripts are licensed under the GNU GPL version 3.
Ralf D. Brown, "Non-linear Mapping for Improved Identification of 1300+ Languages." In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2014).