A corpus of hundreds of languages for training language-identification programs.


Ralf Brown


This is a corpus of training and test data for language identification.  The initial release contains data for modeling 781 languages, with samples (some very tiny) for an additional 310 languages.


Raw data, with scripts to unpack and split the data into training and test sets.
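
For readers who want to see what a training/test split involves, the following is a minimal Python sketch. It is not the corpus's own script: the function name split_lines, the one-sample-per-line input, the 10% default test fraction, and the fixed random seed are all assumptions made for illustration, and the included scripts may work differently.

```python
import random


def split_lines(lines, test_fraction=0.1, seed=0):
    """Return (train, test) lists from an iterable of text samples.

    Hypothetical helper for illustration only; not part of the corpus.
    The shuffle uses a fixed seed so the split is reproducible.
    """
    if not 0.0 <= test_fraction <= 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    # The first n_test shuffled samples become the test set.
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    samples = [f"sentence {i}" for i in range(20)]
    train, test = split_lines(samples, test_fraction=0.25)
    print(len(train), "training samples;", len(test), "test samples")
```

Fixing the seed keeps the two sets identical across runs, which matters when results from different language-identification models are compared on the same test data.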


Maintained.  Release 2 is in preparation.

IP Agreement

The text is licensed under Creative Commons licenses or is in the public domain.  The included scripts are licensed under the GNU GPL version 3.

Required Acknowledgment

Please cite:

  Ralf D. Brown, "Non-linear Mapping for Improved Identification of 1300+ Languages." In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2014).



