A utility to extract strings of text from arbitrary files and identify their languages.

Author(s)

Ralf Brown

Description

An enhanced version of the standard Unix strings(1) program that uses language models for automatic language and character-set identification. It supports over 1300 languages, dozens of character encodings, and 4400+ language/encoding pairs. The 'whatlang' language identifier is a separately compilable module for cases where only language identification is needed, not text extraction.

Availability

Source code in C++.  The included pre-built language models are provided in a textual format.

Support

Maintained by Ralf Brown

IP Agreement

Licensed under GNU GPL version 3.

Prerequisites

Developed and tested using GNU C++ under Linux, but should compile and run under Windows with little or no change.

Required Acknowledgment

Please cite

Ralf D. Brown. "Finding and Identifying Text in 900+ Languages". In Digital Investigation, Volume 9 (2012), pp. S34-S43. (Proceedings of the Twelfth Annual DFRWS Conference, Washington DC, August 6-8, 2012)
DOI: 10.1016/j.diin.2012.05.004

for the full package, and

Ralf D. Brown. "Selecting and Weighting N-Grams to Identify 1100 Languages". In Proceedings of Text, Speech, and Dialogue 2013. Plzen, Czech Republic, September 2013.

for standalone language identification.

