A utility to extract strings of text from arbitrary files and identify their languages.


Ralf Brown


An enhanced version of the standard Unix strings(1) program that uses language models for automatic language and character-set identification. It supports over 1300 languages, dozens of character encodings, and 4400+ language/encoding pairs. The 'whatlang' language identifier is a separately compilable module for cases where only language identification is needed, not text extraction.
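For readers unfamiliar with strings(1), the baseline behavior that this package extends can be sketched roughly as follows. This is only an illustration of classic printable-ASCII run extraction, not the package's own code: the function name extract_ascii_runs and the default minimum length of 4 are assumptions made for this example, and the real tool additionally applies language models across many encodings.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of the package described here): collect runs
// of printable ASCII bytes that are at least min_len bytes long, in the
// spirit of the classic strings(1) utility.
std::vector<std::string> extract_ascii_runs(const std::string& data,
                                            std::size_t min_len = 4) {
    std::vector<std::string> runs;
    std::string current;
    for (unsigned char byte : data) {
        // Treat space through tilde, plus tab, as printable.
        const bool printable = (byte >= 0x20 && byte <= 0x7E) || byte == '\t';
        if (printable) {
            current.push_back(static_cast<char>(byte));
        } else {
            // A non-printable byte ends the current run; keep it if long enough.
            if (current.size() >= min_len) runs.push_back(current);
            current.clear();
        }
    }
    // Flush a run that reaches the end of the input.
    if (current.size() >= min_len) runs.push_back(current);
    return runs;
}
```

A caller would read a file's bytes into a std::string and pass it to the function. Fixed byte-range tests like this one only find ASCII text, which is the limitation that model-based language and encoding identification is meant to address.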


Source code in C++.  The included pre-built language models are provided in a textual format.


Maintained by Ralf Brown

IP Agreement

Licensed under the GNU GPL version 3.


Developed and tested using GNU C++ under Linux, but should compile and run under Windows with little or no change.

Required Acknowledgment

Please cite

Ralf D. Brown. "Finding and Identifying Text in 900+ Languages". In Digital Investigation, Volume 9 (2012), pp. S34-S43. (Proceedings of the Twelfth Annual DFRWS Conference, Washington DC, August 6-8, 2012)
DOI: 10.1016/j.diin.2012.05.004

for the full package, and

Ralf D. Brown. "Selecting and Weighting N-Grams to Identify 1100 Languages". In Proceedings of Text, Speech, and Dialogue 2013. Plzen, Czech Republic, September 2013.

for standalone language identification.
