A utility to extract strings of text from arbitrary files and identify their languages.
An enhanced version of the standard Unix strings(1) program that uses language models for automatic language and character-set identification. It supports over 1300 languages, dozens of character encodings, and 4400+ language/encoding pairs. The 'whatlang' language identifier can be compiled separately as a module when only language identification, not text extraction, is needed.
Source code in C++. The included pre-built language models are provided in a textual format.
Maintained by Ralf Brown.
Licensed under the GNU GPL version 3.
Developed and tested using GNU C++ under Linux, but should compile and run under Windows with little or no change.