Language-Aware String Extractor (LA-Strings)
Natural Language Processing/Computational Linguistics
Information Retrieval, Text Mining and Analytics
Spoken Interfaces and Dialogue Processing
<p>An enhanced version of the standard Unix strings(1) program that uses language models for automatic language and character-set identification. It supports over 1300 languages, dozens of character encodings, and 4400+ language/encoding pairs. The 'whatlang' language identifier can be compiled as a separate module when only language identification, not text extraction, is needed.</p>
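For context, the baseline behaviour that LA-Strings builds on can be sketched in a few lines of C++. This is only an illustration of what plain strings(1) does (collecting runs of printable ASCII of at least four bytes), not code from LA-Strings itself; the function name `extract_printable_runs` and its `min_len` parameter are hypothetical, and no language model or encoding detection is involved.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: mimics the baseline behaviour of strings(1) by
// collecting runs of printable ASCII that are at least min_len bytes long.
// It does NOT perform language or character-set identification.
std::vector<std::string> extract_printable_runs(const std::string& data,
                                                std::size_t min_len = 4) {
    std::vector<std::string> runs;
    std::string current;

    // Keep the current run if it is long enough, then start a new one.
    auto flush = [&]() {
        if (current.size() >= min_len) runs.push_back(current);
        current.clear();
    };

    for (unsigned char byte : data) {
        // Treat space through '~' and tab as printable (an assumption that
        // roughly matches common strings(1) implementations).
        if ((byte >= 0x20 && byte <= 0x7E) || byte == '\t') {
            current.push_back(static_cast<char>(byte));
        } else {
            flush();
        }
    }
    flush();  // Emit a run that reaches the end of the input.
    return runs;
}
```

A byte-oriented scan like this only recovers ASCII text, which is the limitation the language-model approach described above is meant to address for other languages and encodings.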
Availability (e.g. source code, binary only, XML file, etc.)
<p>Source code in C++. The included pre-built language models are provided in a textual format.</p>
Support Status (e.g. as-is, maintained, etc.)
<p>Maintained by Ralf Brown</p>
Prerequisites (e.g. Windows XP, Java 1.6, etc.)
<p>Developed and tested using GNU C++ under Linux, but should compile and run under Windows with little or no change.</p>
Required Acknowledgement (e.g. paper to cite)
<p>Please cite</p> <p><a><strong>Ralf D. Brown</strong>. "Finding and Identifying Text in 900+ Languages". In <em>Digital Investigation</em>, Volume 9 (2012), pp. S34-S43. (Proceedings of the Twelfth Annual DFRWS Conference, Washington DC, August 6-8, 2012) <br /> <strong>DOI:</strong> 10.1016/j.diin.2012.05.004 </a></p> <p>for the full package, and</p> <p><strong>Ralf D. Brown</strong>. "Selecting and Weighting N-Grams to Identify 1100 Languages". In <em>Proceedings of Text, Speech, and Dialogue 2013</em>. Plzen, Czech Republic, September 2013.</p> <p>for standalone language identification.</p>
License
<p>Licensed under the GNU GPL version 3.</p>
Contact (e.g. e-mail)
<p>ralf @ cs.cmu.edu</p>