If you have a problem filling in this form, contact lti.catalogue AT gmail.com.
Fields marked with (*) are required.
The email is only for internal purposes.
The submission will not be considered until you confirm that you own this email address.
You will receive a confirmation email upon submitting. The confirmation email may be filed as junk, so check your spam folder. If you do not receive an email, please contact us.
A proof for not being spam
This information will appear in the catalogue.
Natural Language Processing/Computational Linguistics
Information Retrieval, Text Mining and Analytics
Spoken Interfaces and Dialogue Processing
Keywords (comma separated, internal and not shown to public)
Direct Download Link
(If you provide a direct download link, please also provide the IP agreement, and the Required Acknowledgement)
<p>Siddharth Gopal, Yiming Yang</p>
Availability (e.g. source code, binary only, XML file, etc.)
Support Status (e.g. as-is, maintained, etc.)
Prerequisites (e.g. Windows XP, Java 1.6, etc.)
Required Acknowledgement (e.g. paper to cite)
might be helpful)
<p>Refer to <a href="http://nyc.lti.cs.cmu.edu/software/">http://nyc.lti.cs.cmu.edu/software/</a></p>
Contact (e.g. e-mail)
<pre id="code" class="brush: text; plain-text">File-type Identifier
====================

This folder contains the tools needed to train a file-type identifier using
2-gram based features. The training files are assumed to be in some folder
(called 'data/' here for convenience), and the extension of each file is
assumed to be its true file type.

The workflow has 3 steps:

1. Preprocess : converts the files in a given directory to the corresponding
   2-gram feature representation.

     ./preprocess [directory without ending /] [output-training-file] [file-ext-to-number-mapping]
     ./preprocess data data.svm cmap.txt

2. Training : trains a dual coordinate descent SVM with a regularized bias
   term for each file type present in the training file. For example,

     ./train --dataset_string=typeidentifier --default_C=100 --tw=nnc --train_path=data.svm --save_model=model-file.txt

     --dataset_string = [string-value]
         Any dummy name.
     --default_C = [non-negative value]
         Regularization constant of the SVM. You need to tune this parameter
         using cross-validation. Typical values to try are
         ... 1e-5 , 5e-5 , 1e-4 , 5e-4 , 1e-3 , 5e-3 , 1e-2 , 5e-2 , 1e-1 ,
         5e-1 , 1 , 5 , 10 , 50 , 100 , 500 , 1000 , 5000 , 10000 ...
     --tw = [nnc|ltc]
         Normalization technique for the features. I typically found that
         'nnc' works better, but it depends on the dataset.
     --train_path = [path-to-training-file]
     --save_model = [path-to-save-model-file]

3. Type Identifier (testing) : this tool predicts the file extension of a
   sequence of bytes using the model file output by the training process.

     ./type_identifier --model_file=model-file.txt --cmap=cmap.txt /path/to/file1 /path/to/file2 ...

   model-file.txt comes from step 2 and cmap.txt comes from step 1.

Potential Problems
==================

1. I have not tested this on very large files (> 1GB). There could be
   problems such as integer overflow.
2. This is not meant to be a heavy-duty tool. If you want to try this on very
   large datasets, feel free to send me an email.

Siddharth Gopal
Carnegie Mellon University
Pittsburgh, PA

Hsieh, C.J., Chang, K.W., Lin, C.J., Keerthi, S.S. and Sundararajan, S. 2008.
A dual coordinate descent method for large-scale linear SVM. In Proceedings of
the 25th International Conference on Machine Learning, 408-415.
</pre> <p> </p>
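The 2-gram feature representation mentioned in the Preprocess step can be illustrated with a minimal Python sketch. It is not the code behind ./preprocess. The function names, the feature index scheme (256 * first byte + second byte + 1), the use of raw counts, the sparse "label index:count" line layout, and the way extensions are numbered are all assumptions made for illustration, and the actual tool may differ on every one of them.

```python
import os
from collections import Counter


def bigram_counts(data: bytes) -> Counter:
    """Count overlapping byte 2-grams in `data`.

    Assumed index scheme (hypothetical): 256 * first + second + 1,
    which yields 1-based indices in the range 1..65536.
    """
    return Counter(256 * a + b + 1 for a, b in zip(data, data[1:]))


def to_svm_line(label: int, counts: Counter) -> str:
    """Format one example as 'label index:count ...' with sorted indices.

    This sparse layout is an assumption; the real data.svm format is
    not documented in the README.
    """
    feats = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
    return f"{label} {feats}".rstrip()


def extension_label(path: str, cmap: dict) -> int:
    """Map a file extension to a number, extending `cmap` on first sight.

    Hypothetical stand-in for the file-ext-to-number mapping (cmap.txt).
    """
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return cmap.setdefault(ext, len(cmap) + 1)


if __name__ == "__main__":
    cmap = {}
    label = extension_label("data/example.txt", cmap)
    print(to_svm_line(label, bigram_counts(b"abab")))
```

Using overlapping 2-grams keeps the feature space fixed at 256 * 256 dimensions regardless of file size, which is why a linear SVM over these counts is a natural fit.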
Additional Comments (internal and not shown to public)