Statistical learning for file-type identification.
Siddharth Gopal, Yiming Yang
Refer to http://nyc.lti.cs.cmu.edu/software/
File-type Identifier
====================

This folder contains the tools needed to train a file-type identifier using 2-gram based features. The training files are assumed to be present in some folder (called 'data/' here for convenience), and the extension of each file is assumed to be its true file-type.

There are 3 steps:

1. Preprocess : converts all the files in a given directory to the corresponding 2-gram feature representation.

    ./preprocess [directory without ending /] [output-training-file] [file-ext-to-number-mapping]
    ./preprocess data/ data.svm cmap.txt

2. Training : trains a dual coordinate descent SVM with a regularized bias term for each file-type present in the training file. For example,

    ./train --dataset_string=typeidentifier --default_C=100 --tw=nnc --train_path=data.svm --save_model=model-file.txt

    --dataset_string = [string-value] any dummy name.
    --default_C = [non-negative value] regularization constant of the SVM. You need to tune this parameter appropriately using cross-validation. Typical values to try would be ... 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1, 5e-1, 1, 5, 10, 50, 100, 500, 1000, 5000, 10000 ...
    --tw = [nnc|ltc] normalization technique for the features. Typically I found 'nnc' works better, but it depends on the dataset.
    --train_path = [path-to-training-file]
    --save_model = [path-to-save-model-file]

3. Type Identifier (testing) : this tool predicts the file-extension of some sequence of bytes using the model file output by the training process.

    ./type_identifier --model_file=model-file.txt --cmap=cmap.txt /path/to/file1 /path/to/file2 ...

   The model-file.txt was obtained from step 2 and cmap.txt from step 1.

Potential Problems
==================

1. I have not tested this on very large files (> 1GB). There could be problems such as integer overflow.
2. This is not meant to be a heavy-duty tool. If you want to try this on very large datasets, feel free to shoot me an email.
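As a rough illustration of what a 2-gram feature representation of a file can look like, the Python sketch below counts overlapping pairs of adjacent bytes. It is not the code behind ./preprocess: the function names and the index layout (first byte * 256 + second byte) are assumptions made for this example only, and the actual format of the training file is not described here.

```python
from collections import Counter


def byte_bigram_counts(data: bytes) -> Counter:
    """Count every overlapping pair of adjacent bytes in `data`."""
    # zip(data, data[1:]) yields (b0, b1), (b1, b2), ... as pairs of ints.
    return Counter(zip(data, data[1:]))


def sparse_features(data: bytes) -> dict:
    """Map each observed byte pair to a feature index and its count.

    The index first * 256 + second is an assumed layout for illustration;
    it gives each of the 65536 possible byte pairs a distinct integer.
    """
    counts = byte_bigram_counts(data)
    return {first * 256 + second: n
            for (first, second), n in sorted(counts.items())}


def features_for_file(path: str) -> dict:
    """Read a file as raw bytes and return its sparse 2-gram counts."""
    with open(path, "rb") as handle:
        return sparse_features(handle.read())
```

Calling features_for_file on each training file would give one sparse count vector per file. A file with fewer than two bytes yields an empty vector, since it contains no byte pairs.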
Siddharth Gopal
Carnegie Mellon University
Pittsburgh, PA

Hsieh, C.J., Chang, K.W., Lin, C.J., Keerthi, S.S. and Sundararajan, S. 2008. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, 408-415.