Statistical learning for file-type identification.


Siddharth Gopal, Yiming Yang

File-type Identifier

This folder contains the necessary tools to train a file-type identifier using 
2-gram based features. The training files are assumed to be present in some 
folder (called 'data/' here for convenience) and the extension of the file is 
assumed to be the true file-type of the file.

Training and using the identifier takes 3 steps:

1. Preprocess : Converts the files in a given directory to the 
                corresponding 2-gram feature representation.

	./preprocess [directory without ending /] [output-training-file] [file-ext-to-number-mapping]
	./preprocess data data.svm cmap.txt
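
As a rough illustration of what a 2-gram representation of a file might look like, the sketch below counts overlapping byte pairs and formats them as one sparse line. The feature numbering (256 * first byte + second byte, shifted to start at 1), the "label index:count" line layout, and the function names are assumptions made for illustration; the actual output format of ./preprocess is not documented here.

```python
from collections import Counter


def bigram_counts(data: bytes) -> Counter:
    """Count overlapping byte 2-grams; key = 256 * first_byte + second_byte."""
    return Counter(256 * a + b for a, b in zip(data, data[1:]))


def to_sparse_line(label: int, counts: Counter) -> str:
    """Format counts as 'label index:count ...' with sorted, 1-based indices.

    This line layout is an assumed one, not the documented data.svm format.
    """
    feats = " ".join(f"{k + 1}:{v}" for k, v in sorted(counts.items()))
    return f"{label} {feats}".rstrip()


if __name__ == "__main__":
    sample = b"abab"  # stands in for the raw bytes of one training file
    print(to_sparse_line(1, bigram_counts(sample)))
```

A file of n bytes yields n - 1 overlapping 2-grams, so the feature space has at most 256 * 256 = 65536 dimensions regardless of file size.
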
2. Training :  Trains a dual coordinate descent SVM with a regularized bias 
               term [1], one for each file-type present in the training file. 
               For example,

	./train --dataset_string=typeidentifier --default_C=100 --tw=nnc --train_path=data.svm --save_model=model-file.txt
	--dataset_string = [string-value]
		Give any dummy name.
	--default_C = [non-negative value]
		Regularization constant of the SVM. You need to tune this parameter 
		appropriately using cross-validation. Typical values to try would be
		..., 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1, 5e-1, 
		1, 5, 10, 50, 100, 500, 1000, 5000, 10000, ...
	--tw = [nnc|ltc]
		Normalization technique for the features. I typically found that 'nnc' 
		works better, but it depends on the dataset.
	--train_path = [path-to-training-file]
	--save_model = [path-to-save-model-file]
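
For readers curious about what the trainer in step 2 does, here is a minimal sketch of dual coordinate descent for a binary linear SVM with hinge loss, in the spirit of [1]. It is not the code behind ./train: the function names, the dense list-of-lists input, the fixed number of epochs, and the choice to regularize the bias by appending a constant 1 feature to every example are assumptions made for illustration. A multi-class file-type model would train one such classifier per file type.

```python
import random


def train_dcd(X, y, C=1.0, epochs=50, seed=0):
    """Dual coordinate descent for an L1-loss (hinge) linear SVM.

    X: list of feature lists, y: labels in {-1, +1}.
    The bias is regularized by appending a constant 1.0 feature to every row.
    Returns the weight vector; its last entry is the bias.
    """
    rng = random.Random(seed)
    rows = [list(x) + [1.0] for x in X]             # augmented inputs
    n, d = len(rows), len(rows[0])
    w = [0.0] * d
    alpha = [0.0] * n                               # dual variables, 0 <= alpha_i <= C
    q_diag = [sum(v * v for v in r) for r in rows]  # Q_ii = x_i . x_i (>= 1 here)
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            xi, yi = rows[i], y[i]
            grad = yi * sum(wj * xj for wj, xj in zip(w, xi)) - 1.0
            # Projected gradient: zero when the box constraint blocks the move.
            if alpha[i] <= 0.0:
                pg = min(grad, 0.0)
            elif alpha[i] >= C:
                pg = max(grad, 0.0)
            else:
                pg = grad
            if pg == 0.0:
                continue
            new_alpha = min(max(alpha[i] - grad / q_diag[i], 0.0), C)
            step = (new_alpha - alpha[i]) * yi
            alpha[i] = new_alpha
            w = [wj + step * xj for wj, xj in zip(w, xi)]
    return w


def decision(w, x):
    """Signed score w . [x, 1]; positive means the +1 class."""
    return sum(wj * xj for wj, xj in zip(w, list(x) + [1.0]))


if __name__ == "__main__":
    X = [[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]]
    y = [1, 1, -1, -1]
    w = train_dcd(X, y, C=100.0)
    print([decision(w, x) > 0 for x in X])
```

Each update touches a single dual variable and adjusts w in place, so one pass over the data costs time proportional to the number of non-zero features. A production implementation would use sparse vectors rather than the dense lists shown here.
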
3. Type Identifier (testing) : This tool predicts the file extension of a 
          sequence of bytes using the model file output by the training process. 

	./type_identifier --model_file=model-file.txt --cmap=cmap.txt /path/to/file1 /path/to/file2 ...
	Here model-file.txt is the output of step 2 and cmap.txt is the output of step 1.
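
One plausible way to turn a set of per-type classifiers into a prediction is to score the file's features against each type's weight vector and report the extension with the highest score. The sketch below shows that idea only. The in-memory model layout (class number mapped to sparse weights and a bias), the cmap layout (class number mapped to extension), and the arg-max rule are all assumptions; they are not the documented behaviour of ./type_identifier or the formats of model-file.txt and cmap.txt.

```python
def score(weights: dict, bias: float, features: dict) -> float:
    """Sparse dot product w . x + b over the features present in the file."""
    return bias + sum(weights.get(k, 0.0) * v for k, v in features.items())


def predict_extension(models: dict, cmap: dict, features: dict) -> str:
    """Return the extension whose one-vs-rest classifier scores highest."""
    best = max(models, key=lambda c: score(models[c][0], models[c][1], features))
    return cmap[best]


if __name__ == "__main__":
    # Hypothetical toy model: class number -> (sparse weights, bias).
    models = {
        0: ({24931: 0.8, 25186: 0.1}, -0.2),
        1: ({24931: -0.5, 30000: 0.9}, 0.0),
    }
    cmap = {0: "txt", 1: "bin"}          # hypothetical class-number-to-extension map
    features = {24931: 2.0, 25186: 1.0}  # hypothetical 2-gram feature values
    print(predict_extension(models, cmap, features))
```
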

Potential Problems

1. I have not tested this on very large files (> 1 GB). There could be 
   problems such as integer overflow.

2. This is not meant to be a heavy-duty tool. If you want to try it on very 
   large datasets, feel free to send me an email.

Siddharth Gopal
Carnegie Mellon University
Pittsburgh, PA

[1] Hsieh, C.J., Chang, K.W., Lin, C.J., Keerthi, S.S. and Sundararajan, S. 2008. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, 408-415.
