Abstract:
In malicious portable document format (PDF) detection based on machine learning, the existing features are easy to be confused or escaped. In order to improve the accuracy and robustness of features, this paper studies and improves the feature extraction method based on the existing methods. Combining the content features, structure features and indirect structure features of document object model (DOM) trees, the feature is selected by analyzing the importance of features and finally the malicious PDF detection is realized by using classification algorithm. The structural features are the number of leaf nodes with high-frequency. Content features includes metadata features, byte entropy, stream byte ratio, etc. The improved feature extraction method can avoid the problems of confusion and escape, and improve the accuracy and robustness of features. In the experiments, we extracted and analyzed features from the collected dataset, 58-dim features with high-importance were selected. Then we used LightGBM algorithm to train gradient boosting decision tree. The testing accuracy of this model reaches 99.9%, which is superior to the other methods. In addition, the features of some adversarial samples are simulated, and the detection accuracy is about 99.2%.