恶意PDF检测中的特征工程研究与改进

Research and Improvement of Feature Engineering for Malicious PDF Detection

  • 摘要: 在基于机器学习的恶意PDF检测中,现有特征容易引起混淆或逃逸。为了提高特征的准确性和鲁棒性,在现有方法的基础上研究和改进特征提取方法,结合内容特征、结构特征以及逻辑树的间接结构特征,通过分析特征重要性进行特征选择,最后应用分类算法实现恶意PDF检测。结构特征包括多个高频次叶子节点数量;内容特征包括元数据特征、字节熵值、流字节比例等特征。收集实验数据集,提取特征并分析,最终选择出58维特征,使用LightGBM算法训练梯度提升决策树模型,测试准确率为99.9%,优于其他方法。另外,模拟攻击部分样本的特征,生成对抗样本,检测准确率同样达到99.2%。

     

    Abstract: In malicious portable document format (PDF) detection based on machine learning, the existing features are easy to be confused or escaped. In order to improve the accuracy and robustness of features, this paper studies and improves the feature extraction method based on the existing methods. Combining the content features, structure features and indirect structure features of document object model (DOM) trees, the feature is selected by analyzing the importance of features and finally the malicious PDF detection is realized by using classification algorithm. The structural features are the number of leaf nodes with high-frequency. Content features includes metadata features, byte entropy, stream byte ratio, etc. The improved feature extraction method can avoid the problems of confusion and escape, and improve the accuracy and robustness of features. In the experiments, we extracted and analyzed features from the collected dataset, 58-dim features with high-importance were selected. Then we used LightGBM algorithm to train gradient boosting decision tree. The testing accuracy of this model reaches 99.9%, which is superior to the other methods. In addition, the features of some adversarial samples are simulated, and the detection accuracy is about 99.2%.

     

/

返回文章
返回