LSA和MD5算法在垃圾邮件过滤系统的应用研究

Research of Spam Filtering System Based on Latent Semantic Analysis and MD5

  • 摘要: 随着对垃圾邮件问题的普遍关注,针对目前邮件过滤方法中存在着的语义缺失现象和处理群发型垃圾邮件低效问题,提出一种基于潜在语义分析(LSA)和信息-摘要算法5(MD5)的垃圾邮件过滤模型。利用潜在语义分析标注垃圾邮件中潜在特征词,从而在过滤技术中引入语义分析;利用MD5在LSA分析基础上,对群发型垃圾邮件生成"邮件指纹",解决过滤技术在处理群发型垃圾邮件中低效的问题。结合该模型设计了一个垃圾邮件过滤系统。采用自选数据集对文中设计的系统进行测试评估,经与Naïve Bayes算法过滤器进行比较,证明该方法在垃圾邮件过滤上优于Naïve Bayes方法,实验结果达到了预期的效果,验证了该方法的可行性、优越性。

     

    Abstract: Along with the widespread concern of spam problem, at present, there are spam filtering system about the problem of semantic imperfection and spam filter low effect in the multi-send spam. This paper proposes a model of spam filtering which based on Latent Semantic Analysis (LSA) and Message-Digest algorithm 5 (MD5). By making use of the LSA marks the latent feature phrase in the spam, a semantic analysis is introduced into the spam filtering technique; the "e-mail fingerprint" of multi-send spam is born with MD5 on the LSA analytical foundation, the problem of filtering technique's low effect in the multi-send spam is resolved with this kind of method. We design a spam filtering system based on this model. This system is evaluated with an optional dataset. The results obtained are compared with Naïve Bayes algorithm filter experiment results. The experiments show the expected results, and the feasibility and advantage of the new spam filtering method is validated.

     

/

返回文章
返回