CPACA: A Probability-Model Chinese Word Segmentation Algorithm Based on the Aho-Corasick Automaton Algorithm

  • Abstract: The Aho-Corasick automaton is a well-known multi-pattern string matching algorithm. When a pattern fails to match, the automaton follows the fail pointer to a valid subsequent state, and one or more such states may exist. Building on this property, this paper proposes an automaton algorithm adapted to Chinese word segmentation. The algorithm uses dynamic programming to compute the context matching probability of the current pattern and transitions to the valid subsequent state with the maximum probability, thereby combining mechanical segmentation based on string matching with segmentation based on statistical probability models. Experimental results show that the algorithm achieves high segmentation accuracy.
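The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea it describes, not the paper's CPACA implementation. It assumes a simpler stand-in: an Aho-Corasick automaton enumerates every dictionary word occurring in the sentence, and a dynamic program then picks the segmentation with the highest product of unigram word probabilities. The function names (`build_automaton`, `find_matches`, `segment`), the `unknown_logp` penalty, and the word frequencies are all hypothetical and chosen for illustration.

```python
import math
from collections import deque


def build_automaton(words):
    """Build an Aho-Corasick automaton: trie transitions, fail links, outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix in the trie
    out = [[]]    # out[state] -> lengths of dictionary words ending at this state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(len(word))
    # Breadth-first pass: depth-1 states keep fail = 0; deeper states are derived.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_matches(text, goto, fail, out):
    """Return (start, end) spans of every dictionary word occurring in text."""
    spans = []
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]          # mismatch: follow fail pointers
        state = goto[state].get(ch, 0)
        for length in out[state]:
            spans.append((i + 1 - length, i + 1))
    return spans


def segment(text, freq, unknown_logp=-20.0):
    """Pick the max-probability segmentation under an assumed unigram model."""
    total = sum(freq.values())
    goto, fail, out = build_automaton(freq)
    n = len(text)
    starts_by_end = [[] for _ in range(n + 1)]
    for start, end in find_matches(text, goto, fail, out):
        starts_by_end[end].append(start)
    best = [0.0] + [-math.inf] * n   # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word
    for end in range(1, n + 1):
        # Fallback: an out-of-dictionary character becomes a one-character token.
        best[end] = best[end - 1] + unknown_logp
        back[end] = end - 1
        for start in starts_by_end[end]:
            score = best[start] + math.log(freq[text[start:end]] / total)
            if score > best[end]:
                best[end], back[end] = score, start
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


# Hypothetical frequencies, for illustration only.
FREQ = {"研究": 50, "研究生": 20, "生命": 40, "命": 5, "起源": 30}
print(segment("研究生命起源", FREQ))
```

With these made-up frequencies, the sentence "研究生命起源" has two dictionary-consistent readings, and the dynamic program prefers 研究 / 生命 / 起源 over 研究生 / 命 / 起源 because the former has the higher total log-probability. A purely mechanical longest-match segmenter would take 研究生 first, which is the kind of ambiguity a probability model is meant to resolve.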

     
