Research on a French Term Extraction Method Based on the FP Sequence Tree

于娟; 吴晓鹏; 廖晓; 刘建国

doi:10.12178/1001-0548.2020273

Extracting Terms from French Corpora with FP Sequence Tree

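Summary: The complex grammar and inflection rules of French mean that term extraction methods such as N-gram cannot guarantee good results, which affects the accuracy of French text mining. This paper proposes an efficient French term extraction method that automatically obtains a set of terms, including both single words and phrases, from the French text to be analyzed, and so builds the lexicon needed for French text mining. The method compresses the word co-occurrence information in the text into an FP sequence tree structure, rapidly extracts frequent word sequences, and computes their termhood to obtain the set of French terms. Experiments show that the method reaches an accuracy of up to 90% and a higher recall than existing French term extraction methods, so it can effectively support French text mining applications.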

Abstract: French is one of the working languages of the United Nations. Its complex grammar and inflection rules make term extraction methods such as N-gram unreliable, which in turn reduces the accuracy of French text mining. This paper proposes an effective and efficient French term extraction method that extracts words and phrases from the French text corpora being analyzed and provides a complete lexicon for French text mining. First, the word co-occurrence information of the corpora is compressed into an FP (Frequent Pattern) sequence tree so that frequent word sequences can be extracted rapidly; then the termhood of each frequent word sequence is calculated to obtain the term set. The FP sequence tree is a newly designed data structure that reduces the time complexity of word co-occurrence statistics to linear time. Experiments show that the proposed method achieves an accuracy of approximately 90% with a higher recall than existing French term extraction methods, and thus has good potential for French text mining applications.
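
The two-stage pipeline described in the abstract (compress word co-occurrence into a tree, then score frequent word sequences) can be illustrated with a minimal Python sketch. This is not the paper's FP sequence tree: it assumes a plain prefix tree over word windows of bounded length, and the names `Node`, `build_tree`, `frequent_sequences`, `cohesion`, `max_len` and `min_count` are all hypothetical. The abstract does not define the termhood measure, so `cohesion` is only a stand-in score chosen for illustration.

```python
class Node:
    """One node of a word-sequence prefix tree (hypothetical structure).

    count is the number of times the word path from the root to this
    node occurs in the corpus.
    """

    def __init__(self):
        self.count = 0
        self.children = {}


def build_tree(sentences, max_len=3):
    """Insert every window of up to max_len consecutive words.

    The corpus is read once; for a fixed max_len the work grows
    linearly with the number of words.
    """
    root = Node()
    for words in sentences:
        for start in range(len(words)):
            node = root
            for word in words[start:start + max_len]:
                node = node.children.setdefault(word, Node())
                node.count += 1
    return root


def frequent_sequences(root, min_count=2):
    """Yield (sequence, count) for every path seen at least min_count times."""
    stack = [((), root)]
    while stack:
        prefix, node = stack.pop()
        for word, child in node.children.items():
            # Counts never increase along a path, so an infrequent
            # prefix cannot have a frequent extension: prune here.
            if child.count >= min_count:
                seq = prefix + (word,)
                yield seq, child.count
                stack.append((seq, child))


def cohesion(root, seq, seq_count):
    """Stand-in termhood (an assumption, not the paper's measure).

    Returns the share of the rarest word's occurrences that fall
    inside the sequence; 1.0 means that word never appears elsewhere.
    """
    rarest = min(root.children[word].count for word in seq)
    return seq_count / rarest


# Pre-tokenized toy corpus; a real pipeline would need a French tokenizer.
corpus = [
    ["la", "pomme", "de", "terre"],
    ["une", "pomme", "de", "terre"],
    ["la", "terre"],
]
tree = build_tree(corpus, max_len=3)
for seq, count in frequent_sequences(tree, min_count=2):
    print(" ".join(seq), count, cohesion(tree, seq, count))
```

Because every word position starts a window, the count stored on a first-level node equals the total frequency of that word, which is what lets `cohesion` read single-word frequencies directly from the tree instead of making a second pass over the corpus.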
