Abstract:
French is one of the working languages of the United Nations. Its complex grammar and part-of-speech rules make term extraction methods such as N-gram ineffective, which in turn reduces the accuracy of French text mining. This paper proposes an effective and efficient French term extraction method that extracts words and phrases from the French text corpora under analysis and provides a complete lexicon for French text mining. First, the word co-occurrence information of the corpora is compressed into an FP (Frequent Pattern) sequence tree so that frequent word sequences can be extracted rapidly; then the termhood of each frequent word sequence is calculated to obtain the term set. The FP sequence tree is a newly designed data structure that reduces the time complexity of word co-occurrence statistics to linear time. Experiments show that the proposed method achieves an accuracy of approximately 90% with a recall rate much higher than is typical, and thus has good potential for French text mining applications.
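
To make the first stage concrete, the following is a minimal sketch of counting word sequences in a prefix tree and reading off the frequent ones. It is not the paper's FP sequence tree, whose exact layout is not given in this abstract. The names `Node`, `build_sequence_tree`, `frequent_sequences`, `max_len`, and `min_count` are all hypothetical, and the termhood computation is left out because its formula is not stated here. Under these assumptions, each token is inserted into at most `max_len` paths, so building the tree takes time linear in the corpus size for a fixed `max_len`.

```python
class Node:
    """One tree node: how often the word sequence ending here occurred."""
    __slots__ = ("count", "children")

    def __init__(self):
        self.count = 0
        self.children = {}


def build_sequence_tree(sentences, max_len=4):
    """Insert every word sequence of up to max_len words into a prefix tree.

    Assumption: sentences is an iterable of already tokenized word lists.
    """
    root = Node()
    for words in sentences:
        for start in range(len(words)):
            node = root
            for word in words[start:start + max_len]:
                node = node.children.setdefault(word, Node())
                node.count += 1
    return root


def frequent_sequences(root, min_count=2):
    """Return {word tuple: count} for sequences seen at least min_count times.

    A sequence can never be more frequent than its prefix, so a subtree
    is skipped as soon as its root falls below min_count.
    """
    result = {}
    stack = [((), root)]
    while stack:
        prefix, node = stack.pop()
        for word, child in node.children.items():
            if child.count >= min_count:
                seq = prefix + (word,)
                result[seq] = child.count
                stack.append((seq, child))
    return result


# Toy corpus (hypothetical data, for illustration only).
corpus = [
    ["pomme", "de", "terre"],
    ["la", "pomme", "de", "terre"],
    ["de", "la"],
]
tree = build_sequence_tree(corpus)
candidates = frequent_sequences(tree, min_count=2)
```

In this sketch, `candidates` holds the frequent word sequences that a termhood score would then filter into the final term set.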