Research on Chinese-Tibetan Machine Translation Model Based on Improved Byte Pair Encoding

Abstract: To improve attention-based Chinese-Tibetan neural machine translation (NMT), this paper modifies the byte pair encoding (BPE) algorithm and proposes a Tibetan BPE algorithm with a threshold on the maximum character count. We collected one million Chinese-Tibetan sentence pairs and a dictionary of 200,000 Chinese-Tibetan personal and place names, and used them to train an attention-based Chinese-Tibetan NMT model. In testing and validation, the model achieves a bilingual evaluation understudy (BLEU) score of 36.84, and it translates named entities better than commercially deployed Chinese-Tibetan online translation systems. The model has also been deployed on a Chinese-Tibetan machine translation website, putting the Chinese-Tibetan NMT system into practical use.
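
As a rough illustration of the idea of a length threshold in BPE, the following minimal sketch learns standard BPE merges but skips any merge whose result would exceed a maximum length. It is not the algorithm of this paper, whose details are not given in the abstract. The names `learn_bpe`, `merge_pair`, and `max_len` are hypothetical, and measuring length in Unicode code points is an assumption; the actual threshold may instead count Tibetan syllables or another unit.

```python
from collections import Counter


def merge_pair(vocab, pair):
    """Replace every adjacent occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges, max_len):
    """Learn up to `num_merges` BPE merges from a {word: frequency} dict.

    Assumption: a merge is allowed only if the merged symbol has at most
    `max_len` Unicode code points. This stands in for the threshold that
    the abstract mentions but does not define.
    """
    # Each word starts as a tuple of single-character symbols.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                # Threshold check: ignore pairs whose merge would be too long.
                if len(a) + len(b) <= max_len:
                    pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # no admissible pair is left
        # Most frequent pair wins; ties are broken by the pair itself.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        vocab = merge_pair(vocab, best)
    return merges
```

Calling `learn_bpe(word_freqs, num_merges, max_len)` on a frequency dictionary of segmented Tibetan text returns the ordered merge list. A smaller `max_len` keeps subword units short, which is one plausible way to stop long, rare strings such as personal and place names from being fused into single vocabulary entries.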