Research on Named Entity Recognition in the Medical Domain with Global Information Augmentation

Abstract: In Chinese medical consultation texts, colloquial, irregular expressions and frequent specialized terminology make entities such as drug names difficult to recognize accurately. To exploit the important role of inter-word relations in Chinese sentences, this paper proposes a medical named entity recognition model that enhances global information. The model strengthens the word embedding representation with an attention mechanism, obtains contextual information through the sequence-processing capability of a bidirectional long short-term memory (BiLSTM) network, and on that basis enriches the global representation of the sentence in two ways at once. First, a graph convolutional network (GCN) layer is built from the additional inter-word dependencies given by syntactic relations, enriching the dependencies between words. Second, an auxiliary task is constructed to predict the category of the syntactic dependency relation between words. Experiments on a Chinese medical consultation dataset show that the model is highly competitive, reaching an F1 score of 94.54%. Compared with other models, it achieves clear improvements on entity categories such as drugs and symptoms. Experiments on the public Weibo dataset further indicate that the model is also applicable to the general domain.
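The graph convolution step over syntactic dependencies can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: an undirected adjacency matrix with self-loops, mean aggregation over neighbors, a single linear projection, and a ReLU activation. The function names, the toy parse in `heads`, and the `hidden` vectors (standing in for BiLSTM outputs) are all hypothetical. The sketch uses plain Python so that it runs without dependencies.

```python
def dependency_adjacency(heads):
    """Build a symmetric adjacency matrix with self-loops.

    heads[i] is the index of the syntactic head of token i, or -1 for the root.
    (Assumed input format; a real parser's output would need converting.)
    """
    n = len(heads)
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for child, head in enumerate(heads):
        if head >= 0:
            adj[child][head] = 1.0
            adj[head][child] = 1.0
    return adj


def gcn_layer(hidden, adj, weight, bias):
    """One graph convolution: average neighbor states, project, apply ReLU."""
    dim_in = len(hidden[0])
    dim_out = len(bias)
    out = []
    for row in adj:
        degree = sum(row)  # at least 1 because of the self-loop
        # Mean of the hidden states of the token and its syntactic neighbors.
        agg = [sum(a * h[k] for a, h in zip(row, hidden)) / degree
               for k in range(dim_in)]
        # Linear projection followed by ReLU.
        out.append([max(0.0, sum(agg[k] * weight[k][j] for k in range(dim_in)) + bias[j])
                    for j in range(dim_out)])
    return out


# Hypothetical 4-token sentence: token 1 is the root, tokens 0 and 2 attach
# to it, and token 3 attaches to token 2.
heads = [1, -1, 1, 2]
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]  # stand-in for BiLSTM outputs
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity projection, for readability only
bias = [0.0, 0.0]

adj = dependency_adjacency(heads)
out = gcn_layer(hidden, adj, weight, bias)
```

In this toy setup each output row mixes a token's own state with the states of the words it is syntactically linked to, which is the sense in which a GCN layer can add non-local, sentence-level information on top of sequential context. A trained model would learn `weight` and `bias` rather than fix them.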

     
