A Novel End-to-End Image Captioning Method Based on Multimodal Attention

Abstract: Existing image captioning methods suffer from several problems: the generated sentences are neither rich nor accurate, and the models are structurally complex and difficult to train. To address these issues, this paper proposes a novel end-to-end image captioning method based on a multimodal attention mechanism (M-AT). The method first employs a keyword image feature extraction model (K-IFE) to extract better spatial features and keyword features. It then uses a keyword attention model (K-AT) to focus on important description words and a spatial attention model (S-AT) to attend to the more important regions of the image while simplifying the model structure; the two attention mechanisms, K-AT and S-AT, can correct each other. The method thereby generates more accurate and richer image description sentences. Experimental results on the MSCOCO dataset show that the method is effective, with an improvement of around 2% on some evaluation metrics.
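The two attention branches described above can be illustrated with a minimal sketch: one attention distribution over spatial image regions and one over keyword embeddings, whose context vectors are concatenated into a multimodal context for the decoder. This is not the paper's implementation; the dot-product scoring, the function names, and all dimensions below are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(h, V, W):
    # Attend over R image regions (the role of S-AT, sketched here
    # as simple bilinear dot-product attention).
    scores = V @ (W @ h)            # (R,)
    alpha = softmax(scores)         # attention weights over regions
    return alpha @ V, alpha         # context vector (d,), weights (R,)

def keyword_attention(h, K, U):
    # Attend over T keyword embeddings (the role of K-AT).
    scores = K @ (U @ h)            # (T,)
    beta = softmax(scores)          # attention weights over keywords
    return beta @ K, beta           # context vector (d,), weights (T,)

rng = np.random.default_rng(0)
d, R, T = 8, 4, 5
h = rng.normal(size=d)              # decoder hidden state at one time step
V = rng.normal(size=(R, d))         # spatial features for R regions (from K-IFE, hypothetically)
K = rng.normal(size=(T, d))         # embeddings of T extracted keywords
W = rng.normal(size=(d, d))         # learned projection (assumed)
U = rng.normal(size=(d, d))         # learned projection (assumed)

c_spatial, alpha = spatial_attention(h, V, W)
c_keyword, beta = keyword_attention(h, K, U)
# Multimodal context fed to the word decoder at this step.
fused = np.concatenate([c_spatial, c_keyword])
print(fused.shape)  # (16,)
```

In a trained model, `W` and `U` would be learned parameters and the two attention distributions could regularize each other during training, which is the "mutual correction" idea the abstract mentions.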

     
