Abstract:
Existing image captioning methods suffer from two problems: the generated caption sentences are neither rich nor accurate, and the model architectures are complicated and difficult to train. We propose a novel end-to-end image captioning method based on a multimodal attention mechanism (M-AT). First, a keyword image feature extraction model (K-IFE) extracts improved spatial features and keyword features. A keyword attention mechanism (K-AT) then focuses on important descriptive words, while a spatial attention mechanism (S-AT) attends to the more salient regions of the image and simplifies the model structure. The two attention mechanisms, K-AT and S-AT, can correct each other, enabling the proposed method to generate more accurate and richer image descriptions. Experimental results on the MSCOCO dataset show that the proposed method is effective, yielding improvements of around 2% on some evaluation metrics.
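The interplay of the two attention branches can be sketched roughly as follows: a decoder state attends separately over spatial region features (S-AT) and keyword embeddings (K-AT), and the two context vectors are fused before word prediction. This is a minimal illustrative sketch only; the scaled dot-product scoring, additive fusion, and all dimensions are assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, features):
    """Weight each feature vector by its relevance to the query
    (scaled dot-product scoring, an illustrative assumption),
    and return the attention-weighted context vector."""
    scores = features @ query / np.sqrt(query.size)
    weights = softmax(scores)
    return weights @ features, weights

rng = np.random.default_rng(0)
h = rng.normal(size=8)                     # decoder hidden state (hypothetical)
spatial_feats = rng.normal(size=(49, 8))   # e.g. 7x7 grid of region features (S-AT input)
keyword_feats = rng.normal(size=(5, 8))    # embeddings of 5 detected keywords (K-AT input)

ctx_s, w_s = attend(h, spatial_feats)      # spatial attention (S-AT)
ctx_k, w_k = attend(h, keyword_feats)      # keyword attention (K-AT)
fused = ctx_s + ctx_k                      # fused multimodal context for the word decoder
print(fused.shape)
```

Because each branch produces a normalized distribution over its own modality, an over-confident focus in one branch can be tempered by the other at fusion time, which is one plausible reading of how K-AT and S-AT "correct each other".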