Adaptive Visual Memory Network for Visual Dialog

Abstract: The key challenge in visual dialog is visual co-reference resolution. This paper proposes an adaptive visual memory network (AVMN) that stores grounded visual information directly in an external memory bank. Doing so integrates the textual and visual grounding processes and thereby mitigates the errors that arise in each of them. Moreover, in many cases a question can be answered from the image alone, and the dialog history only introduces unnecessary errors. The model therefore reads the external visual memory adaptively and fuses the attended memory with residual visual information from the queried image. Experiments show that the proposed model outperforms recent approaches on all evaluation metrics.
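The adaptive read described in the abstract can be illustrated with a minimal sketch. It is not the paper's formulation, which the abstract does not give. It assumes dot-product attention over the memory slots, a scalar sigmoid gate computed from the question, and an additive residual; the function `adaptive_memory_read` and the parameters `gate_weights` and `gate_bias` are hypothetical names introduced only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def adaptive_memory_read(query, memory_keys, memory_values,
                         image_feature, gate_weights, gate_bias=0.0):
    """Illustrative adaptive read with residual fusion (all choices assumed).

    query         : question embedding
    memory_keys   : one key vector per stored dialog round
    memory_values : one grounded visual feature per stored round
    image_feature : question-attended image feature (the residual path)
    """
    # Attend over the memory slots with the current question.
    weights = softmax([dot(query, key) for key in memory_keys])
    dim = len(image_feature)
    attended = [sum(w * value[i] for w, value in zip(weights, memory_values))
                for i in range(dim)]

    # Scalar gate in (0, 1): how much of the history-based memory to use.
    gate = 1.0 / (1.0 + math.exp(-(dot(gate_weights, query) + gate_bias)))

    # Residual fusion: the image feature always passes through unchanged.
    fused = [gate * a + r for a, r in zip(attended, image_feature)]
    return fused, gate, weights


fused, gate, weights = adaptive_memory_read(
    query=[1.0, 0.0],
    memory_keys=[[1.0, 0.0], [0.0, 1.0]],
    memory_values=[[1.0, 0.0], [0.0, 1.0]],
    image_feature=[0.5, 0.5],
    gate_weights=[0.0, 0.0],
)
```

When the gate is close to zero, the output reduces to the image feature alone, which corresponds to the case where the question can be answered without the dialog history.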