Efficient hard-label adversarial attacks against text classification models

Efficient hard-label adversarial attacks against natural language processing models

Abstract: The need to assess the adversarial robustness of natural language processing (NLP) models in real-world applications has drawn increasing attention to black-box adversarial attacks under the hard-label setting, in which the attacker observes only the model's predicted label. However, text is discrete, the victim model gives limited feedback, and practical applications cap the number of queries. As a result, existing hard-label attack methods typically need many queries to the victim model and produce adversarial texts with low semantic consistency, which makes them unsuitable for real-world use. To address these problems, an efficient hard-label adversarial attack method is proposed. The method introduces an attention mechanism in the adversarial-text initialization stage. In the semantic optimization stage, it adds two strategies: semantic clustering-based synonym search and semantic gradient-based dynamic expansion synonym search. Experimental results show that the proposed method generates high-quality adversarial texts with high semantic consistency and natural fluency while using only a small number of queries.
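The abstract names the two stages but does not specify them, so what follows is a minimal, generic sketch of the hard-label setting and is not the authors' method. It assumes the following. The victim exposes only its top-1 label, and every call counts as one query. Stage 1 substitutes synonyms until the label flips, with a random position order standing in for the attention mechanism. Stage 2 restores original words while the label stays flipped, using "fewer changed words" as a crude proxy for semantic consistency in place of the semantic clustering and semantic gradient strategies. The toy classifier `toy_sentiment`, the synonym table, and every function name are hypothetical.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple


class HardLabelOracle:
    """Exposes only the victim's top-1 label and counts every query."""

    def __init__(self, classify: Callable[[List[str]], int]) -> None:
        self._classify = classify
        self.queries = 0

    def __call__(self, words: List[str]) -> int:
        self.queries += 1
        return self._classify(words)


def toy_sentiment(words: List[str]) -> int:
    """Hypothetical victim: 1 (positive) if positive cue words outnumber negative ones."""
    pos, neg = {"good", "great", "excellent"}, {"bad", "poor", "dull"}
    score = sum(w in pos for w in words) - sum(w in neg for w in words)
    return 1 if score > 0 else 0


def initialize(words: List[str], synonyms: Dict[str, List[str]],
               oracle: HardLabelOracle, orig_label: int,
               rng: random.Random) -> Optional[List[str]]:
    """Stage 1: substitute synonyms one position at a time until the label flips.

    A random position order stands in for attention-guided position selection.
    """
    adv = list(words)
    positions = [i for i, w in enumerate(words) if w in synonyms]
    rng.shuffle(positions)
    for i in positions:
        adv[i] = rng.choice(synonyms[words[i]])
        if oracle(adv) != orig_label:
            return adv
    return None  # every candidate position substituted, label never flipped


def optimize(words: List[str], adv: List[str], oracle: HardLabelOracle,
             orig_label: int) -> List[str]:
    """Stage 2: restore original words wherever the label stays flipped.

    Fewer substituted words serves as a crude proxy for semantic consistency.
    """
    adv = list(adv)
    changed = [i for i, (a, b) in enumerate(zip(words, adv)) if a != b]
    for i in changed:
        trial = list(adv)
        trial[i] = words[i]
        if oracle(trial) != orig_label:  # still adversarial without this change
            adv = trial
    return adv


def hard_label_attack(words: List[str], synonyms: Dict[str, List[str]],
                      victim: Callable[[List[str]], int],
                      seed: int = 0) -> Tuple[Optional[List[str]], int]:
    """Runs both stages; returns (adversarial words or None, total queries)."""
    oracle = HardLabelOracle(victim)
    orig_label = oracle(words)
    rng = random.Random(seed)
    adv = initialize(words, synonyms, oracle, orig_label, rng)
    if adv is None:
        return None, oracle.queries
    return optimize(words, adv, oracle, orig_label), oracle.queries


if __name__ == "__main__":
    sentence = "a great movie with good acting".split()
    synonyms = {"great": ["huge", "big"], "good": ["decent", "fine"],
                "movie": ["film"], "acting": ["performance"]}
    adv, queries = hard_label_attack(sentence, synonyms, toy_sentiment)
    print(" ".join(adv) if adv else "attack failed", "| queries:", queries)
```

With this toy victim, the label flips only once both cue words (`great` and `good`) are replaced. If stage 1 also happened to substitute `movie` or `acting`, stage 2 restores them, so the final text changes exactly two words. Which synonyms appear depends on the seed. The query counter makes the cost of each stage visible, and that cost is what query-efficient methods aim to reduce.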
