Abstract:
Visual object tracking (VOT), a crucial downstream task in computer vision, has consistently attracted significant attention due to its widespread applications. In recent years, adversarial attack methods for VOT have emerged that disrupt tracker predictions by injecting adversarial perturbations into the input data. However, corresponding adversarial defense approaches remain scarce and suffer from multiple limitations: inadequate defense performance against adaptive attacks, excessive computational overhead introduced by preprocessing modules, and poor transferability across heterogeneous trackers. To address these challenges, this paper first proposes a feature regularization loss, motivated by the empirical observation that adversarial features and clean features diverge across different convolutional scales, with the goal of aligning the two in feature space. Second, considering the dual-image (template and search region) input characteristic of visual tracking, an adversarial training framework tailored to visual trackers is designed. By leveraging the feature regularization loss, this framework guides the network to learn robust feature representations, thereby enhancing the adversarial robustness of the tracker. Finally, comparative experiments on public benchmarks demonstrate that our method achieves state-of-the-art performance under adaptive attacks while incurring only limited accuracy degradation on clean samples. Notably, the proposed approach exhibits superior transferability across heterogeneous tracking architectures compared with existing defense methods.
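The multi-scale feature alignment idea described above can be sketched roughly as follows. This is an illustrative sketch only: the function name, the mean-squared distance, and the uniform per-scale weighting are assumptions for exposition, not the paper's exact formulation, and the per-scale feature maps are represented as flat lists of floats for simplicity.

```python
def feature_regularization_loss(clean_feats, adv_feats, weights=None):
    """Hypothetical multi-scale feature-alignment loss (illustrative sketch).

    clean_feats / adv_feats: lists of per-scale feature maps, one entry per
    convolutional scale, each given here as a flat list of floats.
    """
    if weights is None:
        # Assumption: weight all convolutional scales equally.
        weights = [1.0] * len(clean_feats)
    total = 0.0
    for w, f_clean, f_adv in zip(weights, clean_feats, adv_feats):
        # Mean-squared distance between clean and adversarial features
        # at this scale; minimizing it pulls the two representations together.
        mse = sum((c - a) ** 2 for c, a in zip(f_clean, f_adv)) / len(f_clean)
        total += w * mse
    # Average over scales so the magnitude is comparable across backbones
    # with different depths.
    return total / len(clean_feats)
```

In an adversarial training loop, a term like this would be added to the tracker's task loss so that gradients push adversarial features toward their clean counterparts at every scale.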