Abstract:
Existing medical image semantic segmentation methods suffer from high computational complexity, large parameter counts, and suboptimal accuracy, and they are difficult to deploy on low-resource and clinical edge devices. To address these issues, we propose AViT-UNet, a lightweight vision transformer U-Net model that incorporates multiple attention mechanisms to reduce model size and latency while maintaining competitive segmentation performance. First, a lightweight convolutional module, the lightweight dilated bottleneck (LDB), is designed and applied to the convolutional modules of the encoder and decoder layers, which significantly reduces the computational complexity of the model. Second, a self-attention module, efficient multi-head attention (EMHA), is introduced in the deep layers of the network and in the bottleneck layer to improve segmentation accuracy. Finally, to improve the fidelity of skip connections and feature fusion, the network integrates channel and spatial attention mechanisms that strengthen the residual pathways and deepen the convolutional representations, yielding more precise segmentation outputs. This combination mitigates both the high computational demands of transformer-based models and the limited global receptive field of conventional convolutional neural networks. As a result, the proposed lightweight architecture achieves competitive semantic segmentation accuracy while remaining suitable for deployment on resource-constrained medical devices and mobile platforms. The method is validated on three publicly available medical image semantic segmentation benchmark datasets, Synapse, GlaS, and MoNuSeg, using multiple evaluation metrics. Experimental results indicate that the method is both effective and practical. The implementation code is available at
https://github.com/shepherdxu/AViT-UNet.
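The abstract does not name the evaluation metrics it uses. As an illustration only, the sketch below computes two overlap scores that are commonly reported for segmentation benchmarks of this kind, the Dice coefficient and intersection over union (IoU), on binary masks. The function name `dice_and_iou` and the toy masks are hypothetical and are not taken from the AViT-UNet repository.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as nested lists of 0/1.

    Hypothetical helper for illustration; not part of the AViT-UNet code base.
    """
    # Flatten both masks and coerce every pixel to 0 or 1.
    flat_pred = [int(bool(v)) for row in pred for v in row]
    flat_target = [int(bool(v)) for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must contain the same number of pixels")

    # Pixels that are foreground in both masks.
    intersection = sum(p & t for p, t in zip(flat_pred, flat_target))
    pred_area = sum(flat_pred)
    target_area = sum(flat_target)
    union = pred_area + target_area - intersection

    # Assumed convention: two empty masks count as perfect agreement.
    if union == 0:
        return 1.0, 1.0

    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy 2x3 masks: 3 predicted pixels, 3 ground-truth pixels, 2 shared.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)
print(f"Dice = {dice:.3f}, IoU = {iou:.3f}")
```

For multi-class datasets such as Synapse, a score of this kind is typically computed per class and then averaged; the exact protocol used in the paper should be taken from its experimental section, not from this sketch.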