Abstract:
With the rapid development of the Internet of Things, processing multi-modal data from diverse information collection devices, especially multi-sensory information such as visual signals, auditory signals, and text, is crucial for machine learning applications. The outstanding performance of the Transformer architecture and its derived large models in natural language processing and computer vision has driven the pursuit of complex multi-modal data processing capabilities. However, this also raises the challenges of safeguarding data privacy and meeting personalized needs. To address these challenges, this paper proposes a personalized federated learning method based on a multi-modal Transformer, which supports federated learning over heterogeneous data modalities and aligns training with each participant's personalized objective while protecting participants' data privacy. The proposed method significantly improves the performance of the multi-modal personalized model: its accuracy increases by 15% compared with the baseline method, marking a breakthrough in overcoming the application-scenario limitations of multi-modal personalized federated learning.