Date: 2023-10-08
Title: Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model

Authors: Joanna Hong, Se Jin Park, and Yong Man Ro

We present a novel approach to multilingual audio-visual speech recognition by introducing a single model trained on a multilingual dataset. Motivated by the human cognitive system, in which humans intuitively distinguish different languages without conscious effort or guidance, the proposed model identifies the language of the input speech by distinguishing the inherent similarities and differences between languages. To do so, we incorporate prompt fine-tuning into a large-scale pre-trained audio-visual representation model in order to provide language information, covering both the language label and its nuances. The network can thus predict the correct speech in the correct language. To verify the effectiveness of the proposed model, we conduct experiments on a multilingual audio-visual corpus, MuAViC, containing 9 languages. Our work contributes to developing more robust and efficient multilingual audio-visual speech recognition systems, reducing the need for language-specific models.