Subject [IEEE TMM] CroMM-VSR: Cross-Modal Memory Augmented Visual Speech Recognition (by Minsu Kim) is accepted in IEEE Transactions on Multimedia
Date 2021-09-15
Title: CroMM-VSR: Cross-Modal Memory Augmented Visual Speech Recognition
Authors: Minsu Kim, Joanna Hong, Se Jin Park, Yong Man Ro

Visual Speech Recognition (VSR) is the task of transcribing speech into text from the external appearance of the face (i.e., the lips). Because visual lip movements alone do not carry enough information to fully represent speech, VSR is considered a challenging problem. One way to address this is to additionally use audio, which contains rich information for speech recognition. However, audio is not always available, for example in long-distance or crowded situations. It is therefore necessary to find a way to provide enough information for speech recognition from visual inputs only. In this paper, we alleviate the information insufficiency of visual lip movements by proposing a cross-modal memory augmented VSR framework with a Visual-Audio Memory (VAM). The proposed framework utilizes the complementary information of audio even when audio inputs are not provided at inference time. Concretely, the proposed VAM learns to imprint short clip-level audio features into a memory network using the corresponding visual features. To this end, the VAM contains two memories: a lip-video key memory and an audio value memory. The audio value memory is guided to imprint the audio features, and the lip-video key memory is guided to memorize the location of the imprinted audio. In this way, the VAM can exploit rich audio information by accessing the memory with visual inputs only. The proposed VSR framework can thus refine its predictions with the imprinted audio information at inference time, when audio inputs are not provided. We validate the proposed method on popular benchmark databases: LRW, LRW-1000, GRID, and LRS2. Experimental results show that the proposed method achieves state-of-the-art performance on both word-level and sentence-level visual speech recognition. In addition, we verify that the learned representations inside the VAM contain meaningful information for VSR by examining and visualizing them.
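
The key-value read described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the addressing function, so cosine similarity followed by a softmax is assumed here, and the names `read_audio_from_memory`, `key_memory`, `value_memory`, and `scale` are hypothetical. The sketch covers only the inference-time read (a visual feature addresses the lip-video key memory, and the resulting weights recall a vector from the audio value memory); how the memories are trained to imprint audio is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def softmax(scores, scale=1.0):
    """Numerically stable softmax; `scale` is an assumed sharpening factor."""
    top = max(scores)
    exps = [math.exp(scale * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_audio_from_memory(visual_feature, key_memory, value_memory, scale=1.0):
    """Address the key memory with a visual feature, then read the value memory.

    key_memory:   list of slots, each a vector in the visual feature space
    value_memory: list of slots (same count), each a vector in the audio space
    Returns the recalled audio-like vector and the addressing weights.
    """
    # Assumption: similarity-based addressing over the key slots.
    weights = softmax([cosine(visual_feature, k) for k in key_memory], scale)
    dim = len(value_memory[0])
    # Weighted sum of the value slots using the key-derived weights.
    recalled = [
        sum(w * slot[d] for w, slot in zip(weights, value_memory))
        for d in range(dim)
    ]
    return recalled, weights


# Toy memories with three slots (values chosen only for illustration).
key_memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
value_memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

recalled, weights = read_audio_from_memory(
    [0.9, 0.1], key_memory, value_memory, scale=10.0
)
```

In this toy case the visual query is closest to the first key slot, so the recalled vector is dominated by the first value slot. In a full system the recalled vector would be combined with the visual feature before prediction; that step is left out because the abstract does not specify it.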