Notice

Subject [IEEE TASLP] "Speech Reconstruction with Reminiscent Sound via Visual Voice Memory" (by Joanna Hong) has been accepted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
Date 2021-10-29
Title: Speech Reconstruction with Reminiscent Sound via Visual Voice Memory
Authors: Joanna Hong, Minsu Kim, Se Jin Park, Yong Man Ro


The goal of this work is to reconstruct speech from silent video in both speaker-dependent and speaker-independent settings. Unlike previous works, which have mostly been restricted to a speaker-dependent setting, we propose the Visual Voice memory, which restores essential auditory information so that proper speech can be generated for different speakers, including unseen ones. During training, the proposed memory takes additional auditory information that corresponds to the input face movements and stores auditory contexts that can later be recalled from the input visual features. Specifically, the Visual Voice memory contains value and key memory slots: the value memory slots save the audio features, and the key memory slots store the visual features at the same locations as the saved audio features. By guiding each memory to store its features properly, the model can produce speech with the help of auxiliary audio information. Hence, our method uses both video and audio information during training but does not require any auditory input at inference time. Our key contributions are: (1) proposing the Visual Voice memory, which supplies rich audio information that complements the visual features and thus produces high-quality speech from silent video, and (2) enabling multi-speaker and speaker-independent training by memorizing auditory features together with the corresponding visual features. We validate the proposed framework on the GRID and Lip2Wav datasets and show that our method surpasses the performance of previous works. Moreover, we run experiments in both multi-speaker and speaker-independent settings and verify the effectiveness of the Visual Voice memory. We also demonstrate that the Visual Voice memory contains meaningful information for reconstructing speech.
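
To make the key/value recall described above more concrete, here is a minimal sketch of a key-value memory lookup in plain Python. It is an illustration only and not the authors' implementation: the function names (`recall_audio`, `cosine`, `softmax`), the use of cosine similarity with a temperature-scaled softmax for addressing, and the toy slot values are all assumptions made for this example. The only idea taken from the abstract is that key slots hold visual features, value slots hold audio features at the same locations, and a visual query alone is enough to recall audio at inference time.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def recall_audio(visual_query, key_slots, value_slots, temperature=0.1):
    """Recall an audio feature from a visual query (hypothetical helper).

    key_slots[i] is a stored visual feature and value_slots[i] is the audio
    feature saved at the same location i. The addressing rule (cosine
    similarity + softmax) is an assumption, not taken from the paper.
    """
    scores = [cosine(visual_query, key) for key in key_slots]
    weights = softmax(scores, temperature)
    dim = len(value_slots[0])
    # Weighted sum of the value slots, one output dimension at a time.
    recalled = [
        sum(w * value[d] for w, value in zip(weights, value_slots))
        for d in range(dim)
    ]
    return recalled, weights


# Toy memory: three slots, 2-D visual keys, 3-D audio values (made-up numbers).
key_slots = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
value_slots = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# At inference only the visual feature is available; audio is recalled from memory.
recalled, weights = recall_audio([0.9, 0.1], key_slots, value_slots)
```

In this toy setup the query is closest to the first key, so nearly all of the addressing weight falls on the first slot and the recalled vector is close to the first stored audio feature. In a trained model the slots would be learned parameters and the features would come from visual and audio encoders, none of which is modeled here.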