KAIST Image and Video Systems lab

Notice

Subject	[Interspeech 2023] "Intelligible Lip-to-speech Synthesis with Speech Units" (by Jeongsoo Choi and Minsu Kim) has been accepted to Interspeech 2023
Name	Administrator
Date	2023-05-19
Title: Intelligible Lip-to-speech Synthesis with Speech Units

Authors: Jeongsoo Choi, Minsu Kim, and Yong Man Ro

In this paper, we propose a novel Lip-to-Speech synthesis (L2S) framework for synthesizing intelligible speech from a silent lip movement video. Specifically, to complement the insufficient supervisory signal of previous L2S models, we propose to use quantized self-supervised speech representations, named speech units, as an additional prediction target of the proposed L2S model. The proposed L2S model is therefore trained to generate multiple targets: a mel-spectrogram and speech units. Because the speech units are discrete representations while the mel-spectrogram is continuous, the proposed multi-target L2S model can be trained with strong content supervision, even without using text-labeled data. Moreover, to accurately convert the synthesized mel-spectrogram into a waveform, we introduce a multi-input vocoder that can generate a clear waveform even from a blurry and noisy mel-spectrogram by referring to the speech units. Evaluation results confirm the effectiveness of the proposed method.
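
The multi-target idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the nearest-centroid quantizer, the L1 and cross-entropy loss choices, and the names quantize_to_units, multi_target_loss, and unit_weight are all assumptions made for illustration. The sketch only shows how a continuous target (mel-spectrogram) and a discrete target (speech units) could be combined into one training objective.

```python
import math

def quantize_to_units(features, centroids):
    """Map each feature frame to the index of its nearest centroid.

    Assumption: speech units are obtained by nearest-centroid quantization
    of self-supervised features; the centroids here are hypothetical.
    """
    units = []
    for frame in features:
        dists = [sum((f - c) ** 2 for f, c in zip(frame, centroid))
                 for centroid in centroids]
        units.append(dists.index(min(dists)))
    return units

def l1_loss(pred_mel, target_mel):
    """Mean absolute error over all frames and mel bins (continuous target)."""
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred_mel, target_mel):
        for p, t in zip(p_frame, t_frame):
            total += abs(p - t)
            count += 1
    return total / count

def cross_entropy(unit_logits, target_units):
    """Mean negative log-likelihood of the target unit (discrete target)."""
    total = 0.0
    for logits, target in zip(unit_logits, target_units):
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[target]
    return total / len(target_units)

def multi_target_loss(pred_mel, target_mel, unit_logits, target_units,
                      unit_weight=1.0):
    """Combine both objectives; unit_weight is a hypothetical balancing factor."""
    return (l1_loss(pred_mel, target_mel)
            + unit_weight * cross_entropy(unit_logits, target_units))

# Toy usage: two frames of 2-dimensional features and two centroids.
target_units = quantize_to_units([[0.1, 0.0], [0.9, 1.1]],
                                 [[0.0, 0.0], [1.0, 1.0]])
loss = multi_target_loss(
    pred_mel=[[1.0, 2.0], [0.0, 0.0]],
    target_mel=[[1.5, 2.0], [0.0, 0.5]],
    unit_logits=[[2.0, 0.0], [0.0, 2.0]],
    target_units=target_units,
)
```

In this sketch the cross-entropy term gives a classification-style signal about spoken content for each frame, while the L1 term keeps the predicted mel-spectrogram close to the reference; how the two terms are actually weighted in the paper is not stated in this notice.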