mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition
Title: | mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition |
---|---|
Authors: | Rouditchenko, Andrew; Thomas, Samuel; Kuehne, Hilde; Feris, Rogerio; Glass, James |
Publication Year: | 2025 |
Collection: | Computer Science |
Subject Terms: | Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Sound |
More Details: | Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can improve performance in noise, but most methods are trained only on English data. One limitation is the lack of large-scale multilingual video data, which makes it hard to train models from scratch. In this work, we propose mWhisper-Flamingo for multilingual AVSR, which combines the strengths of a pre-trained audio model (Whisper) and a pre-trained video model (AV-HuBERT). To enable better multi-modal integration and improve noisy multilingual performance, we introduce decoder modality dropout, where the model is trained both on paired audio-visual inputs and on separate audio-only and visual-only inputs. mWhisper-Flamingo achieves state-of-the-art WER on MuAViC, an AVSR dataset of 9 languages. Audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper on all languages in noisy conditions. |
Document Type: | Working Paper |
Access URL: | http://arxiv.org/abs/2502.01547 |
Accession Number: | edsarx.2502.01547 |
Database: | arXiv |
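
The abstract describes decoder modality dropout only at a high level: training on paired audio-visual inputs as well as on audio-only and visual-only inputs. The following is a minimal sketch of that idea, not the authors' implementation. The function names `sample_modality` and `apply_modality_dropout` and the probabilities `p_av` and `p_audio` are hypothetical and chosen for illustration; the record does not state the actual values or how the model handles a missing modality.

```python
import random


def sample_modality(rng, p_av=0.5, p_audio=0.25):
    """Pick which inputs the decoder sees for one training step.

    The probabilities are illustrative assumptions, not values from the paper.
    The remaining mass (1 - p_av - p_audio) goes to the video-only case.
    """
    r = rng.random()
    if r < p_av:
        return "av"
    if r < p_av + p_audio:
        return "audio"
    return "video"


def apply_modality_dropout(audio_feats, video_feats, mode):
    """Return the (audio, video) pair with the dropped modality set to None."""
    if mode == "audio":
        return audio_feats, None
    if mode == "video":
        return None, video_feats
    return audio_feats, video_feats


# Hypothetical usage: placeholder feature lists stand in for encoder outputs.
rng = random.Random(0)
mode = sample_modality(rng)
audio_in, video_in = apply_modality_dropout([0.1, 0.2], [0.3, 0.4], mode)
```

In a real training loop, the `None` entry would typically be replaced by whatever the model uses to represent an absent stream (for example, skipping the corresponding cross-attention), which is a design choice this sketch leaves open.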