
Saturday May 13, 2023
CVPR 2023 - AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR
In this episode, we discuss AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR by Paul Hongsuck Seo, Arsha Nagrani, and Cordelia Schmid. The paper proposes AVFormer, a method for augmenting audio-only models with visual information for audiovisual automatic speech recognition (AV-ASR). It injects visual embeddings into a frozen ASR model using lightweight trainable adaptors, which can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters. The authors also introduce a simple curriculum scheme during training, which they show is crucial for the model to process audio and visual information jointly. The proposed model achieves state-of-the-art zero-shot results on three AV-ASR benchmarks while preserving decent performance on traditional audio-only speech recognition benchmarks.
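For listeners who want a concrete picture of the adaptor idea, here is a minimal sketch in plain Python. It is not the authors' implementation: the names `BottleneckAdapter`, `frozen_layer`, `inject_visual`, and `forward` are hypothetical, the "frozen layer" is a toy stand-in for a real ASR encoder layer, and the choices to prepend visual tokens, place the adaptor after the layer, and zero-initialise its up-projection are assumptions made for illustration.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def make_matrix(rows, cols, scale, rng):
    """Small random matrix; stands in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class BottleneckAdapter:
    """Trainable residual adaptor (assumed form): x + up(relu(down(x)))."""
    def __init__(self, dim, bottleneck, rng):
        self.down = make_matrix(bottleneck, dim, 0.1, rng)
        # Assumption: zero-initialised up-projection, so the adaptor starts
        # as the identity and the frozen model's behaviour is unchanged.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, v) for v in matvec(self.down, x)]
        return [xi + ui for xi, ui in zip(x, matvec(self.up, hidden))]

def frozen_layer(tokens):
    """Toy stand-in for one frozen ASR encoder layer: mixes each token with the mean."""
    dim = len(tokens[0])
    mean = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
    return [[0.5 * t[i] + 0.5 * mean[i] for i in range(dim)] for t in tokens]

def inject_visual(audio_tokens, visual_features, proj):
    """Project visual features to the audio token size and prepend them."""
    visual_tokens = [matvec(proj, v) for v in visual_features]
    return visual_tokens + audio_tokens

def forward(audio_tokens, visual_features, proj, adapter):
    """Frozen layer over the joint sequence, followed by the trainable adaptor."""
    tokens = frozen_layer(inject_visual(audio_tokens, visual_features, proj))
    return [adapter(t) for t in tokens]
```

In this sketch only `proj` and the adaptor weights would be trained; `frozen_layer` has no trainable state, which mirrors the idea of leaving the speech model untouched.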