
Saturday May 06, 2023
CVPR 2023 - Vid2Seq: Large-Scale Pretraining of a Visual Language Model
In this episode we discuss "Vid2Seq: Large-Scale Pretraining of a Visual Language Model" by Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, and Cordelia Schmid.

Affiliations:
- Antoine Yang and Ivan Laptev: Inria Paris and Département d'informatique de l'ENS, CNRS, PSL Research University
- Arsha Nagrani, Paul Hongsuck Seo, Jordi Pont-Tuset, and Cordelia Schmid: Google Research
- Antoine Miech: DeepMind
- Josef Sivic: Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague

The paper presents Vid2Seq, a single-stage dense event captioning model that takes multi-modal input and is pretrained on narrated videos. The model augments a language model with time tokens so that event boundaries and textual descriptions are predicted within the same output sequence. To leverage unlabeled narrated videos for pretraining, the authors reformulate the sentence boundaries of transcribed speech as pseudo event boundaries and use the transcribed sentences as pseudo event captions. Vid2Seq improves the state of the art on several dense video captioning benchmarks and generalizes well to video paragraph and clip captioning tasks, including in few-shot settings.
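As a rough illustration of the time-token idea summarized above (not the authors' code), here is a minimal Python sketch of how event boundaries and captions could be serialized into a single output sequence, assuming timestamps are quantized into a fixed number of discrete time bins; the token format, bin count, and helper names are hypothetical.

# Minimal sketch (hypothetical, not the paper's implementation) of serializing
# dense event captions into one sequence that interleaves time tokens and text.

N_BINS = 100  # assumed number of discrete time bins per video

def time_token(t_seconds: float, duration: float) -> str:
    """Map a timestamp to a discrete time token such as '<time_37>'."""
    bin_idx = min(int(t_seconds / duration * N_BINS), N_BINS - 1)
    return f"<time_{bin_idx}>"

def serialize_events(events, duration: float) -> str:
    """Interleave start/end time tokens with captions in a single target sequence."""
    parts = []
    for start, end, caption in sorted(events):
        parts += [time_token(start, duration), time_token(end, duration), caption]
    return " ".join(parts)

# Example: two pseudo events derived from transcribed speech sentences
events = [(2.0, 7.5, "a person opens the fridge"),
          (8.0, 15.0, "they pour milk into a glass")]
print(serialize_events(events, duration=20.0))
# -> '<time_10> <time_37> a person opens the fridge <time_40> <time_75> they pour milk into a glass'

The same serialization can be applied to transcribed speech during pretraining (sentence boundaries as pseudo event boundaries) and to ground-truth annotations during fine-tuning, so a single sequence-to-sequence model handles both localization and captioning.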