Sunday Jul 23, 2023
CVPR 2023 - MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering
In this episode we discuss MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering by Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yi Yang, Mike Zheng Shou. The paper introduces a model called MIST for long-form VideoQA, which addresses challenges like multi-event reasoning, interactions among visual concepts, and causality reasoning. MIST decomposes spatial-temporal self-attention, handles different granularities of visual concepts, and performs iterative selection and attention across layers. Experimental results demonstrate that MIST achieves state-of-the-art performance while being computationally efficient and interpretable.
Comments (0)
To leave or reply to comments, please download free Podbean or
No Comments
To leave or reply to comments,
please download free Podbean App.