Sunday Jul 23, 2023

CVPR 2023 - MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

In this episode, we discuss "MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering" by Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yi Yang, and Mike Zheng Shou. The paper introduces MIST, a model for long-form VideoQA that targets three challenges of the task: reasoning over multiple events, modeling interactions among visual concepts, and reasoning about causality. Rather than applying dense spatial-temporal self-attention to the whole video, MIST decomposes it, handles visual concepts at different granularities, and repeats selection and attention iteratively across layers. The experiments reported in the paper show MIST achieving state-of-the-art performance while remaining computationally efficient and interpretable.
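To make the "iterative selection and attention" idea concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the authors' implementation: the function names (`top_k`, `attend`, `select_then_attend`), the dot-product relevance score, the mean-pooled segment summaries, and the residual query update are all hypothetical choices made for this example. The only point it shows is that each layer first picks a few question-relevant segments and regions, then attends over just those tokens instead of the full video.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def top_k(items, query, k):
    """Indices of the k items most similar to the query, in original order."""
    ranked = sorted(range(len(items)),
                    key=lambda i: dot(items[i], query), reverse=True)
    return sorted(ranked[:k])


def attend(query, tokens):
    """Scaled dot-product attention of one query over the given tokens."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, t) / scale for t in tokens])
    return [sum(w * t[d] for w, t in zip(weights, tokens))
            for d in range(len(query))]


def select_then_attend(video, question, k_segments, k_regions, num_layers):
    """video: list of segments; each segment is a list of region vectors.

    Assumed scheme (hypothetical): select segments, then regions inside
    them, attend over the selected tokens, update the query, and repeat.
    """
    query = list(question)
    trace = []  # which segments each layer selected
    for _ in range(num_layers):
        summaries = [mean_pool(segment) for segment in video]
        segment_ids = top_k(summaries, query, k_segments)
        tokens = []
        for s in segment_ids:
            for r in top_k(video[s], query, k_regions):
                tokens.append(video[s][r])
        context = attend(query, tokens)
        query = [q + c for q, c in zip(query, context)]  # residual update
        trace.append(segment_ids)
    return query, trace
```

Calling `select_then_attend(video, question, k_segments=1, k_regions=1, num_layers=2)` on a small list of segment features returns the updated query vector and a per-layer record of the selected segments. That record is the kind of trace that makes a select-then-attend design easy to inspect, and attending over only the selected tokens is what keeps the cost below that of dense attention over every region in every frame.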
