Sunday Jul 23, 2023

CVPR 2023 - MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

In this episode we discuss MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering by Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yi Yang, Mike Zheng Shou. The paper introduces a model called MIST for long-form VideoQA, which addresses challenges like multi-event reasoning, interactions among visual concepts, and causality reasoning. MIST decomposes spatial-temporal self-attention, handles different granularities of visual concepts, and performs iterative selection and attention across layers. Experimental results demonstrate that MIST achieves state-of-the-art performance while being computationally efficient and interpretable.

Comments (0)

To leave or reply to comments, please download free Podbean or

No Comments

Copyright 2023 All rights reserved.

Podcast Powered By Podbean

Version: 20241125