Wednesday Jun 18, 2025

arXiv paper - Token-Efficient Long Video Understanding for Multimodal LLMs

In this episode, we discuss Token-Efficient Long Video Understanding for Multimodal LLMs by Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon. The paper introduces STORM, a new architecture that incorporates a temporal encoder using the Mamba State Space Model to better capture temporal dynamics in video-based multimodal large language models. This approach enables effective token reduction, significantly lowering computational costs and latency while preserving essential temporal information. Experiments demonstrate that STORM achieves state-of-the-art performance on long video understanding benchmarks with substantial improvements in efficiency and accuracy.
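To make the token-reduction idea concrete, here is a minimal sketch of how merging frames along the time axis shrinks the number of visual tokens passed to the language model. It is an illustration only, not the paper's implementation: the averaging scheme, the function names `temporal_avg_pool` and `token_count`, the window size, and the toy shapes are all assumptions, and the Mamba temporal encoder that would run before this step is not modeled.

```python
from typing import List

# One frame is a list of tokens; each token is a list of feature values.
Frame = List[List[float]]


def temporal_avg_pool(frames: List[Frame], window: int) -> List[Frame]:
    """Average every `window` consecutive frames, token by token.

    Hypothetical stand-in for a token-reduction step; not taken from the paper.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    pooled: List[Frame] = []
    for start in range(0, len(frames), window):
        group = frames[start:start + window]  # last group may be shorter
        n_tokens = len(group[0])
        dim = len(group[0][0])
        merged = [
            [sum(f[t][d] for f in group) / len(group) for d in range(dim)]
            for t in range(n_tokens)
        ]
        pooled.append(merged)
    return pooled


def token_count(frames: List[Frame]) -> int:
    """Total number of visual tokens across all frames."""
    return sum(len(f) for f in frames)


# Toy video (assumed sizes): 8 frames, 4 tokens per frame, 2 features per token.
# Every token in frame i carries the value i, so the averages are easy to check.
video = [[[float(i)] * 2 for _ in range(4)] for i in range(8)]
reduced = temporal_avg_pool(video, window=4)

print(token_count(video), "->", token_count(reduced))  # prints "32 -> 8"
```

With a window of 4, the toy sequence drops from 32 tokens to 8. In the setup the episode describes, the temporal encoder would first mix information across frames, which is what lets a reduced token set still carry temporal context.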
