Sunday Aug 27, 2023

arXiv Preprint - EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

In this episode, we discuss EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding by Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. The paper presents EgoSchema, a benchmark dataset and evaluation metric for assessing the long-form video language understanding capabilities of vision and language systems. The dataset consists of over 5000 multiple-choice question-answer pairs based on 250 hours of real video data; each question requires selecting the correct answer from five options based on a three-minute video clip. The authors argue that existing video understanding datasets lack long temporal structure, and they show that state-of-the-art video and language models struggle with long-term video understanding.
