Sunday Jul 30, 2023

NeurIPS 2022 - Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

In this episode we discuss "Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners" by Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, Shih-Fu Chang, Mohit Bansal, and Heng Ji. The paper introduces VidIL, a few-shot video-language learner that combines image-language models with a language model so it can generalize to different video-to-text tasks from only a few examples. VidIL first translates the video content into frame captions and object, attribute, and event phrases, then composes them into a temporal-aware template. The language model is prompted with that template, together with a few in-context examples, to generate the target output. Experimental results show that VidIL outperforms supervised models on video future event prediction.
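To make the prompting step described above more concrete, here is a minimal sketch in Python of how per-frame descriptors could be arranged into a temporally ordered text prompt with a few in-context examples. It is an illustration under stated assumptions, not the authors' implementation: the names `FrameDescriptors`, `temporal_markers`, `render_video`, and `build_prompt`, the "First / Then / Finally" markers, and the "Output:" label are all hypothetical choices made for this example. The descriptors themselves are assumed to come from an image-language model that is not shown here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FrameDescriptors:
    """Text descriptors for one sampled frame (hypothetical container)."""
    caption: str
    objects: List[str] = field(default_factory=list)
    attributes: List[str] = field(default_factory=list)
    events: List[str] = field(default_factory=list)


def temporal_markers(n: int) -> List[str]:
    """Return one ordering word per frame: First, Then..., Finally (assumed wording)."""
    if n == 0:
        return []
    if n == 1:
        return ["First"]
    return ["First"] + ["Then"] * (n - 2) + ["Finally"]


def order_line(label: str, items_per_frame: List[List[str]]) -> str:
    """Render one descriptor type as a single line that preserves frame order."""
    markers = temporal_markers(len(items_per_frame))
    parts = [
        f"{marker}, {', '.join(items)}."
        for marker, items in zip(markers, items_per_frame)
        if items  # skip frames that have nothing for this descriptor type
    ]
    return f"{label}: " + " ".join(parts)


def render_video(frames: List[FrameDescriptors]) -> str:
    """Turn all frame descriptors of one video into a temporally ordered text block."""
    return "\n".join([
        order_line("Frame captions", [[f.caption] for f in frames]),
        order_line("Objects", [f.objects for f in frames]),
        order_line("Attributes", [f.attributes for f in frames]),
        order_line("Events", [f.events for f in frames]),
    ])


def build_prompt(
    instruction: str,
    examples: List[Tuple[List[FrameDescriptors], str]],
    query: List[FrameDescriptors],
) -> str:
    """Concatenate the instruction, the solved in-context examples, and the open query."""
    blocks = [instruction]
    for frames, target in examples:
        blocks.append(render_video(frames) + "\nOutput: " + target)
    blocks.append(render_video(query) + "\nOutput:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    example = [
        FrameDescriptors("a man opens a door", ["door"], ["wooden"], ["opening a door"]),
        FrameDescriptors("a man walks into a kitchen", ["kitchen"], [], ["walking"]),
    ]
    query = [
        FrameDescriptors("a woman picks up a knife", ["knife"], ["sharp"], ["cooking"]),
        FrameDescriptors("a woman holds an onion", ["onion"], [], ["cooking"]),
    ]
    print(build_prompt(
        "Predict what happens next in the video.",
        [(example, "The man makes a cup of coffee.")],
        query,
    ))
```

The resulting string would be sent to a language model of your choice, which completes the text after the final "Output:" label. Keeping the descriptors in frame order is what lets a text-only model reason about what happened first and what came later.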


