Thursday Jun 01, 2023

CVPR 2023 - OmniMAE: Single Model Masked Pretraining on Images and Videos

In this episode we discuss OmniMAE: Single Model Masked Pretraining on Images and Videos.

Authors:
- Rohit Girdhar
- Alaaeldin El-Nouby
- Mannat Singh
- Kalyan Vasudev Alwala
- Armand Joulin
- Ishan Misra

Affiliation:
- FAIR, Meta AI

The paper shows that a common architecture can be used to train a single unified model for multiple visual modalities, namely images and videos, using masked autoencoding. The proposed vision transformer learns visual representations that are comparable to or better than those of single-modality models on both image and video benchmarks, without requiring any labeled data. The model can also be trained efficiently by dropping a large proportion of image and video patches during pretraining. With this approach, the paper reports new state-of-the-art performance on the ImageNet and Something Something-v2 video benchmarks.
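To make the mechanism concrete, here is a minimal, hypothetical PyTorch-style sketch of masked pretraining over a shared spatio-temporal patch space. The helper names (patchify, random_mask) and exact patch sizes are assumptions for illustration, not the authors' implementation; the key ideas from the paper are that images are handled as very short videos and that a large fraction of patches is dropped so only the visible ones are encoded.

    # Sketch only: shared patch space for images and videos, with aggressive
    # random patch dropping before the encoder (helper names are hypothetical).
    import torch

    def patchify(video, patch_t=2, patch_hw=16):
        """Split a video (B, C, T, H, W) into flattened spatio-temporal patches."""
        B, C, T, H, W = video.shape
        x = video.reshape(B, C, T // patch_t, patch_t,
                          H // patch_hw, patch_hw, W // patch_hw, patch_hw)
        # -> (B, num_patches, patch_dim)
        x = x.permute(0, 2, 4, 6, 3, 5, 7, 1)
        return x.reshape(B, -1, patch_t * patch_hw * patch_hw * C)

    def random_mask(patches, mask_ratio=0.9):
        """Keep a random subset of patches; the rest are reconstructed by a decoder."""
        B, N, D = patches.shape
        num_keep = int(N * (1 - mask_ratio))
        noise = torch.rand(B, N, device=patches.device)
        keep_idx = noise.argsort(dim=1)[:, :num_keep]
        visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
        return visible, keep_idx

    # A 16-frame clip, and an image replicated to match the temporal patch size
    # so the same model can process both.
    clip = torch.randn(2, 3, 16, 224, 224)
    image = torch.randn(2, 3, 224, 224).unsqueeze(2).repeat(1, 1, 2, 1, 1)

    for x in (clip, image):
        tokens = patchify(x)
        visible, idx = random_mask(tokens, mask_ratio=0.9)
        # The encoder would only see `visible`, which is what makes training cheap.
        print(tokens.shape, "->", visible.shape)

Because roughly 90% of the patches are dropped here, the encoder processes only a small fraction of the tokens, which is the efficiency gain the episode highlights.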

