Thursday, May 18, 2023

CVPR 2023 - MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

In this episode we discuss MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation by Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, and Baining Guo. The paper proposes Multi-Modal Diffusion (MM-Diffusion), a joint audio-video generation framework that produces high-quality, realistic videos with aligned audio. The model consists of two coupled denoising autoencoders and a sequential multi-modal U-Net, and a random-shift-based attention block keeps the two modalities semantically consistent while making cross-modal alignment efficient. MM-Diffusion achieves superior results on unconditional audio-video generation and zero-shot conditional tasks, and Turing tests show that human judges strongly prefer its outputs. Code and pre-trained models are available for download.
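
To make the random-shift attention idea concrete, here is a minimal PyTorch sketch: instead of attending over every audio-video token pair, each video step attends to a short, randomly shifted window of audio tokens, which cuts the cost of cross-modal attention. The function name, the "window" parameter, and the indexing scheme below are illustrative assumptions, not the authors' actual implementation.

import torch
import torch.nn.functional as F

def random_shift_cross_attention(video, audio, window=4):
    # video: (B, Tv, D) video tokens; audio: (B, Ta, D) audio tokens.
    # Each video step attends only to a short, randomly shifted audio
    # window, approximating full cross-modal attention at lower cost.
    B, Tv, D = video.shape
    _, Ta, _ = audio.shape
    window = min(window, Ta)
    # One random shift per forward pass, per the paper's random-shift scheme.
    shift = torch.randint(0, max(Ta - window, 1), (1,)).item()
    # Window start for each video step: proportional position plus the shift.
    starts = (torch.arange(Tv) * Ta // Tv + shift) % max(Ta - window, 1)
    idx = starts.unsqueeze(1) + torch.arange(window)      # (Tv, window)
    keys = audio[:, idx]                                  # (B, Tv, window, D)
    # Scaled dot-product attention of each video token over its audio window.
    scores = torch.einsum('btd,btwd->btw', video, keys) / D ** 0.5
    weights = F.softmax(scores, dim=-1)
    return torch.einsum('btw,btwd->btd', weights, keys)   # (B, Tv, D)

For example, with video of shape (2, 16, 64) and audio of shape (2, 256, 64), the output has shape (2, 16, 64), and each video step attends to only 4 audio tokens rather than all 256.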
