Wednesday Aug 28, 2024
arxiv preprint - Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
In this episode, we discuss Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model by Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, Omer Levy. The paper introduces Transfusion, a method for training multi-modal models using a combination of language modeling and diffusion on mixed-modality sequences. Transfusion models, with up to 7B parameters, show superior scaling and performance on uni- and cross-modal benchmarks compared to traditional image token quantization methods. Additionally, the use of modality-specific encoding and decoding layers allows for significant improvements, enabling high-quality image and text generation.
Comments (0)
To leave or reply to comments, please download free Podbean or
No Comments
To leave or reply to comments,
please download free Podbean App.