Thursday Jun 27, 2024

arXiv preprint - MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

In this episode, we discuss MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning by Xiangyu Zhao, Xiangtai Li, Haodong Duan, Haian Huang, Yining Li, Kai Chen, Hua Yang. The study presents MG-LLaVA, a multimodal large language model (MLLM) that improves visual perception by processing low-resolution features, high-resolution features, and object-centric features together. It adds a high-resolution visual encoder to capture fine-grained details and uses a Conv-Gate fusion network to merge those details with the base visual features. To strengthen object recognition, it also incorporates object-level features derived from bounding boxes produced by offline detectors. In extensive benchmarking, MG-LLaVA outperforms comparable MLLMs, with evaluations covering language encoders ranging from 3.8B to 34B parameters.
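
For listeners who want a concrete picture of what a gated fusion step can look like, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function name gated_fusion, the per-element weights w_low and w_high, and the bias are all hypothetical, and simple scalar weights stand in for the convolution a real Conv-Gate layer would use. The idea shown is only that a learned gate between 0 and 1 decides how much of the high-resolution feature is added to the base feature.

```python
import math


def sigmoid(x):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(low, high, w_low, w_high, bias):
    """Fuse two equal-length feature vectors with a per-element gate.

    Hypothetical sketch: scalar weights replace the convolution that a
    real Conv-Gate layer would apply to the two feature maps.
    """
    lengths = {len(low), len(high), len(w_low), len(w_high), len(bias)}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same length")

    fused = []
    for l, h, wl, wh, b in zip(low, high, w_low, w_high, bias):
        # The gate looks at both features and outputs a value in (0, 1).
        gate = sigmoid(wl * l + wh * h + b)
        # The base feature is kept; the gate scales the fine-grained one.
        fused.append(l + gate * h)
    return fused


if __name__ == "__main__":
    base = [1.0, 2.0]
    fine = [4.0, -2.0]
    zeros = [0.0, 0.0]
    # With zero weights and bias every gate is exactly 0.5.
    print(gated_fusion(base, fine, zeros, zeros, zeros))
```

With all weights and biases at zero, each gate is 0.5, so half of each high-resolution value is added to the base value. A strongly negative bias pushes the gate toward 0 and leaves the base feature almost unchanged.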

