PyTorch Implementation of "Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models"

DreamMr/HR-Bench

Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models

🤗 Dataset | 📖 Paper

This repo contains the official code and dataset for the paper "Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models".

💡 Highlights

  • 🔥 We introduce HR-Bench to systematically evaluate the perception ability of MLLMs in high-resolution (8K resolution) images.
  • 🔥 We propose a training-free framework, $DC^2$, that effectively enhances the perception ability of MLLMs on high-resolution images.

📜 News

[2025.02.08] 🚀 HR-Bench is now supported in the lmms-eval repository.

[2024.12.10] 🥳 Our work was accepted by AAAI 2025.

[2024.09.09] 🚀 HR-Bench is now supported in the VLMEvalKit repository.

[2024.08.29] 🚀 We released the arXiv paper.

[2024.08.23] 🚀 Huggingface Dataset and $DC^2$ code are available!

👀 Introduction

HR-Bench

We find that the highest resolution in existing multimodal benchmarks is only 2K. To address the current lack of high-resolution multimodal benchmarks, we construct HR-Bench. HR-Bench consists of two sub-tasks: Fine-grained Single-instance Perception (FSP) and Fine-grained Cross-instance Perception (FCP). The FSP task includes 100 samples covering attribute recognition, OCR, and visual prompting. The FCP task also comprises 100 samples, covering map analysis, chart analysis, and spatial relationship assessment. We visualize examples from HR-Bench below. 👇

HR-Bench is available in two versions: HR-Bench 8K and HR-Bench 4K. HR-Bench 8K contains images with an average resolution of 8K. For HR-Bench 4K, we manually annotate the coordinates of the objects relevant to each question within the 8K images and crop these images to 4K resolution.
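As a rough illustration of the cropping step, the sketch below computes a fixed-size crop window around an annotated object box and clamps it to the image bounds. The centering rule, the function name `crop_window`, and the concrete pixel sizes (7680×4320 and 3840×2160) are assumptions made for this example; the source only states that the crops are derived from manually annotated coordinates.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def crop_window(obj: Box, image_size: Tuple[int, int],
                crop_size: Tuple[int, int]) -> Box:
    """Return a crop box of `crop_size` centered on `obj`, kept inside the image.

    Hypothetical helper: the centering-and-clamping rule is an assumption,
    not the procedure used to build HR-Bench 4K.
    """
    img_w, img_h = image_size
    # The crop can never be larger than the image itself.
    crop_w = min(crop_size[0], img_w)
    crop_h = min(crop_size[1], img_h)
    # Center of the annotated object.
    cx = (obj[0] + obj[2]) // 2
    cy = (obj[1] + obj[3]) // 2
    # Center the window on the object, then clamp it to the image bounds.
    left = min(max(cx - crop_w // 2, 0), img_w - crop_w)
    top = min(max(cy - crop_h // 2, 0), img_h - crop_h)
    return (left, top, left + crop_w, top + crop_h)


# An object near the top-right corner of an illustrative 7680x4320 image.
window = crop_window((7000, 100, 7200, 300), (7680, 4320), (3840, 2160))
print(window)
```

The returned box uses the (left, top, right, bottom) convention, so it can be passed to an image library's crop routine that accepts the same layout.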

Divide, Conquer and Combine

We observe that most current MLLMs (e.g., LLaVA-v1.5) perceive images at a fixed resolution (e.g., $336\times 336$). This simplification often leads to greater loss of visual information. Based on this finding, we propose a novel training-free framework: Divide, Conquer and Combine ($DC^2$). We first recursively split an image into patches until they reach the resolution defined by the pretrained vision encoder (e.g., $336\times 336$), merging similar patches for efficiency (Divide). Next, we use the MLLM to generate a text description for each image patch and extract the objects mentioned in these descriptions (Conquer). Finally, we filter out hallucinated objects resulting from image division and store the coordinates of the image patches in which the objects appear (Combine). During the inference stage, we retrieve the image patches related to the user prompt to provide accurate text descriptions.
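The recursive splitting in the Divide stage can be sketched as follows. This is a minimal illustration working on box coordinates only, not the repository's implementation: the function name `divide`, the halving rule, and the default target of 336 pixels are assumptions, and the merging of similar patches is omitted.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def divide(box: Box, target: int = 336) -> List[Box]:
    """Recursively halve `box` until both sides are at most `target` pixels.

    Hypothetical sketch of the Divide stage; patch merging is not modeled.
    """
    left, top, right, bottom = box
    width, height = right - left, bottom - top
    if width <= target and height <= target:
        return [box]
    # Split only the sides that are still larger than the target.
    xs = [left, left + width // 2, right] if width > target else [left, right]
    ys = [top, top + height // 2, bottom] if height > target else [top, bottom]
    patches: List[Box] = []
    for y0, y1 in zip(ys, ys[1:]):
        for x0, x1 in zip(xs, xs[1:]):
            patches.extend(divide((x0, y0, x1, y1), target))
    return patches


# A 1344x1344 image is halved twice per side, giving a 4x4 grid of patches.
patches = divide((0, 0, 1344, 1344))
print(len(patches))
```

Each returned box identifies one leaf patch. In the framework described above, the coordinates of such patches are what the Combine stage stores alongside the objects found in them.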

🏆 Mini-Leaderboard

We show a mini-leaderboard here; please see our paper for more results. (👏🏻 New results are welcome. Please submit your results and model/paper links through an issue or pull request.)

| Model | HR-Bench 4K (Acc.) | HR-Bench 8K (Acc.) | Avg. |
| --- | --- | --- | --- |
| Human Baseline 🥇 | 82.0 | 86.8 | 84.4 |
| InternVL-2-llama3-76B w/ our $DC^2$ 🥈 | 70.4 | 63.3 | 66.9 |
| Qwen2VL-7B 🥉 | 66.8 | 66.5 | 66.6 |
| InternVL-2-llama3-76B | 71.0 | 61.4 | 66.2 |
| Gemini 1.5 Flash | 66.8 | 62.8 | 64.8 |
| InternVL-1.5-26B w/ $DC^2$ | 63.4 | 61.3 | 62.3 |
| Qwen2VL-2B | 64.0 | 58.6 | 61.3 |
| InternVL-1.5-26B | 60.6 | 57.9 | 59.3 |
| GPT4o | 59.0 | 55.5 | 57.3 |
| QWen-VL-max | 58.5 | 52.5 | 55.5 |
| Xcomposer2-4kHD-7B | 57.8 | 51.3 | 54.6 |
| LLaVA-HR-X-13B | 53.6 | 46.9 | 50.3 |
| LLaVA-1.6-34B | 52.9 | 47.4 | 50.2 |
| QWen-VL-plus | 53.0 | 46.5 | 49.8 |
| LLaVA-HR-X-7B | 52.0 | 41.6 | 46.8 |

📧 Contact

✒️ Citation

@article{hrbench,
      title={Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models}, 
      author={Wenbin Wang and Liang Ding and Minyan Zeng and Xiabin Zhou and Li Shen and Yong Luo and Dacheng Tao},
      year={2024},
      journal={arXiv preprint},
      url={https://arxiv.org/abs/2408.15556}, 
}

Acknowledgement
