Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models
This repo contains the official code and dataset for the paper "Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models"
- 🔥 We introduce HR-Bench to systematically evaluate the perception ability of MLLMs in high-resolution (8K resolution) images.
- 🔥 We propose a training-free framework $DC^2$ to effectively enhance MLLMs' perception ability on high-resolution images.
[2025.02.08] 🚀 HR-Bench has been supported in the lmms-eval repository.
[2024.12.10] 🥳 Our work was accepted by AAAI 2025.
[2024.09.09] 🚀 HR-Bench has been supported in the VLMEvalKit repository.
[2024.08.29] 🚀 We released the ArXiv paper.
[2024.08.23] 🚀 Huggingface Dataset and code are released.
We find that the highest resolution in existing multimodal benchmarks is only 2K. To address this lack of high-resolution multimodal benchmarks, we construct HR-Bench. HR-Bench consists of two sub-tasks: Fine-grained Single-instance Perception (FSP) and Fine-grained Cross-instance Perception (FCP). The FSP task includes 100 samples covering attribute recognition, OCR, and visual prompting. The FCP task also comprises 100 samples, encompassing map analysis, chart analysis, and spatial relationship assessment. We visualize examples from HR-Bench below. 👇
HR-Bench is available in two versions: HR-Bench 8K and HR-Bench 4K. HR-Bench 8K contains images with an average resolution of 8K. Additionally, we manually annotate the coordinates of objects relevant to the questions within each 8K image and crop these images to 4K resolution, as illustrated in the sketch below.
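As a rough illustration of this cropping step, the sketch below cuts a 4K window out of an 8K image, centered on the annotated object so that the question-relevant region is preserved. The function name, coordinate arguments, and the 3840×2160 target size are assumptions for illustration, not the actual annotation schema or preprocessing code.

```python
from PIL import Image

def crop_around_object(image_path, obj_cx, obj_cy, target_w=3840, target_h=2160):
    """Crop a target_w x target_h window centered on (obj_cx, obj_cy).

    obj_cx / obj_cy: assumed pixel coordinates of the annotated object center.
    """
    img = Image.open(image_path)
    w, h = img.size
    # Clamp the crop box so it stays inside the original image.
    left = min(max(obj_cx - target_w // 2, 0), max(w - target_w, 0))
    top = min(max(obj_cy - target_h // 2, 0), max(h - target_h, 0))
    return img.crop((left, top, min(left + target_w, w), min(top + target_h, h)))
```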
We observe that most current MLLMs (e.g., LLaVA-v1.5) perceive images at a fixed resolution (e.g., 336×336). Downsampling a high-resolution image to such a fixed resolution discards most of its fine-grained visual detail, which motivates our training-free $DC^2$ framework.
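The sketch below outlines the general divide-conquer-combine idea suggested by the framework's name: split the high-resolution image into patches, let the MLLM describe each patch, and combine the patch-level descriptions as extra textual context when answering the question. The `mllm_describe` and `mllm_answer` helpers are hypothetical placeholders, and details such as the patch layout, recursion depth, and how descriptions are merged or filtered differ from the actual $DC^2$ implementation described in the paper.

```python
from PIL import Image

def divide(img, grid=2):
    """Divide: split the image into a grid x grid set of patches."""
    w, h = img.size
    pw, ph = w // grid, h // grid
    return [img.crop((c * pw, r * ph, (c + 1) * pw, (r + 1) * ph))
            for r in range(grid) for c in range(grid)]

def dc2_sketch(img, question, mllm_describe, mllm_answer, min_size=1024):
    """Conceptual divide-conquer-combine loop (hypothetical helpers).

    mllm_describe(patch) -> str      : patch-level text description
    mllm_answer(img, prompt) -> str  : final answer given image + text prompt
    """
    # Conquer: describe each patch, recursing while patches are still large.
    def describe(region):
        if max(region.size) <= min_size:
            return mllm_describe(region)
        return " ".join(describe(p) for p in divide(region))

    # Combine: merge the patch descriptions into the final prompt.
    context = describe(img)
    prompt = f"Image details: {context}\nQuestion: {question}"
    return mllm_answer(img, prompt)
```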
We show a mini-leaderboard here; please refer to our paper for more details. (👏🏻 New results are welcome. Please add your results and model/paper links through an issue or a pull request.)
Model | HR-Bench 4K (Acc.) | HR-Bench 8K (Acc.) | Avg. |
---|---|---|---|
Human Baseline 🥇 | 82.0 | 86.8 | 84.4 |
InternVL-2-llama3-76B w/ our $DC^2$ 🥈 | 70.4 | 63.3 | 66.9 |
Qwen2VL-7B 🥉 | 66.8 | 66.5 | 66.6 |
InternVL-2-llama3-76B | 71.0 | 61.4 | 66.2 |
Gemini 1.5 Flash | 66.8 | 62.8 | 64.8 |
InternVL-1.5-26B w/ our $DC^2$ | 63.4 | 61.3 | 62.3 |
Qwen2VL-2B | 64.0 | 58.6 | 61.3 |
InternVL-1.5-26B | 60.6 | 57.9 | 59.3 |
GPT4o | 59.0 | 55.5 | 57.3 |
QWen-VL-max | 58.5 | 52.5 | 55.5 |
Xcomposer2-4kHD-7B | 57.8 | 51.3 | 54.6 |
LLaVA-HR-X-13B | 53.6 | 46.9 | 50.3 |
LLaVA-1.6-34B | 52.9 | 47.4 | 50.2 |
QWen-VL-plus | 53.0 | 46.5 | 49.8 |
LLaVA-HR-X-7B | 52.0 | 41.6 | 46.8 |
- Wenbin Wang: [email protected]
@article{hrbench,
title={Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models},
author={Wenbin Wang and Liang Ding and Minyan Zeng and Xiabin Zhou and Li Shen and Yong Luo and Dacheng Tao},
year={2024},
journal={arXiv preprint},
url={https://arxiv.org/abs/2408.15556},
}
- This work is built upon the VLMEvalKit codebase.