This directory contains the evaluation scripts and benchmarks for Ross. The benchmarks cover general multimodal understanding, hallucination, OCR and document parsing, and pure-language tasks.
We use VLMEvalKit to evaluate Ross on the following benchmarks:
- POPE
- HallusionBench
- MMBench (English and Chinese)
- SEED
- MMMU
- AI2D
- OCRBench
- RealWorldQA
```bash
cd VLMEvalKit
pip install -r requirements.txt
CUDA_LAUNCH_BLOCKING=1 torchrun --nproc-per-node=8 run.py \
    --data POPE HallusionBench MMBench_DEV_EN MMBench_DEV_CN SEEDBench_IMG MMMU_DEV_VAL AI2D_TEST OCRBench RealWorldQA \
    --model ross-qwen2-7b ross-vicuna-13b \
    --judge exact_matching
```
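VLMEvalKit writes per-dataset prediction and score files under an output directory. Below is a minimal sketch for gathering the headline scores into one table; it assumes the default `outputs/<model>/` layout with `*_acc.csv` score files, which may differ across VLMEvalKit versions, so check the paths produced by your run.

```python
# Sketch: aggregate VLMEvalKit score files into a single summary table.
# ASSUMPTION: scores live at outputs/<model>/<model>_<dataset>_acc.csv
# with an "Overall" column; verify against your VLMEvalKit version.
from pathlib import Path

import pandas as pd

def collect_scores(output_root: str, model: str) -> pd.DataFrame:
    rows = []
    for csv_path in sorted(Path(output_root, model).glob("*_acc.csv")):
        df = pd.read_csv(csv_path)
        # Prefer the "Overall" column when present; otherwise fall back
        # to the last column of the first row.
        score = df["Overall"].iloc[0] if "Overall" in df.columns else df.iloc[0, -1]
        rows.append({"file": csv_path.name, "score": score})
    return pd.DataFrame(rows)

if __name__ == "__main__":
    print(collect_scores("outputs", "ross-qwen2-7b"))
```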
We use lmms-eval to evaluate Ross on the following benchmarks:
- ChartQA
- DocVQA
- InfoVQA
- TextVQA
- GQA
- MMLU
- HellaSwag
- IFEval
```bash
cd lmms-eval
pip install -e .
python -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model ross \
    --model_args "pretrained=HaochenWang/ross-qwen2-7b,conv_template=qwen_2,device_map=auto" \
    --tasks chartqa,docvqa_val,infovqa_val,textvqa_val,gqa,mmlu,hellaswag,ifeval \
    --batch_size 1 \
    --log_samples \
    --output_path ./results/ross-qwen2-7b
```
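With `--output_path` set, lmms-eval saves an aggregated results JSON (plus per-sample logs from `--log_samples`) under that directory. A minimal sketch for printing per-task metrics, assuming a results file with a top-level `"results"` dict mapping task name to metrics; the exact filename (often timestamped) and schema can vary between lmms-eval versions:

```python
# Sketch: print per-task metrics from an lmms-eval results file.
# ASSUMPTION: a JSON file under the output directory contains a
# top-level "results" dict of {task: {metric: value}}; adjust the
# glob pattern and keys to your lmms-eval version.
import json
from pathlib import Path

def summarize(results_dir: str) -> None:
    # Pick the most recently sorted results JSON under the directory.
    candidates = sorted(Path(results_dir).rglob("*results*.json"))
    if not candidates:
        raise FileNotFoundError(f"no results JSON under {results_dir}")
    payload = json.loads(candidates[-1].read_text())
    for task, metrics in payload.get("results", {}).items():
        scores = {k: v for k, v in metrics.items() if isinstance(v, (int, float))}
        print(f"{task}: {scores}")

if __name__ == "__main__":
    summarize("./results/ross-qwen2-7b")
```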
The MMVP evaluation follows the implementation in Cambrian-1. Generate the answers first, then score them:
```bash
cd MMVP
# Step 1: generate answers for all MMVP questions.
python mmvp_eval.py \
    --model_path HaochenWang/ross-qwen2-7b \
    --conv_mode qwen_2 \
    --answers_file ./answers/ross-qwen2-7b.jsonl
# Step 2: score the generated answers.
python mmvp_test.py \
    --answers_file ./answers/ross-qwen2-7b.jsonl \
    --csv_file ./all_results.csv
```
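MMVP questions come in pairs, and the standard metric counts a pair as correct only when both questions in it are answered correctly, which is what `mmvp_test.py` computes. The sketch below only illustrates that pairing rule; the field names (`question_id`, `correct`) are assumptions for illustration and should be matched against the actual JSONL schema produced by `mmvp_eval.py`.

```python
# Sketch: compute MMVP paired accuracy from the answers JSONL.
# ASSUMPTION: each line has "question_id" (consecutive ids form a pair)
# and a boolean "correct" field; match these to mmvp_eval.py's output.
import json

def paired_accuracy(answers_file: str) -> float:
    records = [json.loads(line) for line in open(answers_file)]
    records.sort(key=lambda r: r["question_id"])
    pairs_correct = 0
    # Consecutive questions (1, 2), (3, 4), ... form a pair; a pair
    # scores only if both of its answers are correct.
    for a, b in zip(records[0::2], records[1::2]):
        if a["correct"] and b["correct"]:
            pairs_correct += 1
    return pairs_correct / (len(records) // 2)

if __name__ == "__main__":
    acc = paired_accuracy("./answers/ross-qwen2-7b.jsonl")
    print(f"MMVP paired accuracy: {acc:.2%}")
```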