Congratulations on the impressive work with R1-Onevision! Enhancing visual reasoning in MLLMs is certainly a key step toward improving their capabilities, and your progress in this area is commendable.
I would like to suggest expanding the evaluation of R1-Onevision to the HumanEval-V benchmark. This benchmark provides a more challenging set of tasks by introducing complex diagrams paired with coding challenges. Unlike traditional visual reasoning tasks that focus on answering multiple-choice questions or providing short answers, HumanEval-V requires models to generate code based on visual input, which better tests both instruction-following and open-ended generation abilities.
Key points for consideration:
- HumanEval-V expands the reasoning scenarios with complex diagrams, pushing the limits of visual understanding.
- The task format is tailored to code generation, making it a suitable benchmark for testing MLLMs’ ability to handle more structured, generative tasks.
- Evaluating R1-Onevision on this benchmark would provide valuable insight into how well it handles visual reasoning combined with code generation.
You can find more information about the benchmark here: HumanEval-V Homepage.
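To make the suggestion a bit more concrete, below is a rough sketch of what an evaluation loop over HumanEval-V could look like for R1-Onevision. The dataset ID, record fields (`image`, `function_signature`), and the `query_r1_onevision` helper are assumptions made purely for illustration; the official HumanEval-V harness should be the source of truth for prompting and pass@k scoring.

```python
# Rough sketch of a HumanEval-V-style evaluation loop for R1-Onevision.
# The dataset ID, record fields, and query_r1_onevision() are illustrative
# assumptions, not the official harness -- see the HumanEval-V homepage.
import re

from datasets import load_dataset  # Hugging Face `datasets` library


def query_r1_onevision(image, prompt: str) -> str:
    """Placeholder: send an (image, prompt) pair to R1-Onevision and return its reply."""
    raise NotImplementedError("Hook this up to the R1-Onevision inference stack.")


def extract_code(reply: str) -> str:
    """Return the first fenced code block from a free-form reply, else the whole reply."""
    fence = "`" * 3
    match = re.search(fence + r"(?:python)?\s*(.*?)" + fence, reply, re.DOTALL)
    return match.group(1).strip() if match else reply.strip()


def main() -> None:
    # Assumed dataset ID and field names -- verify against the benchmark release.
    tasks = load_dataset("HumanEval-V/HumanEval-V-Benchmark", split="test")
    completions = []
    for task in tasks:
        prompt = (
            "Study the diagram and complete the function below.\n"
            f"{task['function_signature']}"
        )
        reply = query_r1_onevision(task["image"], prompt)
        completions.append(extract_code(reply))
    # pass@k scoring would execute each completion against the benchmark's
    # test cases; that step is deliberately left to the official harness.
    print(f"Collected {len(completions)} completions for scoring.")


if __name__ == "__main__":
    main()
```

Even this minimal loop would exercise the two abilities highlighted above, instruction following and open-ended code generation, while sandboxed execution and pass@k scoring remain the job of the official harness.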