
Discrepancy in ChartQA numbers reported for InternVL2-8B #1

Open
varadgunjal opened this issue Oct 31, 2024 · 3 comments

@varadgunjal

Firstly, I wanted to say the paper was a great read. Thank you for the excellent work!

I noticed that Table 4 reports the ChartQA performance of InternVL2-8B as 73.80, whereas the InternVL website reports it as 83.3. Can you comment on this discrepancy?

@hewei2001
Owner

hewei2001 commented Oct 31, 2024

Hi there! 😊

Thank you so much for your kind words about our work and for taking the time to bring up this question. It’s always great to see people engaging with the details of the paper.

As for the discrepancy you noticed in the ChartQA results, we think the difference likely arises from variations in evaluation methods. The website you referenced used VLMEvalKit for testing, which, as far as we know, includes a prompt like “Answer the question using a single word or phrase.” However, that approach doesn’t align well with the step-by-step reasoning framework we aim for. This perspective is also shared by the official releases of recent models such as GPT-4o and Claude 3.5, which recommend a Zero-shot CoT (Chain-of-Thought) setup. For our results, we appended “Let’s think step by step.” to each question to encourage this type of reasoning.
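
For concreteness, here is a minimal sketch of that prompting difference (the helper name `query_mllm` and the example question are purely illustrative, not our actual evaluation code):

```python
# Sketch of the Zero-shot CoT prompting described above (illustrative only).
COT_SUFFIX = "Let's think step by step."
SHORT_ANSWER_SUFFIX = "Answer the question using a single word or phrase."  # VLMEvalKit-style

def build_cot_prompt(question: str) -> str:
    """Append the Zero-shot CoT trigger instead of the short-answer instruction."""
    return f"{question}\n{COT_SUFFIX}"

# Hypothetical usage with some model-query helper:
# response = query_mllm(image_path, build_cot_prompt("Which year has the highest value?"))
```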

Additionally, we evaluated the responses using an LLM-as-a-Judge approach based on GPT-4o, rather than relying on exact answer matches as some toolkits do. We believe this provides a more flexible and nuanced evaluation.
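
To give a concrete picture of what we mean by LLM-as-a-Judge, here is a rough sketch (the judge prompt wording below is illustrative, not the exact prompt we used):

```python
# Illustrative LLM-as-a-Judge check with GPT-4o; requires the `openai` package
# and an OPENAI_API_KEY in the environment. Not the paper's exact judge prompt.
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = (
    "You are grading an answer to a chart question.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model response: {response}\n"
    "Reply with exactly 'correct' or 'incorrect'. Judge semantic equivalence, "
    "ignoring formatting differences and any extra reasoning text."
)

def judge(question: str, reference: str, response: str) -> bool:
    """Ask GPT-4o whether the model response matches the reference answer."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(
                question=question, reference=reference, response=response
            ),
        }],
        temperature=0,
    )
    verdict = completion.choices[0].message.content.strip().lower()
    return verdict.startswith("correct")
```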

These combined differences in evaluation might explain the resulting gap. Hope this helps, and feel free to reach out with any further questions!

@varadgunjal
Author

varadgunjal commented Oct 31, 2024

Got it. Thanks for the explanation.

In that case, did you also run (or do you plan to run in the near future) an evaluation with the "original" ChartQA protocol, to allow an apples-to-apples comparison with other existing results? Maybe something like prompting the model to end with a template such as "The answer is C" and parsing it accordingly. I would guess that, to be fully fair, this would also require a similar CoT evaluation of the original InternVL model.
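
Something roughly like this, just to sketch the idea (hypothetical parsing code, not taken from any existing eval script):

```python
# Hypothetical sketch: extract a final answer from a CoT response that ends
# with a template like "The answer is X", for exact or relaxed matching.
import re
from typing import Optional

ANSWER_RE = re.compile(r"[Tt]he answer is\s*:?\s*(.+?)\s*\.?\s*$")

def extract_final_answer(response: str) -> Optional[str]:
    """Return the text following the last 'The answer is' clause, if any."""
    lines = response.strip().splitlines()
    last_line = lines[-1] if lines else ""
    matches = ANSWER_RE.findall(last_line)
    return matches[-1].strip() if matches else None

print(extract_final_answer("Reading the chart, the tallest bar is 73.8. The answer is 73.8."))  # -> 73.8
```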

Perhaps that would help further validate the benefits of this data by allowing comparison against a larger swath of published results?

@hewei2001
Owner

Thank you for the thoughtful suggestion! 😊

At the moment, we don’t plan to adopt the "original" method for ChartQA evaluation, as we feel this approach may be a bit outdated. ChartQA was developed in early 2022, when evaluation methods were tailored to smaller models. With today’s more advanced MLLMs, these older methods might no longer capture the full capabilities of the models. We expect that many evaluation frameworks will evolve to reflect these changes, as seen with recent efforts like MathVista and CharXiv.

That said, your suggestion of prompting the model to conclude with a phrase like "The answer is C" to simplify parsing is a great idea! We'll consider incorporating that into our future work. Thank you again for your feedback—it’s very helpful.
