
Discrepancy in ChartQA numbers reported for InternVL2-8B #1

Open
varadgunjal opened this issue Oct 31, 2024 · 3 comments

@varadgunjal

Firstly, I wanted to say the paper was a great read. Thank you for the excellent work!

I noticed that Table 4 reports the ChartQA performance of InternVL2-8B as 73.80, whereas the InternVL website reports it as 83.3. Can you comment on this discrepancy?

@hewei2001
Owner

hewei2001 commented Oct 31, 2024

Hi there! 😊

Thank you so much for your kind words about our work and for taking the time to bring up this question. It’s always great to see people engaging with the details of the paper.

As for the discrepancy you noticed in the ChartQA results, we think the difference likely arises from variations in evaluation methods. The website you referenced used VLMEvalKit for testing, which, as far as we know, includes a prompt like “Answer the question using a single word or phrase.” However, that approach doesn’t align well with the step-by-step reasoning framework we aim for. This perspective is also shared by the official releases of recent models such as GPT-4o and Claude 3.5, which recommend a Zero-shot CoT (Chain-of-Thought) setup. For our results, we appended “Let’s think step by step.” to each question to encourage this type of reasoning.
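
For concreteness, here is a minimal sketch of that prompting difference (the helper name `query_mllm` and the example question are purely illustrative, not our actual evaluation code):

```python
# Sketch of the Zero-shot CoT prompting described above (illustrative only).
COT_SUFFIX = "Let's think step by step."
SHORT_ANSWER_SUFFIX = "Answer the question using a single word or phrase."  # VLMEvalKit-style

def build_cot_prompt(question: str) -> str:
    """Append the Zero-shot CoT trigger instead of the short-answer instruction."""
    return f"{question}\n{COT_SUFFIX}"

# Hypothetical usage with some model-query helper:
# response = query_mllm(image_path, build_cot_prompt("Which year has the highest value?"))
```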

Additionally, we evaluated the responses using an LLM-as-a-Judge approach based on GPT-4o, rather than relying on exact answer matches as some toolkits do. We believe this provides a more flexible and nuanced evaluation.
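
To give a concrete picture of what we mean by LLM-as-a-Judge, here is a rough sketch (the judge prompt wording below is illustrative, not the exact prompt we used):

```python
# Illustrative LLM-as-a-Judge check with GPT-4o; requires the `openai` package
# and an OPENAI_API_KEY in the environment. Not the paper's exact judge prompt.
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = (
    "You are grading an answer to a chart question.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model response: {response}\n"
    "Reply with exactly 'correct' or 'incorrect'. Judge semantic equivalence, "
    "ignoring formatting differences and any extra reasoning text."
)

def judge(question: str, reference: str, response: str) -> bool:
    """Ask GPT-4o whether the model response matches the reference answer."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(
                question=question, reference=reference, response=response
            ),
        }],
        temperature=0,
    )
    verdict = completion.choices[0].message.content.strip().lower()
    return verdict.startswith("correct")
```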

These combined differences in evaluation might explain the resulting gap. Hope this helps, and feel free to reach out with any further questions!

@varadgunjal
Author

varadgunjal commented Oct 31, 2024

Got it. Thanks for the explanation.

In that case, did you also run (or do you plan to run in the near future) an evaluation with the "original" ChartQA protocol, to allow an apples-to-apples comparison with other existing results? Maybe something like prompting the model to end with a template such as "The answer is C" and parsing it accordingly. I would guess that, to be fully fair, this would also require a similar CoT evaluation of the original InternVL model.
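
Something roughly like this, just to sketch the idea (hypothetical parsing code, not taken from any existing eval script):

```python
# Hypothetical sketch: extract a final answer from a CoT response that ends
# with a template like "The answer is X", for exact or relaxed matching.
import re
from typing import Optional

ANSWER_RE = re.compile(r"[Tt]he answer is\s*:?\s*(.+?)\s*\.?\s*$")

def extract_final_answer(response: str) -> Optional[str]:
    """Return the text following the last 'The answer is' clause, if any."""
    lines = response.strip().splitlines()
    last_line = lines[-1] if lines else ""
    matches = ANSWER_RE.findall(last_line)
    return matches[-1].strip() if matches else None

print(extract_final_answer("Reading the chart, the tallest bar is 73.8. The answer is 73.8."))  # -> 73.8
```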

Perhaps that would help further validate the benefits of this data by allowing comparison against a larger swath of published results?

@hewei2001
Owner

Thank you for the thoughtful suggestion! 😊

At the moment, we don’t plan to adopt the "original" method for ChartQA evaluation, as we feel this approach may be a bit outdated. ChartQA was developed in early 2022, when evaluation methods were tailored to smaller models. With today’s more advanced MLLMs, these older methods might no longer capture the full capabilities of the models. We expect that many evaluation frameworks will evolve to reflect these changes, as seen with recent efforts like MathVista and CharXiv.

That said, your suggestion of prompting the model to conclude with a phrase like "The answer is C" to simplify parsing is a great idea! We'll consider incorporating that into our future work. Thank you again for your feedback—it’s very helpful.
