Discrepancy in MM Math Evaluation Results: Possible Issue with Answer Extraction in VLMEvalkit #638
Comments
Hi, @mantle2048,
Sure, I would be happy to!
A minimal script:

    scorer = AutoScoringJudge()  # AutoScoring
    solution = "$\\therefore$ k = \\boxed{-6}.$"
    prediction = "$\\therefore$ k = \\boxed{-6}. \\\\\n k = \\boxed{-6}.$"
    if "\\boxed{" in solution and "\\boxed{" in prediction:
        processed_solution, processed_prediction = scorer.preprocess(solution, prediction)
        print("processed_solution:", processed_solution)
        print("processed_prediction:", processed_prediction)
    judge = scorer.judge(solution, prediction)
    print("judge:", judge)
    # ================================
    # Output
    # processed_solution: -6
    # processed_prediction: -6,-6
    # judge: False

This should be a correct prediction, but …
Hi, @mantle2048,
It seems that I underestimated how capable GPT-4o is. Repeated \boxed output is more common in weaker, smaller models. However, the large gap from the results reported in the original paper is still strange. We can close this issue for now and reopen it once I have further findings.
OK, I will also try to contact the paper authors for more information.
Hi, everyone.
I’ve noticed a significant discrepancy between the evaluation results of the MM Math dataset and the results reported in the original paper.
In the original MM Math paper, GPT-4o achieved an accuracy of 31.8%, but the evaluation result using VLMEvalKit is only 22.5%, a gap of 9.3 percentage points.
This difference doesn’t seem to be caused by randomness in the model's outputs. Upon reviewing the code, I found that VLMEvalKit uses the answer extraction code provided by the original MM Math repository.
However, the current code seems to match every occurrence of an answer enclosed in \boxed{} in the output (see this line). If the model states the correct answer more than once, every occurrence is extracted, so the prediction may no longer match the single reference answer and be judged incorrect.
I believe this might be the reason for the performance discrepancy compared to the original MM Math paper.
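To make the suspected failure mode concrete, here is a minimal, self-contained sketch. It does not reproduce the actual `AutoScoringJudge` code; the helper names `extract_boxed`, `join_all`, `join_unique`, and `last_only` are hypothetical and only illustrate how joining every `\boxed{}` match differs from de-duplicating the matches or keeping only the last one.

```python
def extract_boxed(text: str) -> list[str]:
    """Return the contents of every \\boxed{...} in text, handling nested braces."""
    marker = "\\boxed{"
    results = []
    start = text.find(marker)
    while start != -1:
        i = start + len(marker)
        depth = 1
        # Walk forward until the brace opened by \boxed{ is closed.
        while i < len(text) and depth > 0:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth == 0:
            results.append(text[start + len(marker): i - 1])
        start = text.find(marker, i)
    return results


def join_all(text: str) -> str:
    """Hypothetical: join every boxed answer, repeats included."""
    return ",".join(extract_boxed(text))


def join_unique(text: str) -> str:
    """Hypothetical: join boxed answers after dropping repeats, keeping order."""
    return ",".join(dict.fromkeys(extract_boxed(text)))


def last_only(text: str) -> str:
    """Hypothetical: keep only the final boxed answer."""
    boxed = extract_boxed(text)
    return boxed[-1] if boxed else ""


solution = "$\\therefore$ k = \\boxed{-6}.$"
prediction = "$\\therefore$ k = \\boxed{-6}. \\\\\n k = \\boxed{-6}.$"

for name, fn in [("join_all", join_all), ("join_unique", join_unique), ("last_only", last_only)]:
    print(name, fn(solution) == fn(prediction))
```

With these inputs, only `join_all` produces different strings for the solution and the prediction. Note that de-duplicating or keeping only the last match changes behavior for questions whose reference answer legitimately contains several boxed values, so either variant would need to be checked against the dataset before being adopted.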
Additionally, since the answers in this dataset are open-ended, using an LLM for answer extraction and comparison might be a better option than the hardcoded matching approach.
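As a rough sketch of what LLM-based comparison could look like, the snippet below builds a judging prompt and parses a one-word verdict. Everything here is an assumption for illustration: the prompt wording, the `build_judge_prompt` and `llm_judge` names, and the `ask_llm` callable (a stand-in for whatever model client is used) are not part of VLMEvalKit or the MM Math repository.

```python
from typing import Callable

# Hypothetical prompt; the wording is illustrative, not taken from any existing judge.
JUDGE_TEMPLATE = (
    "You are grading a math answer.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model response: {response}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)


def build_judge_prompt(question: str, reference: str, response: str) -> str:
    """Fill the judging template with one sample."""
    return JUDGE_TEMPLATE.format(question=question, reference=reference, response=response)


def llm_judge(question: str, reference: str, response: str,
              ask_llm: Callable[[str], str]) -> bool:
    """Return True if the judge model's reply starts with CORRECT.

    ask_llm is a hypothetical callable that sends a prompt to some model
    and returns its text reply.
    """
    reply = ask_llm(build_judge_prompt(question, reference, response))
    return reply.strip().upper().startswith("CORRECT")


# Usage with a stub in place of a real model call.
verdict = llm_judge(
    "Find k.",
    "\\boxed{-6}",
    "k = \\boxed{-6}. \\\\\n k = \\boxed{-6}.",
    ask_llm=lambda prompt: "CORRECT",
)
print(verdict)
```

Passing the model call in as a parameter keeps the sketch independent of any particular API and makes the parsing logic testable with a stub. A real judge would also need a policy for replies that are neither CORRECT nor INCORRECT.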