MMLU-Pro result weirdness #8450
-
MMLU-Pro is based on CoT, and CoT responses can diverge a lot, giving different answers for small differences in numerical noise (butterfly effect). I exclusively run non-imatrix quants because I don't trust that the imatrix doesn't mung the model worse than it helps for the quants I use. I saw a couple of cases where it was definitely munging the performance, so I dropped it completely and started doing all my own non-imatrix quants.

I also think the TIGER-Lab MMLU-Pro prompts are not optimal. They prepend a whole bunch of irrelevant 5-shot filler before the actual question, which just confuses the model and skews the results. I ran a set of custom zero-shot CoT prompts on Gemma 2 9B, and I believe those results are actually representative of what the model can do on this extremely challenging test set. There is still pretty high variation even among closely related quants, but I think that's to be expected on CoT runs.
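To make the butterfly-effect point concrete, here's a toy sketch (not anything from the actual harness or llama.cpp) showing how noise on the order of quantization error can flip a greedy token choice when the top two logits are nearly tied; one flipped token early in a CoT trace then changes everything after it:

```python
# Toy illustration: tiny numerical noise in the logits can flip a greedy
# token choice when the top two candidates are nearly tied.
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=32000)            # fake logits over a 32k vocab
# Force the top two candidates into a near-tie, as often happens mid-sentence.
top2 = np.argsort(logits)[-2:]             # indices of 2nd-largest, largest
logits[top2[0]] = logits[top2[1]] - 1e-4   # gap of only 1e-4

baseline = int(np.argmax(logits))
flips = 0
for _ in range(1000):
    # Noise scale loosely standing in for quantization-level error.
    noisy = logits + rng.normal(scale=1e-3, size=logits.shape)
    flips += int(np.argmax(noisy) != baseline)

print(f"greedy choice flipped in {flips}/1000 noisy samples")
```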
-
I have been using Ollama-MMLU-Pro to run the test on both imatrix and static Q5_K_M Llama3-8B-Instruct models. I used the static models as a baseline for comparison against the imatrix models, but found some really weird results that I don't understand, like how differently each model scores in Computer Science.
"+" & "-" are the percentile differences from the score of the Non I-matrix run 1
Testing was done using LM Studio as the API server. I used all the default settings for the MMLU-Pro test, only changing where the results were saved.
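For anyone setting this up, the harness just talks to an OpenAI-compatible endpoint. A minimal sanity check against LM Studio looks roughly like this (the base URL is LM Studio's default, and the model id is a placeholder; substitute whatever your server reports):

```python
# Minimal request against LM Studio's OpenAI-compatible server.
import requests

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",   # LM Studio default port
    json={
        "model": "llama3-8b-instruct-q5_k_m",      # placeholder model id
        "messages": [{"role": "user", "content": "Say hello."}],
        "temperature": 0.0,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```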
I also ran Mradermacher's imatrix quant on just Computer Science twice more. Run 1 scored 18.78, run 2 scored 20.00, and run 3 scored 18.54, for an average of 19.10, which is 42.05% lower than non-imatrix run 1 and 37.96% lower than non-imatrix run 2.
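For clarity, here's the arithmetic behind those numbers. The non-imatrix baseline scores aren't listed in this post, so the baseline used below is back-solved from the quoted 42.05% and is purely illustrative:

```python
# Reproducing the arithmetic above. The exact mean is 19.1067
# (quoted above, slightly truncated, as 19.10).
imatrix_runs = [18.78, 20.00, 18.54]
avg = sum(imatrix_runs) / len(imatrix_runs)
print(f"average: {avg:.4f}")                    # 19.1067

def pct_lower(baseline, score):
    """Relative drop of `score` below `baseline`, in percent."""
    return (baseline - score) / baseline * 100

# ~32.97 is back-solved from the stated 42.05%, illustration only.
print(f"{pct_lower(32.97, avg):.2f}% lower")    # 42.05
```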
In case anyone wants to run MMLU-Pro themselves: it takes a while, nearly 5 hours on an RTX 2080.
One run had 15,852,596 prompt tokens and 651,281 completion tokens.
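Back-of-the-envelope throughput from those counts and the ~5-hour runtime:

```python
# Rough throughput implied by one ~5-hour run on the RTX 2080.
prompt_toks, completion_toks = 15_852_596, 651_281
total = prompt_toks + completion_toks
print(f"{total / (5 * 3600):,.0f} tokens/s overall")   # ~917
# Prompt tokens outnumber completion tokens ~24:1, so prompt processing,
# not generation, dominates the wall-clock time.
```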
I don't really understand how the imatrix actually works; I just thought the results were interesting and wanted to know if anyone has an idea what might be causing this.