MMLU-Pro result weirdness #8450
-
MMLU-Pro is based on CoT, and CoT responses can diverge a lot, giving different answers for small differences in numerical noise (butterfly effect). I exclusively run non-imatrix quants because I don't trust that the imatrix doesn't mung the model worse than it helps for the quants I use. I saw a couple of cases where it was definitely munging the performance, so I dropped it completely and started doing all my own non-imatrix quants.

I also think the TIGER-Lab MMLU-Pro prompts are not optimal. They prepend a whole bunch of irrelevant 5-shot filler before the actual question, which just confuses the model and skews the results. I ran a set of custom zero-shot CoT prompts on Gemma 2 9B, and I believe those results are actually representative of what the model can do on this extremely challenging test set. There is still pretty high variation even among closely related quants, but I think that's to be expected on CoT runs.
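To make the butterfly-effect point concrete, here's a toy sketch (not anything from the actual harness or llama.cpp) showing how noise on the order of quantization error can flip a greedy token choice when the top two logits are nearly tied; one flipped token early in a CoT trace then changes everything after it:

```python
# Toy illustration: tiny numerical noise in the logits can flip a greedy
# token choice when the top two candidates are nearly tied.
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=32000)            # fake logits over a 32k vocab
# Force the top two candidates into a near-tie, as often happens mid-sentence.
top2 = np.argsort(logits)[-2:]             # indices of 2nd-largest, largest
logits[top2[0]] = logits[top2[1]] - 1e-4   # gap of only 1e-4

baseline = int(np.argmax(logits))
flips = 0
for _ in range(1000):
    # Noise scale loosely standing in for quantization-level error.
    noisy = logits + rng.normal(scale=1e-3, size=logits.shape)
    flips += int(np.argmax(noisy) != baseline)

print(f"greedy choice flipped in {flips}/1000 noisy samples")
```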
-
I have been using Ollama-MMLU-Pro to run the test on both imatrix and static Q5_K_M Llama3-8B-Instruct models. I used the static models as a baseline for comparison against the imatrix models, but found some really weird results that I don't understand, like how differently each model scores in Computer Science.
"+" & "-" are the percentile differences from the score of the Non I-matrix run 1
Testing was done using LM Studio as the API server. I used all the default settings for the MMLU-Pro test, only changing where the results were saved.
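For anyone setting this up, the harness just talks to an OpenAI-compatible endpoint. A minimal sanity check against LM Studio looks roughly like this (the base URL is LM Studio's default, and the model id is a placeholder; substitute whatever your server reports):

```python
# Minimal request against LM Studio's OpenAI-compatible server.
import requests

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",   # LM Studio default port
    json={
        "model": "llama3-8b-instruct-q5_k_m",      # placeholder model id
        "messages": [{"role": "user", "content": "Say hello."}],
        "temperature": 0.0,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```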
I also ran Mradermacher's imatrix quant on just Computer Science twice more. Run 1 scored 18.78, run 2 scored 20.00, and run 3 scored 18.54, for an average of 19.10, which is 42.05% lower than non-imatrix run 1 and 37.96% lower than non-imatrix run 2.
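For clarity, here's the arithmetic behind those numbers. The non-imatrix baseline scores aren't listed in this post, so the baseline used below is back-solved from the quoted 42.05% and is purely illustrative:

```python
# Reproducing the arithmetic above. The exact mean is 19.1067
# (quoted above, slightly truncated, as 19.10).
imatrix_runs = [18.78, 20.00, 18.54]
avg = sum(imatrix_runs) / len(imatrix_runs)
print(f"average: {avg:.4f}")                    # 19.1067

def pct_lower(baseline, score):
    """Relative drop of `score` below `baseline`, in percent."""
    return (baseline - score) / baseline * 100

# ~32.97 is back-solved from the stated 42.05%, illustration only.
print(f"{pct_lower(32.97, avg):.2f}% lower")    # 42.05
```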
In case anyone wants to run MMLU-Pro themselves: it takes a while, nearly 5 hours on an RTX 2080.
One run had 15,852,596 prompt tokens and 651,281 completion tokens.
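Back-of-the-envelope throughput from those counts and the ~5-hour runtime:

```python
# Rough throughput implied by one ~5-hour run on the RTX 2080.
prompt_toks, completion_toks = 15_852_596, 651_281
total = prompt_toks + completion_toks
print(f"{total / (5 * 3600):,.0f} tokens/s overall")   # ~917
# Prompt tokens outnumber completion tokens ~24:1, so prompt processing,
# not generation, dominates the wall-clock time.
```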
I don't really understand how the imatrix actually works; I just thought the results were interesting and wanted to know if anyone has an idea what might be causing this.