Buggy Benchmark Generation #183
Comments
It's probably because the model checkpoint you're evaluating was trained on a different version of the dataset. The train/val/test split differs for each version, so you're essentially testing on the training data.
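For concreteness, here is a minimal sketch of how one could quantify that kind of leakage. The directory layout and the JSON field names (`file_path`, `full_name`) are assumptions based on the LeanDojo Benchmark 4 format, and the paths are hypothetical:

```python
import json
from pathlib import Path

# Hypothetical paths: the split the checkpoint was trained on vs. the
# split being evaluated. Adjust to wherever the two datasets live.
TRAIN_JSON = Path("leandojo_benchmark_4_v9/random/train.json")
TEST_JSON = Path("generated_dataset/random/test.json")

def theorem_keys(path: Path) -> set[tuple[str, str]]:
    """Identify each theorem by (file_path, full_name)."""
    with path.open() as f:
        return {(thm["file_path"], thm["full_name"]) for thm in json.load(f)}

train, test = theorem_keys(TRAIN_JSON), theorem_keys(TEST_JSON)
overlap = train & test
print(f"{len(overlap)}/{len(test)} test theorems also occur in the training split "
      f"({100 * len(overlap) / max(len(test), 1):.1f}% leakage)")
```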
I am using the same model checkpoint from HuggingFace for both datasets, so shouldn't the performance still be the same, assuming that both datasets are on commit …?
Do you know which mathlib commit was used to generate the dataset you downloaded from Zenodo? You can find it out in …
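One way to recover that information, assuming each theorem entry in the benchmark JSON records the repo URL and commit it was traced from (the path below is hypothetical, and the `url`/`commit` field names are an assumption about the dataset layout):

```python
import json

# Assumption: each theorem entry carries "url" and "commit" fields
# identifying the mathlib revision it was traced from.
with open("leandojo_benchmark_4_v9/random/test.json") as f:
    theorems = json.load(f)

sources = {(thm["url"], thm["commit"]) for thm in theorems}
print(sources)  # expect a single (url, commit) pair for the whole dataset
```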
So it seems that v9 was trained on commit ….
So you generated a new dataset from …. What model checkpoint are you using?
I am using the retriever trained on the ….
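For reference, here is roughly how the public ReProver retriever checkpoint is loaded and used, following its HuggingFace model card. That this is the checkpoint under discussion is an assumption, and the proof state and premise strings are purely illustrative:

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Assumption: the checkpoint in question is the public ReProver retriever.
MODEL = "kaiyuy/leandojo-lean4-retriever-byt5-small"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = T5EncoderModel.from_pretrained(MODEL)

@torch.no_grad()
def encode(texts: list[str]) -> torch.Tensor:
    """Embed proof states / premises by mean-pooling the ByT5 encoder output."""
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    hidden = model(input_ids=batch.input_ids,
                   attention_mask=batch.attention_mask).last_hidden_state
    mask = batch.attention_mask.unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

state_emb = encode(["n : ℕ\n⊢ Nat.gcd n n = n"])  # illustrative proof state
premise_embs = encode(["theorem Nat.gcd_self (n : ℕ) : Nat.gcd n n = n"])
print(state_emb @ premise_embs.T)  # retrieval score = dot product
```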
That's expected. You should get ~36% only if you evaluate on exactly the same dataset that was used for training the model.
I agree. However, I don't understand why I get ~70% when using a dataset generated from the same commit (…). Any thoughts, @yangky11? Thank you so much for your help so far!
I don't think it's necessarily going to be the same. If you run the current benchmark generation code twice, does it give you exactly the same dataset?
Thank you for your response, @yangky11. It doesn't give the same dataset, but the two datasets are close enough that the two R@10 values are essentially the same.
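A quick way to check that determinism question is to run the notebook twice and diff the resulting splits. A sketch, with hypothetical output directories and the same assumed JSON fields as above:

```python
import json
from pathlib import Path

# Hypothetical output directories from two runs of the generation notebook.
RUN_A, RUN_B = Path("run_a/random"), Path("run_b/random")

def split_keys(root: Path, split: str) -> set[tuple[str, str]]:
    with (root / f"{split}.json").open() as f:
        return {(t["file_path"], t["full_name"]) for t in json.load(f)}

for split in ("train", "val", "test"):
    a, b = split_keys(RUN_A, split), split_keys(RUN_B, split)
    print(f"{split}: |A|={len(a)} |B|={len(b)} identical={a == b} "
          f"differing theorems={len(a ^ b)}")
```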
Description
I believe `scripts/generate-benchmark-lean4.ipynb` is buggy. I evaluated the performance of the ReProver premise retriever on a dataset I generated from Mathlib4. According to the paper, I should get a recall of around 34.7. When I generated the dataset at commit `29dcec074de168ac2bf835a77ef68bbe069194c5`, my recall was absurdly high at around 70, consistent with the other commits I tried with this generation code. However, the downloaded dataset v9 at https://zenodo.org/records/10929138 had a recall of 36, as expected. I am unsure which commit v9 was generated from, but regardless, I believe this recall disparity suggests buggy code.

Detailed Steps to Reproduce the Behavior
Run `scripts/generate-benchmark-lean4.ipynb` on commit `29dcec074de168ac2bf835a77ef68bbe069194c5` to reproduce LeanDojo Benchmark 4 version v9.

Logs
Downloaded dataset:
Average R@1 = 13.057546943693088 %, R@10 = 35.968246863464174 %, MRR = 0.31899176730322715
Generated dataset:
Average R@1 = 28.444059618579143 %, R@10 = 69.65759521602048 %, MRR = 0.5975602059714626
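For clarity on how these numbers are read: R@k is the fraction of ground-truth premises recovered among the top-k retrieved premises, averaged over proof states, and MRR is the mean reciprocal rank of the first correct premise. A minimal sketch of these standard definitions (this is not the repo's exact evaluation script):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of ground-truth premises appearing in the top-k retrieved."""
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant premise (0 if none retrieved)."""
    return next((1.0 / r for r, p in enumerate(retrieved, 1) if p in relevant), 0.0)

# Toy example; the logged numbers average these over all test-set proof states.
print(recall_at_k(["p1", "p3", "p2"], {"p2", "p9"}, k=10))  # 0.5
print(mrr(["p1", "p3", "p2"], {"p2", "p9"}))                # 0.333...
```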
Also, there are some differences between the two datasets. With this code,
…
I get this final output:
…
Platform Information