Model evaluation on minif2f fails? #22

When I evaluate InternLM2-Math-Plus-7b on MiniF2F through this code, it fails. The model only generates the single line "Here is the predicted next tactic:" without any tactic. If I let the model keep generating until it produces a tactic each time, I only get a pass rate of 38.1% instead of 43.4%.
Did you modify any part of our code during evaluation? What are your Lean & Mathlib versions?
There is a tokenizer issue in our code. We will update it as soon as possible.
Thanks for your reply. We have fixed an issue where the script used incorrect prompts for the model/tokenizer (#23); it can now generate tactics properly. As for the pass rate, the search time is CPU-dependent, so you can adjust the limits on the number of iterations and the running time. Our results are available in output.zip; unzip it and you will see our outputs and the hyperparameters in detail.
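(A minimal sketch of how such limits typically interact; `max_iterations` and `max_seconds` are illustrative names, not the actual hyperparameters in the script — check output.zip for the real values used.)

```python
import time

# Hypothetical knobs corresponding to the two limits mentioned above.
max_iterations = 100   # cap on search expansions per theorem
max_seconds = 600      # wall-clock budget; effective depth is CPU-dependent

start = time.monotonic()
for iteration in range(max_iterations):
    if time.monotonic() - start > max_seconds:
        break  # time budget exhausted before the iteration cap
    # ... expand the most promising node of the proof search tree ...
```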
The code can now generate tactics normally, but I get a pass rate of about 36%. I don't think search time has that much impact, since almost all searches that didn't get stuck completed 100 iterations. I also tried doubling the search time, but the performance improvement was <1%. My Lean version is 4.8.0-nightly-2024-04-14.
I have re-evaluated MiniF2F, and it scored 103 out of 244 (≈42%), which is essentially the same as previously reported. Model performance can vary slightly since inference and search are not fully deterministic; for instance, the problem `algebra_sqineq_unitcircatbpabsamblt1` was solved in the re-evaluation but not in the initial one. I have also attached my evaluation script, which is almost identical to the script in the repository. One of the few differences: I noticed vllm changed its default value for the `length_penalty` back in February (vllm-project/vllm#2667), so I explicitly set it back to 0 in case you are using a newer version of vllm. As for the "state_after" fields in output.zip, they are just for prettier printing and make no real difference to the search process. I hope this information is helpful to you.
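(A minimal sketch of pinning that value, assuming a 2024-era vllm whose `SamplingParams` still accepts `use_beam_search`; the model path, beam width, and token limit are illustrative.)

```python
from vllm import LLM, SamplingParams

# Pin length_penalty explicitly so beam-search scoring does not depend on
# which vllm default happens to be installed (see vllm-project/vllm#2667).
sampling_params = SamplingParams(
    n=32,                # illustrative beam width
    temperature=0.0,
    use_beam_search=True,
    length_penalty=0.0,  # the value the evaluation above relies on
    max_tokens=256,
)

llm = LLM(model="internlm/internlm2-math-plus-7b")  # illustrative model path
outputs = llm.generate(["<proof state prompt goes here>"], sampling_params)
```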
I use this proofsearch script and hit an error: `TypeError: __init__() got an unexpected keyword argument 'additional_imports'`. I use lean-dojo==1.1.2 as in requirements.txt. Could you provide the Lean and lean-dojo versions?
Hi, `additional_imports` is a feature in newer versions of lean-dojo, see lean-dojo/LeanDojo@d04c4b4. We are using lean-dojo 1.7.0.
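(For anyone hitting the same `TypeError`, a minimal sketch of the call that requires the newer lean-dojo; the repo, file path, theorem name, and extra import below are illustrative.)

```python
from lean_dojo import LeanGitRepo, Theorem, Dojo

# Illustrative coordinates; substitute your own MiniF2F entry.
repo = LeanGitRepo("https://github.com/rah4927/lean-dojo-mew", "main")
theorem = Theorem(repo, "MiniF2F/Test.lean", "mathd_algebra_478")

# `additional_imports` only exists in newer lean-dojo releases; on 1.1.2
# this keyword raises the TypeError quoted above.
with Dojo(theorem, additional_imports=["Aesop"]) as (dojo, init_state):
    result = dojo.run_tac(init_state, "norm_num")
```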
Please see #28 |
With lean-dojo 1.7.0, I can't build the dataset successfully. Are there any changes in the data files?
The scripts fetch MiniF2F from https://github.com/rah4927/lean-dojo-mew, which uses leanprover/lean4:nightly-2023-08-19. To use newer versions of Lean 4, slight modifications may be needed. We provide a version of MiniF2F that works with leanprover/lean4:v4.7.0 at https://github.com/wzj423/lean-dojo-mew. To use a custom source for the MiniF2F dataset, change the repository URL that the scripts fetch from, as sketched below.
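(A minimal sketch of that swap, assuming the dataset-building script constructs the repo via lean-dojo's `LeanGitRepo`; pin a real commit hash instead of the placeholder branch for reproducibility.)

```python
from lean_dojo import LeanGitRepo

# Point the dataset build at the v4.7.0-compatible MiniF2F fork instead of
# the default rah4927/lean-dojo-mew; "main" is a placeholder revision.
repo = LeanGitRepo("https://github.com/wzj423/lean-dojo-mew", "main")
```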
ExtractData still does not work properly, and then every `Dojo(theorem)` ends with an `EOFError`.
I have rerun the tracing process, and it runs smoothly (logs attached).
I suspect that this is a lean-dojo issue, which is beyond the scope of this one. Still, I personally have some suggestions: lean-dojo's ExtractData.lean is known to occasionally cause OOM issues, see https://github.com/lean-dojo/LeanDojo/issues/64. However, one doesn't need to trace any of the dependencies to use the MiniF2F dataset (your log shows ProofWidgets being traced, which is not needed). One hack is to explicitly tell lean-dojo NOT to trace the dependencies, which means changing

```python
# Copy and `cd` into the repo.
traced_repo_path = get_traced_repo_path(self.repo)
```

to

```python
# Copy and `cd` into the repo.
traced_repo_path = get_traced_repo_path(self.repo, build_deps=False)
```

at https://github.com/lean-dojo/LeanDojo/blob/4d61bdd2f2b6d5ec85050798b3efc2e1234b6960/src/lean_dojo/interaction/dojo.py#L175, which might relieve the memory burden. (It is odd that the lean-dojo authors do not expose this option in the public API.)
It still doesn't work; the logs are attached. These errors also appear with `build_deps=True`.
We cannot reproduce your problem for now. Could you provide more details?
I use this code and only modify `DATA` and `VERBOSE` in eval_internLM2-plus_7b.sh, as shown in the excerpt below. I reproduced this error on various platforms. The full logs are attached.
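(For clarity, the two changed lines as an excerpt; everything else in the script keeps its repository default.)

```bash
# Only these two assignments differ from the repository version of the script.
DATA="data/minif2f-lean4.7.0.jsonl"  # the v4.7.0-compatible dataset file
VERBOSE="1"                          # turn on verbose logging
```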
We are working on reproducing and fixing this problem and will reply to you as soon as possible.
It works properly with the repo https://github.com/yangky11/miniF2F-lean4. Maybe this info will help.
What performance do you reproduce using that repo?