How do we verify the generated proof? #74

Open
agoyang opened this issue Feb 24, 2025 · 1 comment

agoyang commented Feb 24, 2025

Thanks for the excellent work. Could you share how you verify the correctness of the generated proofs in the training set? For QA problems, I understand that we can check the result by exact match, but I do not know how to verify a proof, since LLMs always give a correct final result whether the proof is correct or not.

By the way, I would greatly appreciate it if you could also share how many s1-prob questions were selected for the final s1K training set. I tried to find this in filter.ipynb, but I only found a type called stats_qual.

wy96f commented Feb 24, 2025

> Thanks for the excellent work. Could you share how you verify the correctness of the generated proofs in the training set? For QA problems, I understand that we can check the result by exact match, but I do not know how to verify a proof, since LLMs always give a correct final result whether the proof is correct or not.
>
> By the way, I would greatly appreciate it if you could also share how many s1-prob questions were selected for the final s1K training set. I tried to find this in filter.ipynb, but I only found a type called stats_qual.

def do_grading(response_dir: str = "Qwen_Qwen2_5_32B_Instruct"):

You can refer to "B.3. s1K grading prompt" in the paper.
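
As a rough illustration of what model-based grading of this kind can look like, here is a minimal sketch. It is not the repository's `do_grading` implementation and not the prompt from the paper: the template text, the helper names (`build_grading_prompt`, `parse_verdict`, `grade_attempt`), and the `judge` callable are all hypothetical. The assumed idea is that a judge model sees the problem, a reference solution, and the attempt, and ends its reply with a Yes/No verdict.

```python
from typing import Callable

# Hypothetical prompt template for illustration only.
# This is NOT the grading prompt from the paper's appendix.
GRADING_TEMPLATE = (
    "You are grading an attempt at a problem.\n"
    "Problem:\n{question}\n\n"
    "Reference solution:\n{reference}\n\n"
    "Attempt:\n{attempt}\n\n"
    "Compare the attempt's reasoning and conclusion with the reference. "
    "End your reply with a single word on the last line: Yes or No."
)


def build_grading_prompt(question: str, reference: str, attempt: str) -> str:
    """Fill the (hypothetical) template with one problem/attempt pair."""
    return GRADING_TEMPLATE.format(
        question=question, reference=reference, attempt=attempt
    )


def parse_verdict(judge_reply: str) -> bool:
    """Return True only if the last non-empty line of the reply is 'Yes'."""
    lines = [ln.strip() for ln in judge_reply.splitlines() if ln.strip()]
    if not lines:
        return False  # an empty reply is treated as a failed grading
    return lines[-1].rstrip(".").lower() == "yes"


def grade_attempt(
    question: str,
    reference: str,
    attempt: str,
    judge: Callable[[str], str],
) -> bool:
    """Grade one attempt; `judge` is any function that maps a prompt to a reply.

    In practice `judge` would wrap a call to a grader model; it is left
    abstract here so the sketch does not assume a particular API.
    """
    return parse_verdict(judge(build_grading_prompt(question, reference, attempt)))
```

Passing the judge as a plain callable keeps the sketch independent of any model client, so the parsing logic can be checked with a stub before a real grader is plugged in.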
