

RAP does not significantly overperform CoT #140

Open
rtarikt opened this issue Feb 5, 2025 · 1 comment

Comments

@rtarikt

rtarikt commented Feb 5, 2025

Hi,

Thanks for the work. I have a question about the accuracy of RAP compared to CoT. In my experiments, RAP achieves only slightly higher accuracy than CoT on the GSM8K dataset with DeepSeek-R1-Distill-Qwen-7B. Given the results presented in the paper, and the more advanced structure of RAP compared to CoT, I expected the performance gap to be much larger. I am sharing my parameters and results below. Could you please share all the parameters you used for both RAP and CoT? Additionally, have you experimented with these methods on more recent models?

Note 1: The behavior is similar for other models, such as Qwen2.5-Math-7B-Instruct and Deepthink-Reasoning-7B.

Note 2: With 4-shot prompting, accuracy dropped significantly, mostly due to output-format issues. Therefore, all 10 few-shot samples were used for the results below.

Thanks in advance.

Table 1. Results.

| Method | Accuracy |
| --- | --- |
| CoT | 0.755 |
| RAP | 0.769 |

Table 2. The parameter setting.

| Parameter | Value |
| --- | --- |
| n_shots | 10 |
| n_action | 2 |
| n_confidence | 1 |
| n_iters | 1 |
| depth_limit | 10 |
| temperature | 0.7 |
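For reproducibility, here is the same setting as a plain Python dict (the name `search_config` and the comments are just my annotations, not the library's actual configuration object):

```python
# Illustrative only: parameter names mirror Table 2 above; this is not
# the actual llm-reasoners configuration API.
search_config = {
    "n_shots": 10,       # in-context examples in the prompt
    "n_action": 2,       # candidate actions expanded per node
    "n_confidence": 1,   # samples used to estimate answer confidence
    "n_iters": 1,        # search iterations
    "depth_limit": 10,   # maximum search depth
    "temperature": 0.7,  # sampling temperature
}
```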

System Info

Operating System = Linux
Python version = 3.10
Hardware = A40

@Ber666 (Collaborator)

Ber666 commented Feb 12, 2025

Hi, thanks for your question!

The prompt and tree search formulation in RAP were primarily designed for base models rather than specialized math reasoning models like DeepSeek-R1-Distill-Qwen-7B, Qwen2.5-Math-7B-Instruct, or DeepThink-Reasoning-7B. These models have been fine-tuned to perform CoT-style reasoning, which can naturally reduce the performance gap between RAP and CoT in your experiments.

Additionally, in the long-CoT (r1/o1) paradigm, the model already exhibits certain search capabilities by generating token sequences, which serves as an alternative to explicit tree search. We're actively expanding our library to support long CoT-related work. It would be interesting to further explore the pros and cons of these two different paradigms.
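To make the contrast concrete in toy form: explicit tree search branches externally and prunes by a score, while CoT commits to one sampled step at a time. The `propose` and `score` functions below are hypothetical stand-ins for an LLM's action proposal and reward estimation, not the actual llm-reasoners code.

```python
# Toy contrast between single-path CoT decoding and RAP-style tree search.
# `propose` and `score` are hypothetical stand-ins, not a real LLM API.
import random

def propose(state, n_action):
    # Stand-in for sampling n_action candidate next reasoning steps.
    return [state + [random.random()] for _ in range(n_action)]

def score(state):
    # Stand-in for a reward model; here, just the sum of step values.
    return sum(state)

def cot(depth_limit):
    # CoT: commit to one sampled step at a time, no branching.
    state = []
    for _ in range(depth_limit):
        state = propose(state, n_action=1)[0]
    return state

def tree_search(depth_limit, n_action):
    # Greedy expansion: propose n_action candidates per depth, keep the best.
    # Note that with n_action=2 and n_iters=1 (as in Table 2), the search
    # explores only slightly more than a single CoT rollout, which may also
    # help explain a small gap.
    state = []
    for _ in range(depth_limit):
        state = max(propose(state, n_action), key=score)
    return state
```

A long-CoT model blurs this distinction by doing something search-like (proposing, evaluating, backtracking) inside one token sequence.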
