Thanks for the work. I have a question about the accuracy of RAP compared to CoT. In my experiments, RAP achieves only slightly higher accuracy than CoT on the GSM8K dataset with DeepSeek-R1-Distill-Qwen-7B. Given the results presented in the paper and the more advanced structure of RAP compared to CoT, I expected the performance gap to be much larger. I am sharing my parameters and results below. Could you please share all the parameters you used for both RAP and CoT? Additionally, have you experimented with these methods on more recent models?
Note 1: I observed similar behavior with other models, such as Qwen2.5-Math-7B-Instruct and Deepthink-Reasoning-7B.
Note 2: With 4-shot prompting, accuracy dropped significantly, mostly due to output-format issues, so all 10 in-context examples were used for the results below.
Thanks in advance.
Table 1. Results.

| Method | Accuracy |
|--------|----------|
| CoT    | 0.755    |
| RAP    | 0.769    |
Table 2. Parameter settings.

| Parameter    | Value |
|--------------|-------|
| n_shots      | 10    |
| n_action     | 2     |
| n_confidence | 1     |
| n_iters      | 1     |
| depth_limit  | 10    |
| temperature  | 0.7   |
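For reference, these map onto my run roughly as follows. This is a minimal sketch using a plain config dict; `run_rap` is a hypothetical placeholder, not the library's actual entry point:

```python
# Minimal sketch of the parameter setting above. run_rap is a placeholder
# name for illustration only, not the actual llm-reasoners entry point.
rap_config = dict(
    n_shots=10,        # in-context examples per query (4-shot caused format errors)
    n_action=2,        # candidate actions expanded per tree node
    n_confidence=1,    # answer samples per state for the confidence estimate
    n_iters=1,         # MCTS iterations per question
    depth_limit=10,    # maximum depth of the reasoning tree
    temperature=0.7,   # sampling temperature
)

# accuracy = run_rap(dataset="gsm8k",
#                    model="DeepSeek-R1-Distill-Qwen-7B",
#                    **rap_config)
```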
System Info
Operating System = Linux
Python version = 3.10
Hardware = A40
The prompt and tree search formulation in RAP were primarily designed for base models rather than specialized math reasoning models like DeepSeek-R1-Distill-Qwen-7B, Qwen2.5-Math-7B-Instruct, or DeepThink-Reasoning-7B. These models have been fine-tuned to perform CoT-style reasoning, which can naturally reduce the performance gap between RAP and CoT in your experiments.
Additionally, in the long-CoT (r1/o1) paradigm, the model already exhibits certain search capabilities by generating token sequences, which serves as an alternative to explicit tree search. We're actively expanding our library to support long CoT-related work. It would be interesting to further explore the pros and cons of these two different paradigms.
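To make the contrast concrete, here is a minimal sketch of the two paradigms. Everything in it, including `generate` and `score`, is an illustrative stand-in, not code from this repository:

```python
# Illustrative contrast between explicit tree search (RAP-style) and a single
# long-CoT rollout (r1/o1-style). generate and score are stand-in callables,
# not functions from this repository.
from dataclasses import dataclass, field

@dataclass
class Node:
    state: str                        # reasoning trace accumulated so far
    value: float = 0.0                # estimated quality of this state
    children: list = field(default_factory=list)

def explicit_tree_search(question, generate, score, n_action=2, depth_limit=10):
    """Branch over candidate next steps at every depth and keep the best path."""
    node = Node(state=question)
    for _ in range(depth_limit):
        for _ in range(n_action):                 # expand n_action candidates
            step = generate(node.state)           # propose one reasoning step
            node.children.append(
                Node(state=node.state + "\n" + step,
                     value=score(node.state, step)))
        node = max(node.children, key=lambda c: c.value)  # greedy descent
    return node.state

def long_cot(question, generate):
    """One long rollout: exploration and backtracking happen inside the
    generated token sequence itself, with no external search loop."""
    return generate(question)
```

In the first paradigm the search loop lives outside the model; in the second, a long-CoT model internalizes it, which is one reason a distilled reasoning model can narrow the gap between RAP and CoT.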