

RAP does not significantly overperform CoT #140

Open
rtarikt opened this issue Feb 5, 2025 · 1 comment

Comments

@rtarikt

rtarikt commented Feb 5, 2025

Hi,

Thanks for the work. I have a question about the accuracy of RAP compared to CoT. In my experiments, RAP achieves only slightly higher accuracy than CoT on the GSM8K dataset with DeepSeek-R1-Distill-Qwen-7B. Given the results presented in the paper, and the more advanced structure of RAP compared to CoT, I expected the performance gap to be much larger. I am sharing my parameters and results below. Could you please share all the parameters you used for both RAP and CoT? Additionally, have you experimented with these methods on more recent models?

Note 1: The behavior is similar for other models, such as Qwen2.5-Math-7B-Instruct and Deepthink-Reasoning-7B.

Note 2: With 4-shot prompting, accuracy dropped significantly, mostly due to output-format issues. Therefore, all 10 few-shot samples were used for the results below.

Thanks in advance.

Table 1. Results.

| Method | Accuracy |
| --- | --- |
| CoT | 0.755 |
| RAP | 0.769 |

Table 2. The parameter setting.

| Parameter | Value |
| --- | --- |
| n_shots | 10 |
| n_action | 2 |
| n_confidence | 1 |
| n_iters | 1 |
| depth_limit | 10 |
| temperature | 0.7 |
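For reproducibility, here is the same setting as a plain Python dict (the name `search_config` and the comments are just my annotations, not the library's actual configuration object):

```python
# Illustrative only: parameter names mirror Table 2 above; this is not
# the actual llm-reasoners configuration API.
search_config = {
    "n_shots": 10,       # in-context examples in the prompt
    "n_action": 2,       # candidate actions expanded per node
    "n_confidence": 1,   # samples used to estimate answer confidence
    "n_iters": 1,        # search iterations
    "depth_limit": 10,   # maximum search depth
    "temperature": 0.7,  # sampling temperature
}
```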

System Info

Operating System = Linux
Python version = 3.10
Hardware = A40

@Ber666 (Collaborator)

Ber666 commented Feb 12, 2025

Hi, thanks for your question!

The prompt and tree search formulation in RAP were primarily designed for base models rather than specialized math reasoning models like DeepSeek-R1-Distill-Qwen-7B, Qwen2.5-Math-7B-Instruct, or DeepThink-Reasoning-7B. These models have been fine-tuned to perform CoT-style reasoning, which can naturally reduce the performance gap between RAP and CoT in your experiments.

Additionally, in the long-CoT (r1/o1) paradigm, the model already exhibits certain search capabilities by generating token sequences, which serves as an alternative to explicit tree search. We're actively expanding our library to support long CoT-related work. It would be interesting to further explore the pros and cons of these two different paradigms.
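To make the contrast concrete in toy form: explicit tree search branches externally and prunes by a score, while CoT commits to one sampled step at a time. The `propose` and `score` functions below are hypothetical stand-ins for an LLM's action proposal and reward estimation, not the actual llm-reasoners code.

```python
# Toy contrast between single-path CoT decoding and RAP-style tree search.
# `propose` and `score` are hypothetical stand-ins, not a real LLM API.
import random

def propose(state, n_action):
    # Stand-in for sampling n_action candidate next reasoning steps.
    return [state + [random.random()] for _ in range(n_action)]

def score(state):
    # Stand-in for a reward model; here, just the sum of step values.
    return sum(state)

def cot(depth_limit):
    # CoT: commit to one sampled step at a time, no branching.
    state = []
    for _ in range(depth_limit):
        state = propose(state, n_action=1)[0]
    return state

def tree_search(depth_limit, n_action):
    # Greedy expansion: propose n_action candidates per depth, keep the best.
    # Note that with n_action=2 and n_iters=1 (as in Table 2), the search
    # explores only slightly more than a single CoT rollout, which may also
    # help explain a small gap.
    state = []
    for _ in range(depth_limit):
        state = max(propose(state, n_action), key=score)
    return state
```

A long-CoT model blurs this distinction by doing something search-like (proposing, evaluating, backtracking) inside one token sequence.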
