Performance on Qwen2.5-7B-Instruct #42
Hmm, I didn't try other sizes, but I'd recommend retrying with our latest version: https://huggingface.co/datasets/simplescaling/s1K-1.1 ; it performs much better!
@Muennighoff Thanks for the quick response! I'll try whether s1K-1.1 can make the 7B model work better.
I've tried s1K-deepseek-R1-tokenized (is it the current s1K-1.1 now?) on Qwen2.5-7B-Instruct:
Nice!! Yes, I renamed it to s1K-1.1.
What do you mean by this?
Here is an example:
Hmm, I see; you can try to resolve this by setting a temperature greater than 0, but it's not guaranteed to help. Also, when setting …
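For anyone hitting the same repetition issue, here is a minimal sketch (not the repo's eval code; the checkpoint name and sampling values are placeholders) of switching from greedy decoding to temperature sampling with Hugging Face `transformers`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute your own fine-tuned model path.
model_name = "Qwen/Qwen2.5-7B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

inputs = tok("Solve: what is 12 * 34?", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    do_sample=True,      # greedy decoding (effectively temperature 0) can loop
    temperature=0.7,     # any value > 0; 0.7 is only an example
    top_p=0.95,
    max_new_tokens=512,
)
print(tok.decode(out[0], skip_special_tokens=True))
```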
@lllyyyqqq Could you share your script for training the 7B model on the s1K-1.1 dataset? When I attempt to perform SFT, I encounter an OOM error even with a micro batch size of 1. Upon further investigation, I found that the context length in s1K-1.1 is nearly twice as long on average as in s1K, which likely contributes to the issue. Besides, I think the training script does not utilize model parallelism, which may lead to additional memory consumption.

s1K context length: (histogram not preserved)

s1K-1.1 context length: (histogram not preserved)
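To reproduce the context-length comparison, a rough sketch along these lines can be used (the text column names below are assumptions, not verified against the actual datasets; inspect `ds.column_names` for the real ones):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

def avg_token_len(repo: str, field: str) -> float:
    """Average tokenized length of one text column in a dataset."""
    ds = load_dataset(repo, split="train")
    lens = [len(tok(str(ex[field])).input_ids) for ex in ds]
    return sum(lens) / len(lens)

# Column names are guesses; check ds.column_names before running.
print("s1K    :", avg_token_len("simplescaling/s1K", "thinking_trajectories"))
print("s1K-1.1:", avg_token_len("simplescaling/s1K-1.1", "deepseek_thinking_trajectory"))
```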
I've also encountered an OOM error trying to run the 7B model. In my case I was trying to fit it on 2x A100 GPUs. A micro batch size of 1 with increased gradient accumulation didn't help. What helped was reducing the block size to …
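As an illustration only (flag names and values here are assumptions, not the poster's actual command), that adjustment might look like:

```bash
# Hypothetical: cap the packed sequence length so activations fit on 2x A100.
torchrun --nproc-per-node 2 train/sft.py \
    --model_name="Qwen/Qwen2.5-7B-Instruct" \
    --per_device_train_batch_size=1 \
    --block_size=16384    # reduced from the repo default; value is an example
```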
I've also encountered the OOM problem using the original script. Instead, I use DeepSpeed ZeRO-3 with optimizer offloading and gradient checkpointing, micro batch size 1. With that, I can successfully fine-tune on 8x A100 with s1K-deepseek-R1-tokenized in 2.5 hours.

deepspeed_zero3.yaml: (attachment not preserved)

sft.sh: `ACCELERATE_LOG_LEVEL=info accelerate launch --config_file deepspeed_zero3.yaml …`
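Since the attachments did not survive, here is a hedged reconstruction of what such an `accelerate` ZeRO-3 config and launch line typically look like; all values, paths, and flags below are assumptions, not the poster's actual files:

```yaml
# deepspeed_zero3.yaml (hypothetical reconstruction): ZeRO stage 3 with
# optimizer offloading to CPU, bf16, 8 processes on one machine.
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 3
  offload_optimizer_device: cpu   # optimizer offloading, as described above
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
mixed_precision: bf16
num_machines: 1
num_processes: 8
machine_rank: 0
main_training_function: main
use_cpu: false
```

```bash
# sft.sh (hypothetical reconstruction): script path and flags are assumptions.
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file deepspeed_zero3.yaml \
    train/sft.py \
    --model_name="Qwen/Qwen2.5-7B-Instruct" \
    --train_file_path="simplescaling/s1K-deepseek-R1-tokenized" \
    --per_device_train_batch_size=1 \
    --gradient_checkpointing=True
```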
Success!
Thank you for the excellent work!
I trained the Qwen2.5-7B-Instruct model using the provided training script on 4 H100 GPUs. To prevent out-of-memory errors, I set the mini_batch_size to 2 while keeping all other parameters at their default values. Below is my training script:
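(The script itself was not preserved here; the following is a sketch of what such a launch plausibly looks like, with flag names modeled on the repo's sft.sh but not verified, so treat every value as an assumption:)

```bash
# Hypothetical reconstruction: 4x H100, micro batch size 2, other values default.
torchrun --nproc-per-node 4 train/sft.py \
    --model_name="Qwen/Qwen2.5-7B-Instruct" \
    --train_file_path="simplescaling/s1K_tokenized" \
    --per_device_train_batch_size=2 \
    --num_train_epochs=5 \
    --bf16=True
```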
Here is my training loss curve, which closely resembles the one reported in the paper.

However, performance on the evaluation benchmarks has not improved significantly. Specifically, the original Qwen-2.5-7B-Instruct model achieves 16.67% on AIME 2024, 33.84% on GPQA Diamond, and 77% on MATH500. After fine-tuning, the results are 16.67% on AIME 2024, 37.37% on GPQA Diamond, and 75.2% on MATH500: AIME 2024 is unchanged, GPQA Diamond improves modestly, and MATH500 slightly regresses.
I'm wondering whether s1K is specifically designed for the Qwen2.5-32B-Instruct model or whether it can generalize to models of different sizes. Thank you!
Initial Qwen-2.5-7B-Instruct results and results after fine-tuning:

| Benchmark | Qwen2.5-7B-Instruct (original) | After fine-tuning on s1K |
| --- | --- | --- |
| AIME 2024 | 16.67% | 16.67% |
| GPQA Diamond | 33.84% | 37.37% |
| MATH500 | 77.0% | 75.2% |