
Performance on Qwen2.5-7B-Instruct #42

Open
lichangh20 opened this issue Feb 11, 2025 · 11 comments

Comments

@lichangh20

Thank you for the excellent work!

I trained the Qwen2.5-7B-Instruct model using the provided training script on 4 H100 GPUs. To prevent out-of-memory errors, I set micro_batch_size to 2 while keeping all other parameters at their default values. Below is my training script:

uid="$(date +%Y%m%d_%H%M%S)"
base_model="Qwen/Qwen2.5-7B-Instruct"
lr=1e-5
min_lr=0
epochs=5
weight_decay=1e-4 # -> the same training pipe as slurm_training
micro_batch_size=2 # -> batch_size will be 16 if 8 gpus
push_to_hub=false
gradient_accumulation_steps=1
max_steps=-1
gpu_count=$(nvidia-smi -L | wc -l)

torchrun --nproc-per-node ${gpu_count} --master_port 12345 \
train/sft.py \
--per_device_train_batch_size=${micro_batch_size} \
--per_device_eval_batch_size=${micro_batch_size} \
--gradient_accumulation_steps=${gradient_accumulation_steps} \
--num_train_epochs=${epochs} \
--max_steps=${max_steps} \
--train_file_path="simplescaling/s1K_tokenized" \
--model_name=${base_model} \
--warmup_ratio=0.05 \
--fsdp="full_shard auto_wrap" \
--fsdp_config="train/fsdp_config_qwen.json" \
--bf16=True \
--eval_strategy="no" \
--eval_steps=50 \
--logging_steps=1 \
--save_strategy="no" \
--lr_scheduler_type="cosine" \
--learning_rate=${lr} \
--weight_decay=${weight_decay} \
--adam_beta1=0.9 \
--adam_beta2=0.95 \
--output_dir="ckpts/s1_${uid}" \
--hub_model_id="simplescaling7b/s1-${uid}" \
--push_to_hub=${push_to_hub} \
--save_only_model=True \
--gradient_checkpointing=True

Here is my training loss curve, which closely resembles the one reported in the paper.
[Image: training loss curve]

However, performance on the evaluation benchmarks improved only marginally. Specifically, the original Qwen2.5-7B-Instruct model achieves 16.67% on AIME 2024, 33.84% on GPQA Diamond, and 77% on MATH500; after fine-tuning, the results are 16.67% on AIME 2024, 37.37% on GPQA Diamond, and 75.2% on MATH500.

I'm wondering whether s1K is specifically designed for the Qwen2.5-32B-Instruct model or whether it can generalize to models of different sizes. Thank you!

Initial Qwen-2.5-7B-Instruct results:

"results": {
    "aime24_nofigures": {
      "alias": "aime24_nofigures",
      "exact_match,none": 0.16666666666666666,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    },
    "gpqa_diamond_openai": {
      "alias": "gpqa_diamond_openai",
      "exact_match,none": 0.3383838383838384,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    },
    "openai_math": {
      "alias": "openai_math",
      "exact_match,none": 0.77,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    }
  },

Results after model fine-tuning:

"results": {
    "aime24_nofigures": {
      "alias": "aime24_nofigures",
      "exact_match,none": 0.16666666666666666,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    },
    "gpqa_diamond_openai": {
      "alias": "gpqa_diamond_openai",
      "exact_match,none": 0.37373737373737376,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    },
    "openai_math": {
      "alias": "openai_math",
      "exact_match,none": 0.752,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    }
  },
@Muennighoff
Contributor

Hmm, I didn't try other sizes, but I'd recommend retrying with our latest version: https://huggingface.co/datasets/simplescaling/s1K-1.1. It performs much better!

@lichangh20
Author

@Muennighoff Thanks for the quick response! I'll try s1K-1.1 and see if it makes the 7B model work better.

@lllyyyqqq

lllyyyqqq commented Feb 11, 2025

I've tried s1K-deepseek-R1-tokenized (is it the current s1K-1.1 now?) on Qwen2.5-7B-Instruct:
{
  "results": {
    "aime24_figures": {
      "alias": "aime24_figures",
      "exact_match,none": 0.2,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    },
    "aime24_nofigures": {
      "alias": "aime24_nofigures",
      "exact_match,none": 0.16666666666666666,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    },
    "gpqa_diamond_openai": {
      "alias": "gpqa_diamond_openai",
      "exact_match,none": 0.4090909090909091,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    },
    "openai_math": {
      "alias": "openai_math",
      "exact_match,none": 0.83,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    }
  }
}
The results are better; however, I found that some answers are unreadable due to repeated words.

@Muennighoff
Copy link
Contributor

Nice!! Yes, I renamed it to s1K-1.1.

@Muennighoff
Copy link
Contributor

I found that some answers are unreadable due to repeated words.

What do you mean by this?

@lllyyyqqq
Copy link

here is an example:
So that's the answer? That seems straightforward. Let me verify.\n\nIf xy = 25, then substituting back into equations 1 and 2:\n\nEquation 1: x^{10} = y^x = (25/x)^x. So, x^{10} = (25^x)/x^x. Multiply both sides by x^x: x^{10} x^x = 25^x => x^{10 + x} = 25^x. Then, divide both sides by x^10: x^x = 25^{x}/x^{10} = (25/x)^10. But since y = 25/x, then (25/x)^10 = y^{10}, which is consistent with equation 1. So equation 1 holds.\n\nSimilarly, equation 2: y^{10} = x^{4y} = x^{4*(25/x)} = x^{100/x}. Then, since x^{100/x} = (x^{10})^{10/x} = (x^{10})^{10/x} = (x^{10})^{10/x} = (x^{10})^{10/x} ... [the phrase "(x^{10})^{10/x}" repeats indefinitely until the generation is cut off]

@Muennighoff
Copy link
Contributor

Hmm, I see. You can try to resolve this by setting a temperature greater than 0, but it's not guaranteed to help.

Also, when setting max_tokens_thinking=auto in the evaluation, it will still force the model to produce an answer even if it is caught in an infinite loop.
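For reference, a minimal sketch of sampling settings that can reduce degenerate repetition. The evaluation harness's exact flags may differ; these are standard Hugging Face transformers generate() keyword arguments, shown as a plain dict, and the specific values (0.7, 0.95, 1.05) are illustrative assumptions, not the repo's settings:

```python
# Sketch: sampling kwargs that can break repetition loops.
# Greedy decoding (temperature 0) is what tends to get stuck.
sampling_kwargs = {
    "do_sample": True,           # enable stochastic sampling
    "temperature": 0.7,          # any value > 0, per the suggestion above
    "top_p": 0.95,               # nucleus sampling cutoff
    "repetition_penalty": 1.05,  # optional extra guard against loops
}

# Typical use (model/tokenizer loading omitted):
# outputs = model.generate(**inputs, max_new_tokens=8192, **sampling_kwargs)
```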

@lichangh20
Author

lichangh20 commented Feb 11, 2025

@lllyyyqqq Could you share your script for training the 7B model on the s1K-1.1 dataset? When I attempt SFT, I encounter an OOM error even with a micro batch size of 1. Upon further investigation, I found that the average context length in s1K-1.1 is nearly twice that of s1K, which likely contributes to the issue.

Also, I think the training script does not use model parallelism, which may lead to additional memory consumption.

s1K context length:

[Image: s1K context-length distribution]

s1K-1.1 context length:

[Image: s1K-1.1 context-length distribution]

@karsar

karsar commented Feb 13, 2025

I've also encountered an OOM error trying to run the 7B model. In my case I was trying to fit it on 2x A100 GPUs. Using a micro batch size of 1 and increasing gradient accumulation didn't help. What helped was reducing the block size to block_size: int = field(default=4096) in train/sft.py. I didn't try to find the maximum block size that still fits in memory; I just reduced it liberally. After the 1.5 hours it took to finish, I got 13.33% -> 16.67% on aime24 and 76.2% -> 76.6% on openai_math (with no "Wait" forcing at inference). I didn't try gpqa_diamond_openai...
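The change described above is a one-line edit to the training-config dataclass in train/sft.py. A minimal sketch of what that field looks like (the class name and surrounding fields are assumptions; only the block_size line matters):

```python
from dataclasses import dataclass, field

@dataclass
class TrainingConfig:
    # ...other training fields omitted...
    # Lowering block_size truncates long training sequences but keeps
    # the 7B model's activations within GPU memory during SFT.
    block_size: int = field(default=4096)

cfg = TrainingConfig()
print(cfg.block_size)  # -> 4096
```

The trade-off: s1K-1.1 reasoning traces can exceed 4096 tokens, so a smaller block size trains on truncated traces; it trades some data fidelity for fitting in memory.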

@lllyyyqqq

lllyyyqqq commented Feb 13, 2025

I've also encountered the OOM problem using the original script. Instead, I used DeepSpeed ZeRO-3 with optimizer offloading, gradient checkpointing, and a micro batch size of 1. I could successfully fine-tune on 8x A100 with s1K-deepseek-R1-tokenized in 2.5 hours.

deepspeed_zero3.yaml:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: cpu
  offload_param_device: none
  zero3_init_flag: false
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

sft.sh:
uid="$(date +%Y%m%d_%H%M%S)"
base_model="Qwen2.5-7B-Instruct"
lr=1e-5
min_lr=0
epochs=5
micro_batch_size=1 # -> batch_size will be 16 if 8 gpus
push_to_hub=false
gradient_accumulation_steps=1
max_steps=-1
gpu_count=$(nvidia-smi -L | wc -l)

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file deepspeed_zero3.yaml \
train/sft.py \
--per_device_train_batch_size=${micro_batch_size} \
--per_device_eval_batch_size=${micro_batch_size} \
--gradient_accumulation_steps=${gradient_accumulation_steps} \
--num_train_epochs=${epochs} \
--max_steps=${max_steps} \
--train_file_path="s1K-deepseek-R1-tokenized" \
--model_name=${base_model} \
--warmup_ratio=0.05 \
--bf16=True \
--eval_strategy="steps" \
--eval_steps=50 \
--logging_steps=1 \
--lr_scheduler_type="cosine" \
--learning_rate=${lr} \
--weight_decay=1e-4 \
--adam_beta1=0.9 \
--adam_beta2=0.95 \
--output_dir="ckpts/s1_${uid}" \
--save_only_model=True \
--gradient_checkpointing=True \
--save_strategy=no \
--dataset_text_field="text"

@HaitaoWuTJU

success!
