
Performance on Qwen2.5-7B-Instruct #42

Open
lichangh20 opened this issue Feb 11, 2025 · 11 comments

Comments

@lichangh20

Thank you for the excellent work!

I trained the Qwen2.5-7B-Instruct model using the provided training script on 4 H100 GPUs. To prevent out-of-memory errors, I set micro_batch_size to 2 while keeping all other parameters at their default values. Below is my training script:

uid="$(date +%Y%m%d_%H%M%S)"
base_model="Qwen/Qwen2.5-7B-Instruct"
lr=1e-5
min_lr=0
epochs=5
weight_decay=1e-4 # -> the same training pipe as slurm_training
micro_batch_size=2 # -> batch_size will be 16 if 8 gpus
push_to_hub=false
gradient_accumulation_steps=1
max_steps=-1
gpu_count=$(nvidia-smi -L | wc -l)

torchrun --nproc-per-node ${gpu_count} --master_port 12345 \
train/sft.py \
--per_device_train_batch_size=${micro_batch_size} \
--per_device_eval_batch_size=${micro_batch_size} \
--gradient_accumulation_steps=${gradient_accumulation_steps} \
--num_train_epochs=${epochs} \
--max_steps=${max_steps} \
--train_file_path="simplescaling/s1K_tokenized" \
--model_name=${base_model} \
--warmup_ratio=0.05 \
--fsdp="full_shard auto_wrap" \
--fsdp_config="train/fsdp_config_qwen.json" \
--bf16=True \
--eval_strategy="no" \
--eval_steps=50 \
--logging_steps=1 \
--save_strategy="no" \
--lr_scheduler_type="cosine" \
--learning_rate=${lr} \
--weight_decay=${weight_decay} \
--adam_beta1=0.9 \
--adam_beta2=0.95 \
--output_dir="ckpts/s1_${uid}" \
--hub_model_id="simplescaling7b/s1-${uid}" \
--push_to_hub=${push_to_hub} \
--save_only_model=True \
--gradient_checkpointing=True

Here is my training loss curve, which closely resembles the one reported in the paper.
[Image: training loss curve]

However, performance on the evaluation benchmarks improved only marginally. Specifically, the original Qwen2.5-7B-Instruct model achieves 16.67% on AIME 2024, 33.84% on GPQA Diamond, and 77% on MATH500; after fine-tuning, the results are 16.67% on AIME 2024, 37.37% on GPQA Diamond, and 75.2% on MATH500.

I'm wondering whether s1K is specifically designed for the Qwen2.5-32B-Instruct model or whether it can generalize to models of different sizes. Thank you!

Initial Qwen-2.5-7B-Instruct results:

"results": {
    "aime24_nofigures": {
      "alias": "aime24_nofigures",
      "exact_match,none": 0.16666666666666666,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    },
    "gpqa_diamond_openai": {
      "alias": "gpqa_diamond_openai",
      "exact_match,none": 0.3383838383838384,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    },
    "openai_math": {
      "alias": "openai_math",
      "exact_match,none": 0.77,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    }
  },

Results after model fine-tuning:

"results": {
    "aime24_nofigures": {
      "alias": "aime24_nofigures",
      "exact_match,none": 0.16666666666666666,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    },
    "gpqa_diamond_openai": {
      "alias": "gpqa_diamond_openai",
      "exact_match,none": 0.37373737373737376,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    },
    "openai_math": {
      "alias": "openai_math",
      "exact_match,none": 0.752,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    }
  },
@Muennighoff
Contributor

Hmm, I didn't try other sizes, but I'd recommend retrying with our latest version: https://huggingface.co/datasets/simplescaling/s1K-1.1. It performs much better!

@lichangh20
Author

@Muennighoff Thanks for the quick response! I'll try s1K-1.1 and see if it makes the 7B model work better.

@lllyyyqqq

lllyyyqqq commented Feb 11, 2025

I've tried s1K-deepseek-R1-tokenized (is it the current s1K-1.1 now?) on Qwen2.5-7B-Instruct:
{
  "results": {
    "aime24_figures": {
      "alias": "aime24_figures",
      "exact_match,none": 0.2,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    },
    "aime24_nofigures": {
      "alias": "aime24_nofigures",
      "exact_match,none": 0.16666666666666666,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    },
    "gpqa_diamond_openai": {
      "alias": "gpqa_diamond_openai",
      "exact_match,none": 0.4090909090909091,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    },
    "openai_math": {
      "alias": "openai_math",
      "exact_match,none": 0.83,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    }
  }
}
The results are better; however, I found that some answers are unreadable due to repeated words.

@Muennighoff
Copy link
Contributor

Nice!! Yes, I renamed it to s1K-1.1.

@Muennighoff
Copy link
Contributor

I found that some answers are unreadable due to repeated words.

What do you mean by this?

@lllyyyqqq
Copy link

here is an example:
So that's the answer? That seems straightforward. Let me verify.\n\nIf xy = 25, then substituting back into equations 1 and 2:\n\nEquation 1: x^{10} = y^x = (25/x)^x. So, x^{10} = (25^x)/x^x. Multiply both sides by x^x: x^{10} x^x = 25^x => x^{10 + x} = 25^x. Then, divide both sides by x^10: x^x = 25^{x}/x^{10} = (25/x)^10. But since y = 25/x, then (25/x)^10 = y^{10}, which is consistent with equation 1. So equation 1 holds.\n\nSimilarly, equation 2: y^{10} = x^{4y} = x^{4*(25/x)} = x^{100/x}. Then, since x^{100/x} = (x^{10})^{10/x} = (x^{10})^{10/x} = (x^{10})^{10/x} = (x^{10})^{10/x} ... [the phrase "(x^{10})^{10/x}" repeats indefinitely until the generation is cut off]

@Muennighoff
Copy link
Contributor

Hmm, I see. You can try to resolve this by setting a temperature greater than 0, but it's not guaranteed to help.

Also, when setting max_tokens_thinking=auto in the evaluation, it will still force the model to produce an answer even if it is caught in an infinite loop.
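For reference, a minimal sketch of sampling settings that can reduce degenerate repetition. The evaluation harness's exact flags may differ; these are standard Hugging Face transformers generate() keyword arguments, shown as a plain dict, and the specific values (0.7, 0.95, 1.05) are illustrative assumptions, not the repo's settings:

```python
# Sketch: sampling kwargs that can break repetition loops.
# Greedy decoding (temperature 0) is what tends to get stuck.
sampling_kwargs = {
    "do_sample": True,           # enable stochastic sampling
    "temperature": 0.7,          # any value > 0, per the suggestion above
    "top_p": 0.95,               # nucleus sampling cutoff
    "repetition_penalty": 1.05,  # optional extra guard against loops
}

# Typical use (model/tokenizer loading omitted):
# outputs = model.generate(**inputs, max_new_tokens=8192, **sampling_kwargs)
```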

@lichangh20
Author

lichangh20 commented Feb 11, 2025

@lllyyyqqq Could you share your script for training the 7B model on the s1K-1.1 dataset? When I attempt SFT, I encounter an OOM error even with a micro batch size of 1. Upon further investigation, I found that the average context length in s1K-1.1 is nearly twice that of s1K, which likely contributes to the issue.

Also, I think the training script does not use model parallelism, which may lead to additional memory consumption.

s1K context length:

[Image: s1K context-length distribution]

s1K-1.1 context length:

[Image: s1K-1.1 context-length distribution]

@karsar

karsar commented Feb 13, 2025

I've also encountered an OOM error trying to run the 7B model. In my case I was trying to fit it on 2x A100 GPUs. Using a micro batch size of 1 and increasing gradient accumulation didn't help. What helped was reducing the block size to block_size: int = field(default=4096) in train/sft.py. I didn't try to find the maximum block size that still fits in memory; I just reduced it liberally. After the 1.5 hours it took to finish, I got 13.33% -> 16.67% on aime24 and 76.2% -> 76.6% on openai_math (with no "Wait" forcing at inference). I didn't try gpqa_diamond_openai...
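The change described above is a one-line edit to the training-config dataclass in train/sft.py. A minimal sketch of what that field looks like (the class name and surrounding fields are assumptions; only the block_size line matters):

```python
from dataclasses import dataclass, field

@dataclass
class TrainingConfig:
    # ...other training fields omitted...
    # Lowering block_size truncates long training sequences but keeps
    # the 7B model's activations within GPU memory during SFT.
    block_size: int = field(default=4096)

cfg = TrainingConfig()
print(cfg.block_size)  # -> 4096
```

The trade-off: s1K-1.1 reasoning traces can exceed 4096 tokens, so a smaller block size trains on truncated traces; it trades some data fidelity for fitting in memory.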

@lllyyyqqq

lllyyyqqq commented Feb 13, 2025

I've also encountered the OOM problem using the original script. Instead, I used DeepSpeed ZeRO-3 with optimizer offloading, gradient checkpointing, and a micro batch size of 1. I could successfully fine-tune on 8x A100 with s1K-deepseek-R1-tokenized in 2.5 hours.

deepspeed_zero3.yaml:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: cpu
  offload_param_device: none
  zero3_init_flag: false
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

sft.sh:
uid="$(date +%Y%m%d_%H%M%S)"
base_model="Qwen2.5-7B-Instruct"
lr=1e-5
min_lr=0
epochs=5
micro_batch_size=1 # -> batch_size will be 16 if 8 gpus
push_to_hub=false
gradient_accumulation_steps=1
max_steps=-1
gpu_count=$(nvidia-smi -L | wc -l)

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file deepspeed_zero3.yaml \
train/sft.py \
--per_device_train_batch_size=${micro_batch_size} \
--per_device_eval_batch_size=${micro_batch_size} \
--gradient_accumulation_steps=${gradient_accumulation_steps} \
--num_train_epochs=${epochs} \
--max_steps=${max_steps} \
--train_file_path="s1K-deepseek-R1-tokenized" \
--model_name=${base_model} \
--warmup_ratio=0.05 \
--bf16=True \
--eval_strategy="steps" \
--eval_steps=50 \
--logging_steps=1 \
--lr_scheduler_type="cosine" \
--learning_rate=${lr} \
--weight_decay=1e-4 \
--adam_beta1=0.9 \
--adam_beta2=0.95 \
--output_dir="ckpts/s1_${uid}" \
--save_only_model=True \
--gradient_checkpointing=True \
--save_strategy=no \
--dataset_text_field="text"

@HaitaoWuTJU

success!
