
Instructions to train Doge

English | 简体中文

We provide detailed steps to train Doge in this guide, including pre-training Doge-Base, instruction fine-tuning Doge-Instruct, and reasoning fine-tuning Doge-R1.

Table of Contents

  1. Installation
  2. Pre-training Base model
  3. Instruction Fine-tuning Instruct model
  4. Reasoning Fine-tuning R1 model

1. Installation

Please follow the instructions in README to install the necessary dependencies.

2. Pre-training Base model

We provide a Doge checkpoint that can be further pre-trained on a new dataset. If you need it, please refer to here for more information.

2.1 Download the dataset

For the pre-training dataset, we selected the fineweb-edu-dedup high-quality text and the cosmopedia-v2 synthetic instruction dataset, supplemented with python-edu and finemath to strengthen the model's code and math capabilities.

# Fill in the save path, cache path, and number of processes
python ./examples/utils/download_pt_datasets.py --save_dir ./datasets --cache_dir ./cache --num_proc 1

Note

Due to the large size of the dataset, at least 2TB of storage space is required. If you do not have enough storage space, you can download only part of the dataset here. You can freely change the downloaded dataset; we provide this example just to reproduce the current open-source model.

2.2 Preprocess the dataset

We need to use the tokenizer to convert the dataset into input_ids and attention_mask that the model can accept. If you use LlamaTokenizer, the tokenizer's vocabulary size is 32768, and it uses [INST] and [/INST] to mark instructions. It also includes tool tokens, but we will not use them here. Datasets like cosmopedia-v2 include two fields, prompt and text, which we mark as user content and assistant content.

conversation = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": text},
]
return tokenizer.apply_chat_template(conversation, tokenize=True, padding='max_length', truncation=True, max_length=MAX_LENGTH, return_dict=True)
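In the preprocessing script this runs inside a datasets map call. Below is a minimal, self-contained sketch of that usage; the dataset path and function name are illustrative, and it uses the Doge-tokenizer recommended later in this section:

from datasets import load_from_disk
from transformers import AutoTokenizer

MAX_LENGTH = 2048
tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-tokenizer")
dataset = load_from_disk("./datasets/cosmopedia-v2")  # illustrative path

def tokenize_example(example):
    # Mark prompt as user content and text as assistant content, then tokenize.
    conversation = [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["text"]},
    ]
    return tokenizer.apply_chat_template(
        conversation, tokenize=True, padding="max_length",
        truncation=True, max_length=MAX_LENGTH, return_dict=True,
    )

dataset = dataset.map(tokenize_example, remove_columns=dataset.column_names, num_proc=16)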

Of course, you can also add instruction prompts of your own, for example to make the model answer that it is Doge rather than ChatGPT.

conversation = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am an AI assistant named `Doge`. I am a language model trained by the `SmallDoge` community based on the `Doge` architecture. My task is to provide appropriate answers and support based on the user's questions and requests."},
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": text},
]

Here we recommend using Doge-tokenizer to process the dataset. It is retrained from the Llama-3.3 tokenizer on the smollm-corpus and has a vocabulary size of 32768. The training script can be found here.
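For a rough idea of how such a tokenizer is produced, the sketch below retrains a tokenizer with train_new_from_iterator; the source tokenizer id and the corpus subset streamed here are assumptions, and the actual training script is the one linked above:

from datasets import load_dataset
from transformers import AutoTokenizer

# Assumption: start from a Llama-3.3 tokenizer and retrain it on part of the smollm-corpus.
base_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
corpus = load_dataset("HuggingFaceTB/smollm-corpus", "cosmopedia-v2", split="train", streaming=True)

def text_iterator(batch_size=1000):
    # Yield batches of raw text for the tokenizer trainer.
    batch = []
    for sample in corpus:
        batch.append(sample["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []

doge_tokenizer = base_tokenizer.train_new_from_iterator(text_iterator(), vocab_size=32768)
doge_tokenizer.save_pretrained("./Doge-tokenizer")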

# Fill in the dataset path, save path, tokenizer path, sample number, maximum length, and number of processes
python ./examples/utils/preprocess_pt_datasets.py --datasets_dir ./datasets --save_dir ./datasets --tokenizer_name_or_path SmallDoge/Doge-tokenizer --train_examples 128000000 --test_examples 1000 --max_length 2048 --num_proc 16

Note

We keep only 256B tokens of data, mixed at a ratio of fineweb-edu : cosmopedia-v2 : python-edu : finemath = 7 : 2 : 0.5 : 0.5. If you need to train a larger model, please increase the scale of the dataset yourself.

2.3 Concatenate the dataset

We concatenate the fineweb-edu_tokenized, cosmopedia-v2, python-edu, and finemath datasets into the pretrain dataset. Then we shuffle them with seed=233 and split out 1,000 samples as the test set.

# Fill in the dataset path, save path, sample number, and number of processes
python ./examples/utils/concatenate_pt_datasets.py --datasets_dir ./datasets --save_dir ./datasets --train_examples 128000000 --test_examples 1000 --num_proc 16
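Under the hood, this step roughly corresponds to the following datasets operations (a minimal sketch; the directory names are assumptions, and the actual script linked above also handles sharding and multiprocessing):

from datasets import load_from_disk, concatenate_datasets

# Assumption: each preprocessed dataset was saved to ./datasets/<name> as a single split.
parts = [
    load_from_disk(f"./datasets/{name}")
    for name in ("fineweb-edu_tokenized", "cosmopedia-v2", "python-edu", "finemath")
]
pretrain = concatenate_datasets(parts).shuffle(seed=233)
pretrain = pretrain.train_test_split(test_size=1000, seed=233)  # 1,000 samples become the test set
pretrain.save_to_disk("./datasets/pretrain")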

2.4 Configure the model parameters

We configure a 20M small model for training and testing.

| Model | Params | n_layers | d_model | d_ff | n_heads | kv_heads | n_experts | n_expert_heads | n_expert_pre_head |
|-----------|------|----|------|------|---|---|---|---|---|
| Doge-20M  | 13M  | 8  | 256  | 512  | 2 | 1 | - | - | - |
| Doge-60M  | 54M  | 16 | 512  | 1024 | 4 | 2 | - | - | - |
| Doge-160M | 152M | 24 | 768  | 1536 | 6 | 3 | - | - | - |
| Doge-320M | 335M | 32 | 1024 | 2048 | 8 | 4 | - | - | - |
  • n_layers is the number of decoder layers in the model
  • d_model is the hidden layer dimension of the model
  • n_heads is the number of heads of multi-head attention, d_model // n_heads is best kept above 64

Tip

The Doge-MoE model can inherit the dense activation parameters of the Doge model and increase the sparse activation parameters by setting n_experts, n_expert_heads, and n_expert_pre_head. If you want to increase the model parameters without increasing the computational cost, you can try setting the is_moe parameter of the model configuration to True and adjust the above parameters.
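For orientation, the Doge-20M row of the table above translates into the following hyperparameters. This is a plain sketch whose key names simply mirror the table columns; the authoritative values and exact configuration keys live in recipes/doge/Doge-20M/config_full.yaml:

# Doge-20M architecture hyperparameters, copied from the table above.
doge_20m = {
    "n_layers": 8,   # number of decoder layers
    "d_model": 256,  # hidden dimension
    "d_ff": 512,     # feed-forward dimension
    "n_heads": 2,    # attention heads
    "kv_heads": 1,   # key/value heads
}

# Rule of thumb from the list above: keep the per-head dimension at or above 64.
head_dim = doge_20m["d_model"] // doge_20m["n_heads"]
assert head_dim >= 64  # 256 // 2 = 128, so Doge-20M satisfies this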

2.5 Configure the pre-training hyperparameters

| Model | tokens | max_train_steps | accumulate_steps | learning_rate | scheduler | warmup_ratio | decay_ratio | weight_decay | min_lr_rate |
|-----------|-----|--------|------|------|---------------------|-----|-----|------|-----|
| Doge-20M  | 4B  | 8,000  | 256  | 8e-3 | warmup_stable_decay | 0.1 | 0.1 | 0.01 | 0.0 |
| Doge-60M  | 16B | 16,000 | 512  | 6e-3 | warmup_stable_decay | 0.1 | 0.1 | 0.01 | 0.0 |
| Doge-160M | 32B | 24,000 | 768  | 4e-3 | warmup_stable_decay | 0.1 | 0.1 | 0.01 | 0.0 |
| Doge-320M | 64B | 32,000 | 1024 | 2e-3 | warmup_stable_decay | 0.1 | 0.1 | 0.01 | 0.0 |

Tip

Following the experience of the SmolLM blog, we scale the token budget to roughly 10 times the Chinchilla-optimal ratio. warmup_stable_decay is used so that training can be resumed from a checkpoint on larger datasets at any time; see Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations.
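For intuition, a warmup_stable_decay schedule can be sketched as a plain Python function; this is an illustration of the schedule's shape, not the trainer's exact implementation:

def wsd_lr(step, total_steps, peak_lr, warmup_ratio=0.1, decay_ratio=0.1, min_lr_rate=0.0):
    # Linear warmup, constant plateau, then linear decay to min_lr_rate * peak_lr.
    warmup_steps = int(total_steps * warmup_ratio)
    decay_steps = int(total_steps * decay_ratio)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    if step < stable_end:
        return peak_lr
    progress = (step - stable_end) / max(decay_steps, 1)
    return peak_lr * (1.0 - progress * (1.0 - min_lr_rate))

# Doge-20M values from the table above: 8,000 steps at a peak learning rate of 8e-3.
print(wsd_lr(400, 8_000, 8e-3))    # mid-warmup: 4e-3
print(wsd_lr(4_000, 8_000, 8e-3))  # stable plateau: 8e-3
print(wsd_lr(7_600, 8_000, 8e-3))  # halfway through decay: 4e-3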

2.6 Pre-training the model

We support training the model using Single GPU, DDP, or DeepSpeed ZeRO-2 and ZeRO-3. To switch between these four methods, simply change the path to the accelerate_configs YAML file in the recipes directory.

Note

We do not install DeepSpeed by default because Windows systems do not support it. If you need to use it, please install it yourself.

# You need to specify the configuration file path, all parameters are in the recipe configuration file
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/single_gpu.yaml ./src/small_doge/pt.py --config recipes/doge/Doge-20M/config_full.yaml

Note

The training command above is configured for a 1 x RTX 4090 (24GB) node. For different hardware and topologies, you may need to adjust the batch size and gradient accumulation steps.

2.7 Usage

After training is complete, we can load the model with Transformers' AutoModelForCausalLM and load the tokenizer (a LlamaTokenizer) with AutoTokenizer.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-20M")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-20M", trust_remote_code=True)

inputs = tokenizer("Hey how are you doing?", return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(out))

2.8 Evaluation

We use the lighteval toolkit to evaluate the performance of the Doge model.

You can install the toolkit with the following command:

pip install lighteval

Note

By default, lighteval installs torch==2.4.1, so you may need to create a new environment to install it.

If you are a Linux user, you can use the following command to evaluate the model:

bash ./evaluation/eval_downstream_tasks.sh

If you are a Windows user, you can use the following command to evaluate the model:

. ./evaluation/eval_downstream_tasks.ps1

Tip

You can modify MODEL and OUTPUT_DIR in the script to evaluate different models and save the results to different directories.

3. Instruction Fine-tuning Instruct model

We provide a pre-trained Doge base model that can be directly instruction fine-tuned. If you need it, please refer to here for more information.

3.1 Download the dataset

For the fine-tuning dataset, we selected the smoltalk dataset for SFT and the ultrafeedback_binarized dataset for DPO.

# Fill in the save path, cache path, and number of processes
python ./examples/utils/download_ft_dataset.py --save_dir ./datasets --cache_dir ./cache --num_proc 1

Tip

You can freely change the downloaded dataset. We provide this example just to reproduce the current open-source model.

3.2 Preprocess the dataset

We apply the chat template to the fine-tuning dataset.

# Fill in the dataset path, save path, tokenizer path, and number of processes
python ./examples/utils/preprocess_ft_datasets.py --datasets_dir ./datasets --save_dir ./datasets --tokenizer_name_or_path SmallDoge/Doge-tokenizer --num_proc 8

Tip

You can add some instruction prompts in the template by yourself, such as letting the model answer that it is Doge, not ChatGPT.

3.3 Concatenate the dataset

If you download more datasets for fine-tuning, they need to be concatenated and shuffled so that they are mixed together.

# Fill in the dataset path, save path, sample number, and number of processes
python ./examples/utils/concatenate_ft_datasets.py --datasets_dir ./datasets --save_dir ./datasets --num_proc 16

3.4 Supervised Fine-tuning the model

We first SFT the model to make it generate responses that follow the prompt.

# You need to specify the configuration file path, all parameters are in the recipe configuration file
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/single_gpu.yaml ./src/small_doge/sft.py --config recipes/doge/Doge-20M-Instruct/sft/config_full.yaml

Note

The training command above is configured for a 1 x RTX 4090 (24GB) node. For different hardware and topologies, you may need to adjust the batch size and gradient accumulation steps.
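Conceptually, this stage is a standard supervised fine-tuning run. The sketch below shows roughly what the recipe drives, using TRL; the dataset path and trainer arguments are illustrative assumptions, and the real entry point is ./src/small_doge/sft.py configured by the recipe YAML:

from datasets import load_from_disk
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-tokenizer")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-20M", trust_remote_code=True)
dataset = load_from_disk("./datasets/sft_dataset")  # illustrative path to the preprocessed smoltalk data

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="./Doge-20M-SFT"),  # batch size, steps, etc. come from the recipe YAML
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL versions use tokenizer= instead
)
trainer.train()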

3.5 Direct Preference Optimization the model

Then we use the DPO algorithm to align the model with human preferences after SFT.

# You need to specify the configuration file path, all parameters are in the recipe configuration file
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/single_gpu.yaml ./src/small_doge/dpo.py --config recipes/doge/Doge-20M-Instruct/dpo/config_full.yaml

Note

The training command above is configured for a 1 x RTX 4090 (24GB) node. For different hardware and topologies, you may need to adjust the batch size and gradient accumulation steps.
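The DPO stage consumes preference pairs: each ultrafeedback_binarized sample provides a prompt together with a chosen and a rejected answer. The sketch below illustrates the data shape and a TRL-based training call; the paths, beta value, and example texts are assumptions, and the real entry point is ./src/small_doge/dpo.py configured by the recipe YAML:

from datasets import load_from_disk
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Shape of a single preference example (texts are made up for illustration).
example = {
    "prompt": "Hi, how are you doing today?",
    "chosen": "I'm doing well, thank you! How can I help you?",
    "rejected": "idk",
}

tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-tokenizer")
model = AutoModelForCausalLM.from_pretrained("./Doge-20M-SFT", trust_remote_code=True)  # the SFT checkpoint
dataset = load_from_disk("./datasets/dpo_dataset")  # illustrative path to the preprocessed preference data

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="./Doge-20M-Instruct", beta=0.1),  # beta is illustrative
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()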

3.6 Usage

After fine-tuning is complete, we can load the model with Transformers' AutoModelForCausalLM, load the tokenizer (a LlamaTokenizer) with AutoTokenizer, and use GenerationConfig and TextStreamer to support streaming generation with sampling.

from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig, TextStreamer

tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-20M-Instruct")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-20M-Instruct", trust_remote_code=True)

generation_config = GenerationConfig(
    max_new_tokens=100,
    use_cache=True,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.0,
)
streamer = TextStreamer(tokenizer=tokenizer, skip_prompt=True)

prompt = "Hi, how are you doing today?"

conversation = [
    {"role": "user", "content": prompt},
]
inputs = tokenizer.apply_chat_template(
    conversation=conversation,
    tokenize=True,
    return_tensors="pt",
)

outputs = model.generate(
    inputs,
    tokenizer=tokenizer,
    generation_config=generation_config,
    streamer=streamer,
)

4. Reasoning Fine-tuning R1 model

Currently, reasoning fine-tuning data distilled from teacher models is still relatively scarce. Here we provide the huggingface open-r1 project link. If you need more data, you can use OpenAI's o1 or DeepSeek's R1 model to generate teacher data according to that guide.

4.1 Download the dataset

For the fine-tuning dataset, we selected the Bespoke-Stratos-17k dataset for DFT and the NuminaMath-TIR dataset for GRPO.

# Fill in the save path, cache path, and number of processes
python ./examples/utils/download_ft_dataset.py --save_dir ./datasets --cache_dir ./cache --num_proc 1

Note

If you have completed the Instruction Fine-tuning Instruct model guide and have not changed the download dataset script or deleted the dataset, you can skip this step.

Tip

You can freely change the downloaded dataset.

4.2 Preprocess the dataset

We apply the thinking prompt template to the fine-tuning dataset.

# Fill in the dataset path, save path, tokenizer path, and number of processes
python ./examples/utils/preprocess_ft_datasets.py --datasets_dir ./datasets --save_dir ./datasets --tokenizer_name_or_path SmallDoge/Doge-tokenizer --num_proc 8

Note

If you have completed the Instruction Fine-tuning Instruct model guide and have not changed the preprocess dataset script or deleted the dataset, you can skip this step.

Tip

You can add some behavior instructions to the thinking prompt yourself to build more interesting conversations.

4.3 Concatenate the dataset

If you download more datasets for fine-tuning, they need to be concatenated and shuffled so that they are mixed together.

# Fill in the dataset path, save path, sample number, and number of processes
python ./examples/utils/concatenate_ft_datasets.py --datasets_dir ./datasets --save_dir ./datasets --num_proc 16

Note

If you have completed the Instruction Fine-tuning Instruct model guide and have not changed the concatenate dataset script or deleted the dataset, you can skip this step.

4.4 Distillation Fine-tuning the model

We first run DFT so that the model learns powerful thinking and reasoning capabilities from the teacher model.

# You need to specify the configuration file path, all parameters are in the recipe configuration file
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/single_gpu.yaml ./src/small_doge/sft.py --config recipes/doge/Doge-20M-R1/sft/config_full.yaml

Note

The training command above is configured for a 1 x RTX 4090 (24GB) node. For different hardware and topologies, you may need to adjust the batch size and gradient accumulation steps.

4.5 Group Relative Policy Optimization the model

Then, after DFT, we use the GRPO algorithm to reinforce the model so that it learns to think before answering.

# You need to specify the configuration file path, all parameters are in the recipe configuration file
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/single_gpu.yaml ./src/small_doge/grpo.py --config recipes/doge/Doge-20M-R1/grpo/config_full.yaml

Note

The training command above is configured for a 1 x RTX 4090 (24GB) node. For different hardware and topologies, you may need to adjust the batch size and gradient accumulation steps.
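For intuition, GRPO samples several completions per prompt and scores them with reward functions; a common choice in open-r1-style recipes is a format reward that checks for explicit thinking tags. The sketch below is illustrative: the tag names, reward, paths, and trainer arguments are assumptions, and the real entry point is ./src/small_doge/grpo.py configured by the recipe YAML:

import re
from datasets import load_from_disk
from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    # Reward 1.0 when a completion wraps its reasoning in <think>...</think> before answering
    # (assumes plain-text completions; conversational datasets need an extra unwrapping step).
    pattern = re.compile(r"<think>.*?</think>", re.DOTALL)
    return [1.0 if pattern.search(completion) else 0.0 for completion in completions]

dataset = load_from_disk("./datasets/grpo_dataset")  # illustrative path to the preprocessed NuminaMath-TIR data

trainer = GRPOTrainer(
    model="./Doge-20M-DFT",  # assumption: the DFT checkpoint from the previous step
    reward_funcs=format_reward,
    args=GRPOConfig(output_dir="./Doge-20M-R1", num_generations=4),  # values are illustrative
    train_dataset=dataset,
)
trainer.train()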

4.6 Usage

After fine-tuning is complete, we can load the model with Transformers' AutoModelForCausalLM, load the tokenizer (a LlamaTokenizer) with AutoTokenizer, and use GenerationConfig and TextStreamer to support streaming generation with sampling.

from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig, TextStreamer

tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-20M-R1")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-20M-R1", trust_remote_code=True)

generation_config = GenerationConfig(
    max_new_tokens=1000,
    use_cache=True,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.0,
)
streamer = TextStreamer(tokenizer=tokenizer, skip_prompt=True)

prompt = "Hi, how are you doing today?"

conversation = [
    {"role": "user", "content": prompt},
]
inputs = tokenizer.apply_chat_template(
    conversation=conversation,
    tokenize=True,
    return_tensors="pt",
)

outputs = model.generate(
    inputs,
    tokenizer=tokenizer,
    generation_config=generation_config,
    streamer=streamer,
)