Missing trainer_state.json #82

anavarroa · 2025-02-21T08:16:44Z

I was training TinyZero when my process was unexpectedly interrupted due to another process consuming all available GPU memory. I want to resume training from the last checkpoint at global_step_4100, but I noticed that the trainer_state.json file is missing from my checkpoint directory.

My checkpoint directory contains the model weights and tokenizer files but not trainer_state.json.
I modified the training script to load my actor and critic models from the latest checkpoint:
    actor_rollout_ref.model.path="/path/to/checkpoints/TinyZero/test-run-4/actor/global_step_4100"
    critic.model.path="/path/to/checkpoints/TinyZero/test-run-4/critic/global_step_4100"
When restarting training, it does not pick up from step 4100. Instead, it starts from step 1 again.
I searched for trainer_state.json in my checkpoint directory using find, but it is not there.
I checked previous checkpoints, and they also do not contain trainer_state.json.
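
The directory check described above can be scripted. The following is a minimal sketch, assuming checkpoints are stored in directories named global_step_<N> as in the paths above. The helper names latest_step_dir and find_trainer_state are hypothetical and are not part of TinyZero or its training framework. The script only inspects the file system and does not resume training.

```python
import re
import sys
from pathlib import Path

# Assumption: checkpoint directories are named global_step_<N>, as in the paths above.
STEP_DIR = re.compile(r"^global_step_(\d+)$")


def latest_step_dir(root):
    """Return (step, path) for the highest-numbered global_step_<N> directory
    directly under root, or None if there is no such directory."""
    best = None
    for child in Path(root).iterdir():
        match = STEP_DIR.match(child.name)
        if child.is_dir() and match:
            step = int(match.group(1))
            if best is None or step > best[0]:
                best = (step, child)
    return best


def find_trainer_state(root):
    """Recursively list every trainer_state.json under root (may be empty)."""
    return sorted(Path(root).rglob("trainer_state.json"))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python check_ckpt.py /path/to/checkpoints/TinyZero/test-run-4/actor
    root = sys.argv[1]
    print("latest checkpoint:", latest_step_dir(root))
    print("trainer_state.json files:", find_trainer_state(root))
```

Run against the actor or critic directory, it reports the most recent checkpoint step and whether any trainer_state.json exists anywhere beneath it. Step numbers are compared as integers, so global_step_900 sorts before global_step_4100.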

Is trainer_state.json necessary to resume training properly? And if so, is there a way to manually create or reconstruct it from the existing checkpoint files? Are there any settings I need to adjust in my training script to ensure proper resumption?
