I was training TinyZero when my process was unexpectedly interrupted due to another process consuming all available GPU memory. I want to resume training from the last checkpoint at global_step_4100, but I noticed that the trainer_state.json file is missing from my checkpoint directory.
My checkpoint directory contains the model weights and tokenizer files but not trainer_state.json.
I modified the training script to load my actor and critic models from the latest checkpoint:
actor_rollout_ref.model.path="/path/to/checkpoints/TinyZero/test-run-4/actor/global_step_4100"
critic.model.path="/path/to/checkpoints/TinyZero/test-run-4/critic/global_step_4100"
When I restart training, it does not resume from step 4100. Instead, it starts again from step 1.
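To confirm which checkpoint is actually the newest one on disk, the step number can be read back from the directory names. This is a minimal sketch, assuming the actor and critic directories each hold sibling folders named `global_step_<N>` as in the paths above; the helper name `latest_global_step` is hypothetical and not part of TinyZero.

```python
import re
from pathlib import Path

# Matches directory names such as "global_step_4100" and captures the number.
STEP_DIR = re.compile(r"^global_step_(\d+)$")


def latest_global_step(parent):
    """Return the highest N among global_step_<N> subdirectories, or None."""
    steps = []
    for child in Path(parent).iterdir():
        match = STEP_DIR.match(child.name)
        if child.is_dir() and match:
            steps.append(int(match.group(1)))
    return max(steps, default=None)


if __name__ == "__main__":
    # Placeholder path taken from the overrides above; adjust to the real run.
    actor_dir = Path("/path/to/checkpoints/TinyZero/test-run-4/actor")
    if actor_dir.is_dir():
        print(latest_global_step(actor_dir))
```

This only identifies the newest saved step. It does not restore the trainer's step counter or optimizer state, which is what the question is about.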
I searched for trainer_state.json in my checkpoint directory using find, but it is not there.
I checked previous checkpoints, and they also do not contain trainer_state.json.
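The same search can be reproduced in Python for anyone who wants to verify it across every saved step at once. This is a minimal sketch; the root path is the placeholder used above, and the helper name `find_state_files` is hypothetical.

```python
from pathlib import Path


def find_state_files(root, name="trainer_state.json"):
    """Return sorted paths of every regular file called `name` under `root`."""
    root_path = Path(root)
    if not root_path.is_dir():
        # A missing root simply means there is nothing to find.
        return []
    return sorted(p for p in root_path.rglob(name) if p.is_file())


if __name__ == "__main__":
    # Placeholder path; point this at the real checkpoint root.
    hits = find_state_files("/path/to/checkpoints/TinyZero/test-run-4")
    if hits:
        for path in hits:
            print(path)
    else:
        print("no trainer_state.json found")
```

An empty result for every checkpoint, as described here, suggests the file was never written rather than lost when the process was interrupted.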
Is trainer_state.json necessary to resume training properly? And if so, is there a way to manually create or reconstruct it from the existing checkpoint files? Are there any settings I need to adjust in my training script to ensure proper resumption?