
Missing trainer_state.json #82

Open
anavarroa opened this issue Feb 21, 2025 · 0 comments

I was training TinyZero when my process was unexpectedly interrupted due to another process consuming all available GPU memory. I want to resume training from the last checkpoint at global_step_4100, but I noticed that the trainer_state.json file is missing from my checkpoint directory.

  • My checkpoint directory contains the model weights and tokenizer files but not trainer_state.json.
  • I modified the training script to load my actor and critic models from the latest checkpoint:
    actor_rollout_ref.model.path="/path/to/checkpoints/TinyZero/test-run-4/actor/global_step_4100" \
    critic.model.path="/path/to/checkpoints/TinyZero/test-run-4/critic/global_step_4100"
  • When restarting training, it does not pick up from step 4100. Instead, it starts from step 1 again.
  • I searched for trainer_state.json in my checkpoint directory using find, but it is not there (the sketch after this list shows the equivalent search).
  • I checked previous checkpoints, and they also do not contain trainer_state.json.
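
For reference, the search and the per-checkpoint inspection I did are equivalent to the sketch below; the checkpoint root is a placeholder for my actual path:

    import os
    from pathlib import Path

    # Placeholder root; substitute the actual checkpoint directory.
    root = Path("/path/to/checkpoints/TinyZero/test-run-4")

    # Equivalent of `find <root> -name trainer_state.json`.
    hits = list(root.rglob("trainer_state.json"))
    print("trainer_state.json found:", [str(p) for p in hits] or "none")

    # List what each step checkpoint actually contains.
    for step_dir in sorted(root.glob("*/global_step_*")):
        print(step_dir, "->", sorted(os.listdir(step_dir)))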

Is trainer_state.json necessary to resume training properly? And if so, is there a way to manually create or reconstruct it from the existing checkpoint files? Are there any settings I need to adjust in my training script to ensure proper resumption?
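
To frame the reconstruction question: if creating the file manually is viable, I imagine something like the sketch below. It assumes the HuggingFace-Trainer-style trainer_state.json schema (global_step, epoch, log_history, ...); I don't know whether TinyZero's trainer reads any of these keys, so the file name, location, and every field here are guesses on my part:

    import json
    from pathlib import Path

    # Placeholder path for the checkpoint I want to resume from.
    ckpt_dir = Path("/path/to/checkpoints/TinyZero/test-run-4/actor/global_step_4100")

    # Assumed (HuggingFace-Trainer-style) schema -- whether TinyZero's
    # trainer reads any of these keys is a guess on my part.
    state = {
        "global_step": 4100,           # recovered from the directory name
        "epoch": None,                 # unknown; lost with the original file
        "log_history": [],             # lost with the original file
        "best_metric": None,
        "best_model_checkpoint": None,
    }

    out = ckpt_dir / "trainer_state.json"
    out.write_text(json.dumps(state, indent=2))
    print("wrote", out)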
