resume training does not work for multi-gpus training #23
Comments
@forever208 I have the same problem and the code gets stuck forever. On further investigation, I found that the test script
It is not clear to me why this fails.
@VigneshSrinivasan10 Not sure if it's the same issue, but I also found the code to stop at the same line when running
@bahjat-kawar Thanks for the tip, and sorry for the delay in my response.
@bahjat-kawar Although the model reloading was successful, I still see the loss go to NaN after retraining for a few iterations. All three .pt files were reloaded, but the issue persists. I assumed the opt.pt file should contain the optimizer state needed to continue training. Did you also face this issue?
@VigneshSrinivasan10 I met a similar problem. Any progress on fixing this issue?
solution: remove
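The fix above is truncated, so the thread never says exactly what to remove. A commonly reported workaround for this kind of multi-GPU resume hang (an assumption on my part, not confirmed by this thread) is to bypass the rank-0 read-and-MPI-broadcast in guided_diffusion/dist_util.py and let every rank read the checkpoint from shared storage itself. A minimal sketch of that replacement loader:

```python
import io

import torch


def load_state_dict(path, **kwargs):
    # Sketch of a drop-in replacement for dist_util.load_state_dict:
    # every rank opens the checkpoint file directly, instead of rank 0
    # reading it and broadcasting the bytes over MPI (the step where
    # multi-GPU resume is reported to hang on large checkpoints).
    with open(path, "rb") as f:
        data = f.read()
    return torch.load(io.BytesIO(data), **kwargs)
```

This assumes every rank can see the same filesystem, which the /proj/... paths in the log below suggest is the case on this cluster.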
@forever208 Hello, do you use the opt, model, or ema .pt file when using resume_checkpoint?
@ONobody I use the model checkpoint to do resume training (both ema and opt will then be loaded as well), and the ema checkpoint to do sampling
@forever208 When I continue to train, |
@ONobody exactly |
@forever208 Thank you very much. |
@forever208 Hello, I would like to ask how to train classifier guidance on my own dataset.
@ONobody I have no experience with classifier guidance, sorry for not being able to help you in this case
@forever208 What about computing FID, IS, and the other evaluation metrics?
@ONobody the author provides the instructions: https://github.com/openai/guided-diffusion/tree/main/evaluations |
@forever208
@ONobody if your own dataset has only one class, you can randomly draw 50k samples to form the reference batch. Then generate 50k samples using your trained model and compute the FID by running the script
If your own dataset has more than one class, you'd better use the whole training set as the reference batch. Remember to convert your data into .npz format
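To illustrate the .npz conversion mentioned above: as I understand the evaluation scripts, they expect one uint8 array of shape (N, H, W, 3), stored under NumPy's default arr_0 key. A minimal sketch, assuming your images are already loaded as NumPy arrays (the function name is hypothetical):

```python
import numpy as np


def save_reference_batch(images, out_path):
    # Stack a list of (H, W, 3) uint8 images into a single (N, H, W, 3)
    # array and save it as an .npz reference batch. np.savez stores a
    # positional array under the default key "arr_0".
    batch = np.stack([np.asarray(img, dtype=np.uint8) for img in images])
    assert batch.ndim == 4 and batch.shape[-1] == 3, batch.shape
    np.savez(out_path, batch)
    return batch.shape
```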
@forever208
@ONobody you have to keep them the same size. For example, your training data must be resized to 256 when doing the training. Then your model generates 256x256 samples.
@forever208 When I make an assessment,
@ONobody convert the training data into 256x256 --> train the model --> sample 50k images (256x256) from the model --> convert both the reference batch (256x256) and the 50k samples (256x256) into .npz files --> compute FID
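Since the reference batch and the generated samples must share one resolution, here is a tiny nearest-neighbour resize sketch in plain NumPy (illustration only; a real pipeline would use a proper image library such as PIL):

```python
import numpy as np


def resize_nearest(img, size=256):
    # Nearest-neighbour resize of an (H, W, 3) uint8 image to
    # (size, size, 3) by index lookup; dtype is preserved.
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows[:, None], cols]
```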
@forever208 Thank you very much. |
Thanks! I met the same problem, and this works in my code!
I add
--resume_checkpoint $path_to_checkpoint$
to continue the training. It works on a single GPU but does not work for multi-GPU training; the code gets stuck here:
Logging to /proj/ihorse_2021/users/x_manni/guided-diffusion/log9
creating model and diffusion...
creating data loader...
start training...
loading model from checkpoint: /proj/ihorse_2021/users/x_manni/guided-diffusion/log9/model200000.pt...