This is a general question in DL training. If training from scratch, you could try a learning-rate warm-up, or turn off AMP for the first few epochs, and see if that helps. If fine-tuning from the MAISI VAE, please use a small learning rate such as 1e-7. Thank you for the feedback; we will add these descriptions to the README.
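The two suggestions above can be sketched in a few lines. This is a minimal illustration, not the MAISI training code: the function name, base_lr, and warmup_epochs values are placeholders chosen for the example.

```python
# A minimal sketch of a linear learning-rate warm-up, assuming a base LR
# of 1e-4 (as in the log below) and a 5-epoch ramp. All names and values
# here are illustrative, not taken from the MAISI training scripts.
def warmup_lr(epoch: int, base_lr: float = 1e-4, warmup_epochs: int = 5) -> float:
    """Ramp the LR linearly from base_lr/warmup_epochs up to base_lr."""
    return base_lr * min(1.0, (epoch + 1) / warmup_epochs)

# In PyTorch, the same multiplier can be handed to a scheduler, e.g.:
#   scheduler = torch.optim.lr_scheduler.LambdaLR(
#       optimizer, lr_lambda=lambda e: min(1.0, (e + 1) / 5))
# Similarly, AMP can be kept off for the first epochs by wrapping the
# forward pass in
#   torch.autocast(device_type="cuda", enabled=epoch >= amp_off_epochs)
# where amp_off_epochs is a hypothetical cutoff you would tune.
```

After `warmup_epochs` epochs the multiplier saturates at 1.0 and the schedule returns the base learning rate unchanged.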
Hi,
When I train the diffusion model with the trained VAE autoencoder weights, I encounter a NaN loss. The following is part of the log:
lr: [0.0001]
lr: [0.0001]
Epoch 201 train_vae_loss 0.039707845827617515: {'recons_loss': 0.015235490621573968, 'kl_loss': 85897.05040993346, 'p_loss': 0.05294216721683401}.
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
lr: [0.0001]
Epoch 202 train_vae_loss 0.036837538356057416: {'recons_loss': 0.013532256549082612, 'kl_loss': 84689.69741415161, 'p_loss': 0.049454373551865494}.
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 203 train_vae_loss 0.04066830881296869: {'recons_loss': 0.01579887273158354, 'kl_loss': 86930.33920508555, 'p_loss': 0.053921340536255344}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 204 train_vae_loss 0.04466726511873636: {'recons_loss': 0.017411270744553255, 'kl_loss': 86347.42582580798, 'p_loss': 0.06207083930534102}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 205 train_vae_loss 0.039213172850076874: {'recons_loss': 0.014744859089642877, 'kl_loss': 85096.7134475998, 'p_loss': 0.05319547471891338}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
lr: [0.0001]
Epoch 206 train_vae_loss 0.038383807665236705: {'recons_loss': 0.014200327562047841, 'kl_loss': 85760.27710610742, 'p_loss': 0.05202484130859375}.
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 207 train_vae_loss 0.03932591601622308: {'recons_loss': 0.014606543448346422, 'kl_loss': 87033.87854978612, 'p_loss': 0.05338661570966017}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
lr: [0.0001]
Epoch 208 train_vae_loss 0.058224454522186636: {'recons_loss': 0.022603081671181118, 'kl_loss': 143004.20642229088, 'p_loss': 0.07106984069592145}.
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 209 train_vae_loss 0.04279889678179953: {'recons_loss': 0.016451959535408723, 'kl_loss': 91421.0720725404, 'p_loss': 0.057349433463789214}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 210 train_vae_loss 0.04419510093541189: {'recons_loss': 0.017602585414566184, 'kl_loss': 88753.12170270912, 'p_loss': 0.05905734450191599}.
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 210 val_vae_loss nan: {'recons_loss': nan, 'kl_loss': nan, 'p_loss': nan}.
lr: [0.0001]
Epoch 211 train_vae_loss 0.047102449947657665: {'recons_loss': 0.018618881483757056, 'kl_loss': 99892.8082075808, 'p_loss': 0.061647625477141754}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 212 train_vae_loss 0.04109727745322826: {'recons_loss': 0.015783639634453017, 'kl_loss': 88632.56267823195, 'p_loss': 0.054834605169840185}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 213 train_vae_loss 0.04030142836448789: {'recons_loss': 0.015340764416353387, 'kl_loss': 87118.63764704135, 'p_loss': 0.054162667278101234}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 214 train_vae_loss 0.040564939826983476: {'recons_loss': 0.015572800811826333, 'kl_loss': 89774.16198312737, 'p_loss': 0.05338240938948134}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 215 train_vae_loss 0.04143597853471239: {'recons_loss': 0.01541382184043215, 'kl_loss': 96709.1665948788, 'p_loss': 0.054504133449307865}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 216 train_vae_loss 0.04318358794278379: {'recons_loss': 0.015925193886261985, 'kl_loss': 91295.16699590067, 'p_loss': 0.06042959118977246}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 217 train_vae_loss 0.05705382756280092: {'recons_loss': 0.020152890289860986, 'kl_loss': 109472.0783923479, 'p_loss': 0.08651243144568381}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
lr: [0.0001]
Epoch 218 train_vae_loss nan: {'recons_loss': nan, 'kl_loss': nan, 'p_loss': nan}.
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 219 train_vae_loss nan: {'recons_loss': nan, 'kl_loss': nan, 'p_loss': nan}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 220 train_vae_loss nan: {'recons_loss': nan, 'kl_loss': nan, 'p_loss': nan}.
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Thanks a lot.
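One detail worth noting in the log above: checkpoints keep being written after the losses turn NaN (epochs 218–220), so the saved autoencoder.pt can end up overwritten with corrupted weights. A simple defensive measure is to guard the save with a finiteness check. This is a hedged sketch; `should_checkpoint` is a hypothetical helper, not part of the MAISI code.

```python
import math

# Only save model weights when every tracked loss is finite, so a NaN
# collapse does not overwrite the last good checkpoint. The helper name
# and the loss-dict layout mirror the log above but are illustrative.
def should_checkpoint(losses: dict) -> bool:
    """Return True if all loss values are finite (no NaN/inf)."""
    return all(math.isfinite(v) for v in losses.values())
```

In the training loop, this check would wrap the `torch.save(...)` calls, and the same test on the per-batch loss can be used to skip the optimizer step for that batch.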