
maisi: NaN occurring during diffusion model training #1926

Open
shengzhang90 opened this issue Jan 22, 2025 · 1 comment

@shengzhang90

Hi,

When I train the diffusion model with the trained VAE autoencoder weights, I encounter NaN losses. The following is part of the log:

lr: [0.0001]
lr: [0.0001]
Epoch 201 train_vae_loss 0.039707845827617515: {'recons_loss': 0.015235490621573968, 'kl_loss': 85897.05040993346, 'p_loss': 0.05294216721683401}.
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
lr: [0.0001]
Epoch 202 train_vae_loss 0.036837538356057416: {'recons_loss': 0.013532256549082612, 'kl_loss': 84689.69741415161, 'p_loss': 0.049454373551865494}.
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 203 train_vae_loss 0.04066830881296869: {'recons_loss': 0.01579887273158354, 'kl_loss': 86930.33920508555, 'p_loss': 0.053921340536255344}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 204 train_vae_loss 0.04466726511873636: {'recons_loss': 0.017411270744553255, 'kl_loss': 86347.42582580798, 'p_loss': 0.06207083930534102}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 205 train_vae_loss 0.039213172850076874: {'recons_loss': 0.014744859089642877, 'kl_loss': 85096.7134475998, 'p_loss': 0.05319547471891338}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
lr: [0.0001]
Epoch 206 train_vae_loss 0.038383807665236705: {'recons_loss': 0.014200327562047841, 'kl_loss': 85760.27710610742, 'p_loss': 0.05202484130859375}.
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 207 train_vae_loss 0.03932591601622308: {'recons_loss': 0.014606543448346422, 'kl_loss': 87033.87854978612, 'p_loss': 0.05338661570966017}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
lr: [0.0001]
Epoch 208 train_vae_loss 0.058224454522186636: {'recons_loss': 0.022603081671181118, 'kl_loss': 143004.20642229088, 'p_loss': 0.07106984069592145}.
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 209 train_vae_loss 0.04279889678179953: {'recons_loss': 0.016451959535408723, 'kl_loss': 91421.0720725404, 'p_loss': 0.057349433463789214}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 210 train_vae_loss 0.04419510093541189: {'recons_loss': 0.017602585414566184, 'kl_loss': 88753.12170270912, 'p_loss': 0.05905734450191599}.
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 210 val_vae_loss nan: {'recons_loss': nan, 'kl_loss': nan, 'p_loss': nan}.
lr: [0.0001]
Epoch 211 train_vae_loss 0.047102449947657665: {'recons_loss': 0.018618881483757056, 'kl_loss': 99892.8082075808, 'p_loss': 0.061647625477141754}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 212 train_vae_loss 0.04109727745322826: {'recons_loss': 0.015783639634453017, 'kl_loss': 88632.56267823195, 'p_loss': 0.054834605169840185}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 213 train_vae_loss 0.04030142836448789: {'recons_loss': 0.015340764416353387, 'kl_loss': 87118.63764704135, 'p_loss': 0.054162667278101234}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 214 train_vae_loss 0.040564939826983476: {'recons_loss': 0.015572800811826333, 'kl_loss': 89774.16198312737, 'p_loss': 0.05338240938948134}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 215 train_vae_loss 0.04143597853471239: {'recons_loss': 0.01541382184043215, 'kl_loss': 96709.1665948788, 'p_loss': 0.054504133449307865}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 216 train_vae_loss 0.04318358794278379: {'recons_loss': 0.015925193886261985, 'kl_loss': 91295.16699590067, 'p_loss': 0.06042959118977246}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 217 train_vae_loss 0.05705382756280092: {'recons_loss': 0.020152890289860986, 'kl_loss': 109472.0783923479, 'p_loss': 0.08651243144568381}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
lr: [0.0001]
Epoch 218 train_vae_loss nan: {'recons_loss': nan, 'kl_loss': nan, 'p_loss': nan}.
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 219 train_vae_loss nan: {'recons_loss': nan, 'kl_loss': nan, 'p_loss': nan}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 220 train_vae_loss nan: {'recons_loss': nan, 'kl_loss': nan, 'p_loss': nan}.
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]

Thanks a lot.

@Can-Zhao
Contributor

This is a general question in DL training. If training from scratch, you could try warming up the learning rate, or turning off AMP for the first few epochs, and see if that helps. If fine-tuning from the MAISI VAE, please use a small lr like 1e-7. Thank you for the feedback. We will add these descriptions to the README.
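
For concreteness, here is a minimal sketch of both suggestions in a plain PyTorch training loop. The model, data, loss, and the `warmup_epochs` / `amp_off_epochs` values are illustrative stand-ins, not the actual MAISI training code:

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

device = "cuda" if torch.cuda.is_available() else "cpu"

# Dummy model/data/loss so the sketch runs end to end; swap in the real
# MAISI VAE (or diffusion model) and data loader here.
model = nn.Linear(16, 16).to(device)
data = [(torch.randn(8, 16, device=device), torch.randn(8, 16, device=device))
        for _ in range(10)]
loss_fn = nn.MSELoss()

# When fine-tuning from the MAISI VAE, use a small lr such as 1e-7 instead.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# 1) Linear warmup: ramp the lr from base_lr / warmup_epochs up to base_lr.
warmup_epochs = 5
scheduler = LambdaLR(optimizer,
                     lr_lambda=lambda e: min(1.0, (e + 1) / warmup_epochs))

# 2) Keep AMP off for the first few epochs, then switch it on.
amp_off_epochs = 5
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())

for epoch in range(20):
    use_amp = epoch >= amp_off_epochs and device == "cuda"
    for x, y in data:
        optimizer.zero_grad(set_to_none=True)
        with torch.autocast(device_type=device, enabled=use_amp):
            loss = loss_fn(model(x), y)
        if use_amp:
            # Scaled backward/step only once AMP is re-enabled.
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            # Full-precision FP32 step during the unstable early epochs.
            loss.backward()
            optimizer.step()
    scheduler.step()
    print(f"epoch {epoch} lr {scheduler.get_last_lr()[0]:.1e} loss {loss.item():.4f}")
```

With this pattern the first few epochs run in full FP32 at a reduced learning rate, which is often enough to get past the unstable early phase; if NaNs still appear when fine-tuning, lowering the base lr toward 1e-7 as suggested above is the next thing to try.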
