-
Hey, glad to talk to you again! Which diffusion model did you end up trying to train? The problem you mention (where a surrogate loss was required) was unique to the U-net architecture. After I moved over to the "flat diffusion" models (which are just transformer encoders trained with diffusion objectives), I no longer needed the surrogate losses. That is the model architecture used in Tortoise. That said, if you're feeling a bit brave, I would highly recommend using this newer architecture instead: it uses something similar to RRDB with some attention mixed in, and it drops the conditioning input, which I found not to be useful in the diffusion decoder. I have trained one of these models on the Tortoise codes, and its performance (by tts-scores) is superior to what shipped with this repo.
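In case it helps to be concrete, here is a minimal sketch of what I mean by a flat diffusion model: a plain transformer encoder that denoises mel frames, with the discrete codes prepended as conditioning tokens. This is illustrative only, not the actual Tortoise code; all names and dimensions (FlatDiffusionDecoder, n_codes, and so on) are made up.

```python
# Minimal sketch of a "flat diffusion" decoder: a plain transformer encoder
# that denoises mel frames, conditioned on the timestep and discrete codes.
# Names and shapes are illustrative, not Tortoise's actual implementation.
import torch
import torch.nn as nn

class FlatDiffusionDecoder(nn.Module):
    def __init__(self, n_mel=100, n_codes=8192, d_model=512, n_layers=8, n_heads=8):
        super().__init__()
        self.in_proj = nn.Linear(n_mel, d_model)        # noisy mel frame -> token
        self.code_emb = nn.Embedding(n_codes, d_model)  # discrete VQ-VAE codes -> tokens
        self.time_emb = nn.Sequential(                  # diffusion timestep embedding
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, n_mel)       # predict the noise (epsilon)

    def forward(self, noisy_mel, codes, t):
        # noisy_mel: (B, T_mel, n_mel); codes: (B, T_code) long; t: (B,)
        x = self.in_proj(noisy_mel) + self.time_emb(t[:, None, None].float())
        c = self.code_emb(codes)
        # Prepend the code tokens so self-attention can read the conditioning.
        h = self.encoder(torch.cat([c, x], dim=1))
        return self.out_proj(h[:, codes.shape[1]:])     # keep only the mel positions
```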
-
Hi, I am back!
After spending a lot of time on diffusion models, I tried to replicate your experiment with the original repo by lucidrains. I don't have an immense amount of data, so I am trying with a limited number of speakers. I conditioned the U-net on my VQ-VAE codes. Although the speech had some human form, it was complete gibberish in terms of language. I noticed you alleviated this problem by using a surrogate loss, as mentioned here: https://nonint.com/2022/04/04/190/. I am a bit confused: are you using a simple network to compute a loss between the embeddings of the codes and the codes themselves?
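Concretely, is the idea something like the sketch below? This is just my guess at what the post describes; the model returning its code embeddings, code_head, and lambda_surrogate are all placeholders I made up.

```python
import torch.nn.functional as F

def training_loss(model, code_head, noisy_mel, codes, t, target_noise,
                  lambda_surrogate=0.1):
    # Hypothetical: the model returns both the noise prediction and the
    # embeddings it computed for the discrete codes.
    pred_noise, code_emb = model(noisy_mel, codes, t)
    diffusion_loss = F.mse_loss(pred_noise, target_noise)
    # Surrogate term: a small head classifies each embedding back to its
    # code index, keeping the embeddings predictive of the codes.
    logits = code_head(code_emb)                        # (B, T_code, n_codes)
    surrogate_loss = F.cross_entropy(logits.transpose(1, 2), codes)
    return diffusion_loss + lambda_surrogate * surrogate_loss
```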
P.S. I have opened this as a new discussion, so that it's easier for people to navigate and ask questions.