-
Hey, glad to talk to you again! Which diffusion model did you end up trying to train? The problem you mention (where a surrogate loss was required) was unique to the U-net architecture. After I moved over to the "flat diffusion" models (which are just transformer encoders trained with diffusion objectives), I no longer needed the surrogate losses. That is the model architecture used in Tortoise. That said, if you're feeling a bit brave, I would highly recommend using this newer architecture instead: it uses something similar to RRDB with some attention mixed in, and it drops the conditioning input, which I found not to be useful in the diffusion decoder. I have trained one of these models on the Tortoise codes, and its performance (by tts-scores) is superior to what shipped with this repo.
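In case it helps to be concrete, here is a minimal sketch of what I mean by a flat diffusion model: a plain transformer encoder that denoises mel frames, with the discrete codes prepended as conditioning tokens. This is illustrative only, not the actual Tortoise code; all names and dimensions (FlatDiffusionDecoder, n_codes, and so on) are made up.

```python
# Minimal sketch of a "flat diffusion" decoder: a plain transformer encoder
# that denoises mel frames, conditioned on the timestep and discrete codes.
# Names and shapes are illustrative, not Tortoise's actual implementation.
import torch
import torch.nn as nn

class FlatDiffusionDecoder(nn.Module):
    def __init__(self, n_mel=100, n_codes=8192, d_model=512, n_layers=8, n_heads=8):
        super().__init__()
        self.in_proj = nn.Linear(n_mel, d_model)        # noisy mel frame -> token
        self.code_emb = nn.Embedding(n_codes, d_model)  # discrete VQ-VAE codes -> tokens
        self.time_emb = nn.Sequential(                  # diffusion timestep embedding
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, n_mel)       # predict the noise (epsilon)

    def forward(self, noisy_mel, codes, t):
        # noisy_mel: (B, T_mel, n_mel); codes: (B, T_code) long; t: (B,)
        x = self.in_proj(noisy_mel) + self.time_emb(t[:, None, None].float())
        c = self.code_emb(codes)
        # Prepend the code tokens so self-attention can read the conditioning.
        h = self.encoder(torch.cat([c, x], dim=1))
        return self.out_proj(h[:, codes.shape[1]:])     # keep only the mel positions
```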
-
Hi, I am back!
After spending a lot of time on diffusion models, I tried to replicate your experiment with the original repo by lucidrains. I don't have an immense amount of data, so I am trying with a limited number of speakers. I conditioned the U-net on my VQ-VAE codes. Although the speech had some human form, it was complete gibberish in terms of language. I noticed you alleviated this problem by using a surrogate loss, as mentioned here: https://nonint.com/2022/04/04/190/. I am a bit confused: are you using a simple network to compute a loss between the embeddings of the codes and the codes themselves?
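Concretely, is the idea something like the sketch below? This is just my guess at what the post describes; the model returning its code embeddings, code_head, and lambda_surrogate are all placeholders I made up.

```python
import torch.nn.functional as F

def training_loss(model, code_head, noisy_mel, codes, t, target_noise,
                  lambda_surrogate=0.1):
    # Hypothetical: the model returns both the noise prediction and the
    # embeddings it computed for the discrete codes.
    pred_noise, code_emb = model(noisy_mel, codes, t)
    diffusion_loss = F.mse_loss(pred_noise, target_noise)
    # Surrogate term: a small head classifies each embedding back to its
    # code index, keeping the embeddings predictive of the codes.
    logits = code_head(code_emb)                        # (B, T_code, n_codes)
    surrogate_loss = F.cross_entropy(logits.transpose(1, 2), codes)
    return diffusion_loss + lambda_surrogate * surrogate_loss
```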
P.S. I have opened this as a new discussion, so that it's easier for people to navigate and ask questions.