Replies: 2 comments 2 replies
Not exactly sure what you mean.

I actually do think you should train a diffusion model (or some other effective decoder) first before moving on to the AR model. That way you'll be better able to tell how well your AR training is going.
Hey.

So the dopamine spike from my half-baked diffusion model has subsided and I want a bigger kick, so I'm starting on the AR model now xD.

Before I start, I'd love your input and help on the following.

As a precursor to the actual thing, I was thinking of learning only the prior, not conditioned on text. It won't give coherent speech, but would you suggest I go with this? Basically something like this
How do you apply a classification head on the transformer for two separate sequence types? One has total categories == vocab_size, and the other has total categories == 8194 (the number of mel codes plus start/stop tokens).
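To make the question concrete, here's a minimal sketch of what I mean by two heads over one shared trunk. All the names and sizes here (`d_model`, `text_vocab_size`, the layer counts) are my own guesses, not from your code:

```python
# Sketch: one shared transformer trunk, two output heads -- one per
# sequence type. Text positions are scored against the text vocabulary,
# mel positions against the 8194 mel-code vocabulary.
import torch
import torch.nn as nn

d_model = 512
text_vocab_size = 256   # assumed text token vocabulary size
MEL_VOCAB = 8194        # 8192 mel codes + start + stop tokens

class DualHeadLM(nn.Module):
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)
        self.text_head = nn.Linear(d_model, text_vocab_size)  # scores text tokens
        self.mel_head = nn.Linear(d_model, MEL_VOCAB)         # scores mel codes

    def forward(self, x, text_len):
        h = self.trunk(x)                              # (B, T, d_model)
        text_logits = self.text_head(h[:, :text_len])  # text portion of sequence
        mel_logits = self.mel_head(h[:, text_len:])    # mel portion of sequence
        return text_logits, mel_logits

model = DualHeadLM()
x = torch.randn(2, 30, d_model)  # batch of 2, 30 positions each
text_logits, mel_logits = model(x, text_len=10)
print(text_logits.shape, mel_logits.shape)
```

So the cross-entropy loss would be computed per region with the matching head, rather than one head covering both vocabularies. Is that roughly the right idea?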
When creating the dataset/sequence, do you pad at the end of this [conditioning_point_latent, text_embeddings, voice_embeddings], or do you pad after text_embeddings to make all sequences in a batch the same length? I was thinking either might work as long as we supply the attention masks accordingly, since the masked positions get ignored in the softmax of the attention mechanism.
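For reference, this is the kind of end-padding plus key-padding-mask setup I had in mind (function and variable names are made up by me):

```python
# Sketch: pad variable-length sequences at the end and build a boolean
# key-padding mask so attention ignores the pad positions, wherever
# they happen to sit.
import torch

def pad_and_mask(seqs, pad_value=0.0):
    """seqs: list of (T_i, d) tensors -> (B, T_max, d) batch + bool mask."""
    t_max = max(s.shape[0] for s in seqs)
    d = seqs[0].shape[1]
    batch = torch.full((len(seqs), t_max, d), pad_value)
    mask = torch.ones(len(seqs), t_max, dtype=torch.bool)  # True = padding
    for i, s in enumerate(seqs):
        batch[i, :s.shape[0]] = s
        mask[i, :s.shape[0]] = False  # real positions are unmasked
    return batch, mask

seqs = [torch.randn(5, 4), torch.randn(8, 4)]
x, key_padding_mask = pad_and_mask(seqs)
attn = torch.nn.MultiheadAttention(4, num_heads=1, batch_first=True)
out, _ = attn(x, x, x, key_padding_mask=key_padding_mask)
print(out.shape)
```

Since the masked keys contribute nothing after the softmax, my intuition is that where the padding physically sits shouldn't matter, only that the mask lines up with it. Does that match your experience?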
I tried to go through your implementation and understand some things, but found some parts quite recondite (for me only, probably not for others!). What exactly is happening in this function? Why is a fake input being passed here instead of the actual text tokens or other meaningful input? How exactly does the model generate the relevant mel codes using only fake inputs as prompts?
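My current guess, and it's purely speculative, is that the "fake input" is something like a learned start/placeholder token that primes autoregressive sampling, with the real conditioning injected elsewhere. A toy sketch of that pattern, nothing here taken from your code:

```python
# Speculative sketch: AR generation primed with a START token rather than
# real text; conditioning enters via the initial hidden state. A GRU stands
# in for the transformer just to keep the example tiny.
import torch
import torch.nn as nn

MEL_VOCAB = 8194
START, STOP = 8192, 8193  # assumed ids for the start/stop tokens

class TinyAR(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.embed = nn.Embedding(MEL_VOCAB, d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.head = nn.Linear(d, MEL_VOCAB)

    @torch.no_grad()
    def generate(self, cond, max_len=20):
        """cond: (B, d) conditioning vector, used as the initial state."""
        h = cond.unsqueeze(0)                        # (1, B, d)
        tok = torch.full((cond.shape[0], 1), START)  # the "fake" prompt
        out = []
        for _ in range(max_len):
            x = self.embed(tok[:, -1:])              # embed last token only
            y, h = self.rnn(x, h)
            nxt = self.head(y[:, -1]).argmax(-1, keepdim=True)
            tok = torch.cat([tok, nxt], dim=1)
            out.append(nxt)
        return torch.cat(out, dim=1)

model = TinyAR()
codes = model.generate(torch.randn(2, 64))
print(codes.shape)
```

Is that the right mental model, or is the fake input doing something else entirely?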