Replies: 2 comments 2 replies
Not exactly sure what you mean.

I actually do think you should train a diffusion model (or some other effective decoder) first before moving on to the AR model. That way you'll be better able to tell how well your AR training is going.
Hey.

So the dopamine spike from my half-baked diffusion model has subsided and I want a bigger kick, so I'm starting on the AR model now xD.

Before I start, I'd love your input and help on the following.

As a precursor to the actual thing, I was thinking of learning only the prior, not conditioned on text. It won't give coherent speech, but would you suggest I go with this? Basically something like this
How do you apply a classification head on the transformer for two separate sequence types? One has total categories == vocab_size, and the other has total categories == 8194 (the number of mel codes plus start/stop tokens).
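To make the question concrete, here's a minimal sketch of what I mean by two heads over one shared trunk. All the names and sizes here (`d_model`, `text_vocab_size`, the layer counts) are my own guesses, not from your code:

```python
# Sketch: one shared transformer trunk, two output heads -- one per
# sequence type. Text positions are scored against the text vocabulary,
# mel positions against the 8194 mel-code vocabulary.
import torch
import torch.nn as nn

d_model = 512
text_vocab_size = 256   # assumed text token vocabulary size
MEL_VOCAB = 8194        # 8192 mel codes + start + stop tokens

class DualHeadLM(nn.Module):
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)
        self.text_head = nn.Linear(d_model, text_vocab_size)  # scores text tokens
        self.mel_head = nn.Linear(d_model, MEL_VOCAB)         # scores mel codes

    def forward(self, x, text_len):
        h = self.trunk(x)                              # (B, T, d_model)
        text_logits = self.text_head(h[:, :text_len])  # text portion of sequence
        mel_logits = self.mel_head(h[:, text_len:])    # mel portion of sequence
        return text_logits, mel_logits

model = DualHeadLM()
x = torch.randn(2, 30, d_model)  # batch of 2, 30 positions each
text_logits, mel_logits = model(x, text_len=10)
print(text_logits.shape, mel_logits.shape)
```

So the cross-entropy loss would be computed per region with the matching head, rather than one head covering both vocabularies. Is that roughly the right idea?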
When creating the dataset/sequence, do you pad at the end of this [conditioning_point_latent, text_embeddings, voice_embeddings], or do you pad after text_embeddings to make all sequences in a batch the same length? I was thinking either might work as long as we supply the attention masks accordingly, since the masked positions get ignored in the softmax of the attention mechanism.
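For reference, this is the kind of end-padding plus key-padding-mask setup I had in mind (function and variable names are made up by me):

```python
# Sketch: pad variable-length sequences at the end and build a boolean
# key-padding mask so attention ignores the pad positions, wherever
# they happen to sit.
import torch

def pad_and_mask(seqs, pad_value=0.0):
    """seqs: list of (T_i, d) tensors -> (B, T_max, d) batch + bool mask."""
    t_max = max(s.shape[0] for s in seqs)
    d = seqs[0].shape[1]
    batch = torch.full((len(seqs), t_max, d), pad_value)
    mask = torch.ones(len(seqs), t_max, dtype=torch.bool)  # True = padding
    for i, s in enumerate(seqs):
        batch[i, :s.shape[0]] = s
        mask[i, :s.shape[0]] = False  # real positions are unmasked
    return batch, mask

seqs = [torch.randn(5, 4), torch.randn(8, 4)]
x, key_padding_mask = pad_and_mask(seqs)
attn = torch.nn.MultiheadAttention(4, num_heads=1, batch_first=True)
out, _ = attn(x, x, x, key_padding_mask=key_padding_mask)
print(out.shape)
```

Since the masked keys contribute nothing after the softmax, my intuition is that where the padding physically sits shouldn't matter, only that the mask lines up with it. Does that match your experience?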
I tried to go through your implementation and understand some things, but found some parts quite recondite (for me only, probably not for others!). What exactly is happening in this function? Why is a fake input being passed here instead of the actual text tokens or other meaningful input? How exactly does the model generate the relevant mel codes using only fake inputs as prompts?
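My current guess, and it's purely speculative, is that the "fake input" is something like a learned start/placeholder token that primes autoregressive sampling, with the real conditioning injected elsewhere. A toy sketch of that pattern, nothing here taken from your code:

```python
# Speculative sketch: AR generation primed with a START token rather than
# real text; conditioning enters via the initial hidden state. A GRU stands
# in for the transformer just to keep the example tiny.
import torch
import torch.nn as nn

MEL_VOCAB = 8194
START, STOP = 8192, 8193  # assumed ids for the start/stop tokens

class TinyAR(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.embed = nn.Embedding(MEL_VOCAB, d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.head = nn.Linear(d, MEL_VOCAB)

    @torch.no_grad()
    def generate(self, cond, max_len=20):
        """cond: (B, d) conditioning vector, used as the initial state."""
        h = cond.unsqueeze(0)                        # (1, B, d)
        tok = torch.full((cond.shape[0], 1), START)  # the "fake" prompt
        out = []
        for _ in range(max_len):
            x = self.embed(tok[:, -1:])              # embed last token only
            y, h = self.rnn(x, h)
            nxt = self.head(y[:, -1]).argmax(-1, keepdim=True)
            tok = torch.cat([tok, nxt], dim=1)
            out.append(nxt)
        return torch.cat(out, dim=1)

model = TinyAR()
codes = model.generate(torch.randn(2, 64))
print(codes.shape)
```

Is that the right mental model, or is the fake input doing something else entirely?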