Attaining arbitrarily long audio generation using chunked generation and latent space interpolation #101
* Hardcoded window size and overlap
* Changed from linear interpolation to sinusoidal
* Moved UI around, added text
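The switch from linear to sinusoidal interpolation can be illustrated with a small sketch. This assumes "sinusoidal" means a raised-cosine ramp, which is one common reading; the PR's actual curve may differ, and the function names here are hypothetical.

```python
import numpy as np

def linear_weights(n: int) -> np.ndarray:
    """Fade-in weights rising linearly from 0 to 1 over n steps."""
    return np.linspace(0.0, 1.0, n)

def sinusoidal_weights(n: int) -> np.ndarray:
    """Raised-cosine fade-in (assumed shape, not confirmed by the PR).

    Same endpoints as the linear ramp, but the slope is zero at both
    ends, so the blend eases in and out of the overlap region.
    """
    t = np.linspace(0.0, 1.0, n)
    return 0.5 * (1.0 - np.cos(np.pi * t))
```

With either ramp, the fade-out weight is `1 - w`, so the two weights always sum to one across the overlap.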
amazing improvement
Interesting approach with latents. I spoke to the team 1-2 days ago about this. Internally they just splice it, generate the pieces individually, and stitch them together on the production API (that was the information I got). One option they recommended to make the transitions smooth is to pass the last 2-3 words as prefix audio and cut that out of the second generation (the text has to be prefixed too). That would maybe allow infinite length in theory, but you would eventually need some ASR to prefix the text chunks too. The ideal solution is probably somewhere in the middle. Thanks for that approach.
I haven't had a chance to try the playground version. Does it perform better doing it that way, and is it consistent? What's ASR in that context?
ASR as in Whisper, pretty much STT. Otherwise it would be hard to know when to cut off and what to feed back in. The text has to be prefixed because that's just the way prefixes work. The playground has a few differences from what we have in OSS, namely that they seem to use different samplers internally, although the model being inferenced is the transformer.
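The text side of the prefix idea described above might look roughly like the following. This is only a sketch of the bookkeeping: `prefixed_chunks` is a hypothetical helper, and the audio side (passing the matching prefix audio and trimming it from the next generation, likely with ASR alignment) is not shown.

```python
from typing import List, Tuple

def prefixed_chunks(chunks: List[str], n_words: int = 3) -> List[Tuple[str, str]]:
    """Pair each text chunk with the last n_words of the previous chunk.

    Returns (prefix_text, full_text) tuples. full_text is what would be
    sent to the model; prefix_text marks the part whose audio would be
    cut from the generated result afterwards.
    """
    out: List[Tuple[str, str]] = []
    prev_words: List[str] = []
    for chunk in chunks:
        # Guard n_words == 0, since prev_words[-0:] would return the whole list.
        prefix = " ".join(prev_words[-n_words:]) if n_words > 0 else ""
        full = f"{prefix} {chunk}".strip()
        out.append((prefix, full))
        prev_words = chunk.split()
    return out
```

The first chunk gets an empty prefix, so it is generated as-is.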
Ah Whisper, gotcha. Do you think the performance of my solution won't be enough to make it into upstream? Or do you want to do that other approach eventually and won't use this?
No man, I think your approach is super interesting, and something I would not have thought about. I was merely relaying the conversations I had with the team to find out how they do it and what ideas they have. Ideally someone would find something that works for arbitrary length and for Mamba too, but this is a very cool approach already!
This is quite important and basically has to be done for every TTS. Otherwise we have a hard limit on length.
I merged the upstream changes in for the sampler, but it creates dramatically worse results for me now. Not sure why yet.
@@ -10,7 +10,10 @@ services:
 network_mode: "host"
 stdin_open: true
 tty: true
-command: ["python3", "gradio_interface.py"]
+command: ["bash", "-c", "pip install nltk && python3 -c 'import nltk; nltk.download(\"punkt\"); nltk.download(\"punkt_tab\")' && python3 gradio_interface.py"]
Why not install nltk with the rest of the Python packages? Furthermore, don't you already download punkt on line 99 of gradio_interface.py?
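The punkt download in the diff suggests the PR splits the input text into sentences before chunking. A minimal sketch of that idea follows, with a plain regex standing in for NLTK's tokenizer so the example stays stdlib-only; the helper names and the `max_chars` budget are assumptions, not values from the PR.

```python
import re
from typing import List

def split_sentences(text: str) -> List[str]:
    """Naive stand-in for a real sentence tokenizer such as NLTK's punkt."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def pack_sentences(sentences: List[str], max_chars: int = 200) -> List[str]:
    """Greedily group whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept as its own chunk
    rather than being split mid-sentence.
    """
    chunks: List[str] = []
    current = ""
    for s in sentences:
        candidate = f"{current} {s}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Splitting on sentence boundaries keeps each chunk a natural unit of speech, which should make the seams between generated segments less audible.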
Attaining arbitrarily long audio generation using chunked generation and latent space interpolation
Overview
This PR adds chunked generation with latent space interpolation, intended for voice cloning with the transformer model variant (not the hybrid). The implementation overlaps windows in the latent space to maintain coherence across chunk boundaries.
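As a rough illustration of overlapping windows, the sketch below stitches a list of latent chunks by crossfading a fixed number of frames at each seam. It assumes latents are arrays with time on the last axis and uses a raised-cosine fade; `stitch_chunks` and its parameters are hypothetical and not taken from the PR's code.

```python
import numpy as np

def stitch_chunks(chunks, overlap: int) -> np.ndarray:
    """Join latent chunks, blending `overlap` frames at each boundary.

    Each chunk is an array whose last axis is time and has at least
    `overlap` frames; `overlap` must be positive.
    """
    # Fade-in weights for the incoming chunk; the outgoing chunk gets 1 - fade_in.
    fade_in = 0.5 * (1.0 - np.cos(np.pi * np.linspace(0.0, 1.0, overlap)))
    out = chunks[0]
    for nxt in chunks[1:]:
        blended = (1.0 - fade_in) * out[..., -overlap:] + fade_in * nxt[..., :overlap]
        out = np.concatenate([out[..., :-overlap], blended, nxt[..., overlap:]], axis=-1)
    return out
```

Each seam shortens the total by `overlap` frames, so N chunks of T frames yield `N * T - (N - 1) * overlap` frames.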
Important Usage Notes
Key Changes
Core Generation
Gradio Interface
Technical Implementation
Misc.
Limitations/Improvements Needed
Examples
With Latent Windowing (123 seconds)
latent.windowing.4.mp4
Without Latent Windowing (46 seconds)
regular_3.mp4