
are the model errors on purpose as a demo? #150

Open
mytait opened this issue Feb 23, 2025 · 2 comments

mytait commented Feb 23, 2025

TL;DR: the offline model has several problems that the online demo (which has a paid option) does not:

- generation is cut off at 30 seconds
- random long silences, or dropped words or whole sentences
- random word slurring or noises

In my own setup, every third generation is unusable.

All of these problems make the model unusable for serious applications.

They are not documented anywhere; people only discover them after installation and use.

This issue tracker is full of people complaining about them and spending a lot of time trying to work around them by various means with the few options available.

Yet the online demo does not have these problems: generations are essentially flawless and there is no time limit.

At the same time, we don't seem to have access to the parameters or code used in the online demo.

This leads to the question:

Are these problems intentional? Is this model just a "shareware" demo of the paid service? Back in the day, shareware demos were small advertisements that contained only part of the product, or had hindrances built in, so that it was unusable beyond small "demo" use.

I haven't seen a dev comment on these problems. It's mostly users saying that this is version 0.1 and later versions will improve, which in general is what you would expect.

Maybe I am terribly mistaken and the online demo also has flaws that I haven't seen.

Could the original authors give a statement?

coezbek commented Feb 24, 2025

@mytait I am also exploring the model and am a bit underwhelmed by what it can do out of the box.

  • Can you share a bit what you have tried?
  • Are you using the transformer model or the hybrid one?
  • Which language are you trying to generate sounds for? The docs say that the dataset is predominantly English with 'substantial' data in Chinese, Japanese, French, Spanish, and German. I am trying to have it speak German and wonder how much data they really used.

Regarding the 30 s limitation: this seems to be hard-coded at the moment, and the idea is to do repeated generations which you then string together, as sketched below.
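
A minimal sketch of that chunk-and-stitch approach, assuming a placeholder `generate_chunk()` wrapper around whatever model call you use; the function name, the 44.1 kHz rate, and the inter-chunk pause are my assumptions, not anything exposed by this repo:

```python
import torch

SAMPLE_RATE = 44100  # assumed output sample rate; adjust to your model's codec

def generate_chunk(text: str) -> torch.Tensor:
    # Placeholder for your actual generation call (Gradio app settings,
    # inference script, etc.); assumed to return a (1, num_samples) tensor.
    raise NotImplementedError("plug in your model call here")

def generate_long(text: str, gap_ms: int = 200) -> torch.Tensor:
    # Split on sentence boundaries so each chunk stays well under the ~30 s cap,
    # then concatenate the generated pieces with a short pause in between.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    gap = torch.zeros(1, int(SAMPLE_RATE * gap_ms / 1000))
    pieces = []
    for sentence in sentences:
        pieces.append(generate_chunk(sentence))
        pieces.append(gap)
    return torch.cat(pieces, dim=-1)

# Usage, once generate_chunk is wired up:
# audio = generate_long("First sentence. Second sentence. Third sentence.")
# torchaudio.save("stitched.wav", audio, SAMPLE_RATE)
```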

mytait commented Feb 24, 2025

The best quality out of the box comes from the Gradio app. I am using the transformer model.

The inference code lacks a lot of the functionality in the Gradio app; notably, the Gradio demo uses an audio prefix of 100 ms of silence, which dramatically improves quality. Still, usability is bad.

Also, you should set the seed manually to one you know works, which is trial and error. Set the emotions to unconditional and don't use them; that leaves most of the work to finding a good seed. In general, use the settings from the Gradio app and don't change anything. If your audio gets even close to 30 seconds, errors appear and sentences get dropped, so keep texts short.
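
For what it's worth, here is a rough sketch of those tips (silence prefix, pinned seed, unconditional emotions), with a hypothetical `synthesize()` standing in for whatever your inference script exposes; the 100 ms prefix length and the seed value come from the trial and error described above, not from any documented API:

```python
import torch

SAMPLE_RATE = 44100      # assumed output sample rate; adjust to your model's codec
PREFIX_MS = 100          # the Gradio demo reportedly prepends ~100 ms of silence
KNOWN_GOOD_SEED = 12345  # replace with a seed you have verified by listening

def make_silence_prefix(ms: int = PREFIX_MS) -> torch.Tensor:
    # 100 ms of digital silence, shaped (channels, samples)
    return torch.zeros(1, int(SAMPLE_RATE * ms / 1000))

def synthesize(text: str, audio_prefix: torch.Tensor) -> torch.Tensor:
    # Placeholder for your actual model call: leave emotion conditioning
    # unconditional and copy the remaining settings from the Gradio app.
    raise NotImplementedError("plug in your inference code here")

torch.manual_seed(KNOWN_GOOD_SEED)  # pin the RNG so a known-good generation is reproducible
# audio = synthesize("A short sentence, well under 30 seconds of audio.",
#                    audio_prefix=make_silence_prefix())
```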
