
are the model errors on purpose as a demo? #150

Open
mytait opened this issue Feb 23, 2025 · 2 comments

mytait commented Feb 23, 2025

TL;DR: the offline model has several problems that the online demo (which has a paid option) does not:

- generation is cut off at 30 seconds
- random long silences, or dropped words or whole sentences
- random word slurring or noises

In my own setup, every third generation is unusable.

All of these problems make the model unusable for serious applications.

They are not documented anywhere; people only discover them after installation and use.

This issue tracker is full of people complaining about them and spending a lot of time trying to work around them by various means with the few options available.

Yet the online demo does not have these problems: generations are essentially flawless and there is no time limit.

At the same time, we don't seem to have access to the parameters or code used in the online demo.

This leads to the question:

Are these problems intentional? Is this model just a "shareware" demo of the paid service? Back in the day, shareware demos were small advertisements that contained only part of the product, or had hindrances built in, so that it was unusable beyond small "demo" use.

I haven't seen a dev comment on these problems. It's mostly users saying that this is version 0.1 and later versions will improve, which in general is what you would expect.

Maybe I am terribly mistaken and the online demo also has flaws that I haven't seen.

Could the original authors give a statement?

coezbek commented Feb 24, 2025

@mytait I am also exploring the model and am a bit underwhelmed by what it can do out of the box.

  • Can you share a bit what you have tried?
  • Are you using the transformer model or the hybrid one?
  • Which language are you trying to generate sounds for? The docs say that the dataset is predominantly English with 'substantial' data in Chinese, Japanese, French, Spanish, and German. I am trying to have it speak German and wonder how much data they really used.

Regarding the 30 s limitation: this seems to be hard-coded at the moment, and the idea is to do repeated generations which you then string together, as sketched below.
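
A minimal sketch of that chunk-and-stitch approach, assuming a placeholder `generate_chunk()` wrapper around whatever model call you use; the function name, the 44.1 kHz rate, and the inter-chunk pause are my assumptions, not anything exposed by this repo:

```python
import torch

SAMPLE_RATE = 44100  # assumed output sample rate; adjust to your model's codec

def generate_chunk(text: str) -> torch.Tensor:
    # Placeholder for your actual generation call (Gradio app settings,
    # inference script, etc.); assumed to return a (1, num_samples) tensor.
    raise NotImplementedError("plug in your model call here")

def generate_long(text: str, gap_ms: int = 200) -> torch.Tensor:
    # Split on sentence boundaries so each chunk stays well under the ~30 s cap,
    # then concatenate the generated pieces with a short pause in between.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    gap = torch.zeros(1, int(SAMPLE_RATE * gap_ms / 1000))
    pieces = []
    for sentence in sentences:
        pieces.append(generate_chunk(sentence))
        pieces.append(gap)
    return torch.cat(pieces, dim=-1)

# Usage, once generate_chunk is wired up:
# audio = generate_long("First sentence. Second sentence. Third sentence.")
# torchaudio.save("stitched.wav", audio, SAMPLE_RATE)
```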

mytait commented Feb 24, 2025

The best quality out of the box comes from the Gradio app. I am using the transformer model.

The inference code lacks a lot of the functionality in the Gradio app; notably, the Gradio demo uses an audio prefix of 100 ms of silence, which dramatically improves quality. Still, usability is bad.

Also, you should set the seed manually to one you know works, which is trial and error. Set the emotions to unconditional and don't use them; that leaves most of the work to finding a good seed. In general, use the settings from the Gradio app and don't change anything. If your audio gets even close to 30 seconds, errors appear and sentences get dropped, so keep texts short.
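
For what it's worth, here is a rough sketch of those tips (silence prefix, pinned seed, unconditional emotions), with a hypothetical `synthesize()` standing in for whatever your inference script exposes; the 100 ms prefix length and the seed value come from the trial and error described above, not from any documented API:

```python
import torch

SAMPLE_RATE = 44100      # assumed output sample rate; adjust to your model's codec
PREFIX_MS = 100          # the Gradio demo reportedly prepends ~100 ms of silence
KNOWN_GOOD_SEED = 12345  # replace with a seed you have verified by listening

def make_silence_prefix(ms: int = PREFIX_MS) -> torch.Tensor:
    # 100 ms of digital silence, shaped (channels, samples)
    return torch.zeros(1, int(SAMPLE_RATE * ms / 1000))

def synthesize(text: str, audio_prefix: torch.Tensor) -> torch.Tensor:
    # Placeholder for your actual model call: leave emotion conditioning
    # unconditional and copy the remaining settings from the Gradio app.
    raise NotImplementedError("plug in your inference code here")

torch.manual_seed(KNOWN_GOOD_SEED)  # pin the RNG so a known-good generation is reproducible
# audio = synthesize("A short sentence, well under 30 seconds of audio.",
#                    audio_prefix=make_silence_prefix())
```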
