Zonos-v0.1


Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with—or even surpassing—top TTS providers.

Our model enables highly natural speech generation from text prompts when given a speaker embedding or audio prefix, and can accurately perform speech cloning when given a reference clip spanning just a few seconds. The conditioning setup also allows for fine control over speaking rate, pitch variation, audio quality, and emotions such as happiness, fear, sadness, and anger. The model outputs speech natively at 44kHz.
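The emotion conditioning described above takes a set of weights over the supported emotions. As an illustration only (the helper below and its category ordering are assumptions of this sketch, not the model's actual API; see zonos/conditioning.py for the real interface), here is one way to build a normalized emotion weight vector:

```python
def emotion_vector(happiness=0.0, sadness=0.0, fear=0.0, anger=0.0, neutral=1.0):
    """Build a normalized weight vector over a few emotions Zonos can condition on.

    The category set and ordering here are illustrative assumptions.
    """
    weights = [happiness, sadness, fear, anger, neutral]
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one emotion weight must be positive")
    return [w / total for w in weights]

# Mostly happy with a hint of fear:
vec = emotion_vector(happiness=0.8, fear=0.2, neutral=0.0)
```

Such a vector would then be passed alongside the text and speaker embedding when building the conditioning dictionary.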

For more details and speech samples, check out our blog.
We also have a hosted version available at maia.zyphra.com/audio

Zonos follows a straightforward architecture: text normalization and phonemization via eSpeak, followed by DAC token prediction through a transformer or hybrid backbone. An overview of the architecture can be seen below.

(Architecture overview diagram)
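As a toy illustration of the first stage only (real normalization and phonemization are handled by eSpeak; this stand-in just shows where that step sits in the pipeline):

```python
import shutil
import subprocess

def normalize_text(text: str) -> str:
    # Toy stand-in: collapse whitespace and lowercase.
    # eSpeak's actual normalization (numbers, abbreviations, etc.) is far richer.
    return " ".join(text.split()).lower()

def phonemize(text: str) -> str:
    # Shell out to espeak-ng for IPA phonemes, if it is installed.
    # -q suppresses audio output; --ipa prints IPA phonemes to stdout.
    if shutil.which("espeak-ng") is None:
        return normalize_text(text)  # graceful fallback for this sketch
    out = subprocess.run(
        ["espeak-ng", "-q", "--ipa", normalize_text(text)],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()
```

In Zonos itself, the resulting phoneme sequence is what the backbone consumes before predicting DAC tokens.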

Usage

Python

import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

# Pick a backbone: the hybrid variant is also available.
# model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-hybrid", device="cuda")
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

# Build a speaker embedding from a short reference clip.
wav, sampling_rate = torchaudio.load("assets/exampleaudio.mp3")
speaker = model.make_speaker_embedding(wav, sampling_rate)

# Condition generation on text, speaker, and language.
cond_dict = make_cond_dict(text="Hello, world!", speaker=speaker, language="en-us")
conditioning = model.prepare_conditioning(cond_dict)

# Generate DAC codes, then decode them back to a waveform.
codes = model.generate(conditioning)

wavs = model.autoencoder.decode(codes).cpu()
torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)

Gradio interface (recommended)

python gradio_interface.py

FOR WINDOWS: Gradio will report the address 0.0.0.0:7860. That address does not work; use http://127.0.0.1:7860/ instead.

For repeated sampling we highly recommend using the Gradio interface instead, as the minimal example reloads the model on every run.
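For scripted batch use, an alternative is to load the model once and loop over conditioning dicts; a minimal sketch reusing the API calls from the Python snippet above (the helper name `synthesize_all` is ours, not part of Zonos):

```python
def synthesize_all(model, cond_dicts):
    """Generate one decoded waveform per conditioning dict, reusing a loaded model.

    `model` is a loaded Zonos instance; `cond_dicts` come from make_cond_dict.
    """
    waves = []
    for cond in cond_dicts:
        conditioning = model.prepare_conditioning(cond)
        codes = model.generate(conditioning)
        waves.append(model.autoencoder.decode(codes).cpu())
    return waves
```

For example, `synthesize_all(model, [make_cond_dict(text=t, speaker=speaker, language="en-us") for t in texts])`, saving each result with torchaudio.save as before.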

Features

  • Zero-shot TTS with voice cloning: input the desired text and a 10-30s speaker sample to generate high-quality TTS output
  • Audio prefix inputs: add text plus an audio prefix for even richer speaker matching. Audio prefixes can be used to elicit behaviours such as whispering, which can otherwise be challenging to replicate when cloning from speaker embeddings alone
  • Multilingual support: Zonos-v0.1 supports English, Japanese, Chinese, French, and German
  • Audio quality and emotion control: Zonos offers fine-grained control over many aspects of the generated audio, including speaking rate, pitch, maximum frequency, audio quality, and emotions such as happiness, anger, sadness, and fear
  • Fast: the model runs with a real-time factor of ~2x on an RTX 4090 (i.e. it generates 2 seconds of audio per 1 second of compute time)
  • Gradio WebUI: Zonos ships with an easy-to-use Gradio interface for generating speech
  • Simple installation and deployment: Zonos can be installed and deployed simply using the Dockerfile packaged with the repository
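The ~2x real-time factor translates directly into compute budgets. A quick back-of-the-envelope helper (the factor is hardware-dependent; 2.0 reflects the RTX 4090 figure above):

```python
def generation_time_seconds(audio_seconds: float, real_time_factor: float = 2.0) -> float:
    # With RTF ~2x, each second of compute yields ~2 seconds of audio.
    return audio_seconds / real_time_factor

# A 30-second clip takes roughly 15 seconds of compute on an RTX 4090.
```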

Installation

See also Docker Installation

GPU: recent NVIDIA GPUs are supported (3000-series or newer, 6GB+ VRAM).

Linux: preferably Ubuntu 22.04/24.04.

Windows: tested on Windows 11.

Setup environment

Create a Python 3.10 environment and clone the repository into it.

Zonos depends on the eSpeak library for phonemization.

Linux:

apt install -y espeak-ng

Windows:

Either run this in a command shell with administrator rights:

winget install --id=eSpeak-NG.eSpeak-NG  -e

or use the latest installer from their GitHub releases page:

https://github.com/espeak-ng/espeak-ng/releases

Install project

pip install -r requirements1.txt -r requirements2.txt 
Confirm that it's working

For convenience we provide a minimal example to check that the installation works:

python sample.py

Docker installation

git clone https://github.com/Zyphra/Zonos.git
cd Zonos

# For gradio
docker compose up

# Or for development you can do
docker build -t zonos .
docker run -it --gpus=all --net=host -v /path/to/Zonos:/Zonos zonos
cd /Zonos
python sample.py # this will generate a sample.wav in /Zonos
