"Which is the best model?" / "Which model should I use?" - answer from the owner of this repo #133
@beveradb |
Update: in the latest version of It's documented here but basically, you can now filter and sort by SDR score for different stems. e.g. if you want to find the best vocal models:
or the best instrumental models:
or find out what options you have to isolate a specific stem, e.g. drums or bass:
The unfiltered list (
The SDR scores are calculated using the
Hope this helps! 🙇
"Which is the best model for X?"
This is a difficult question to answer, and I'm a little tired of repeating myself as this gets asked a lot, so I'm making this discussion thread as a way for me to share and update my public opinion on this whenever it changes.
I'd also like it to be a place for people to learn, share their experiences and update the community 😄
These are my current recommendations as of October 2024; I'll try to update this whenever we gain support for new models.
In short: there is no single model which is always the best option! It depends on a variety of inputs.
Below, I will explain each of these decision making inputs separately, so you can make an informed decision for yourself.
What you are trying to achieve
This may seem obvious, but people are using audio-separator for a wide variety of purposes.
Let's look at a few desired outputs, and I'll give examples of models to try.
Input audio: mixed music with vocals (e.g. commercial music heard on the radio).
Goal: mixed instrumental music without any vocals
This is a common use case, and there are many models to choose from which serve this purpose.
Which one provides the best results for you will depend on the other inputs below (input audio, performance/hardware requirements).
For this purpose, I currently recommend one of these models:
model_bs_roformer_ep_317_sdr_12.9755.ckpt
MDX23C-8KFFT-InstVoc_HQ_2.ckpt
UVR-MDX-NET-Inst_HQ_4.onnx
2_HP-UVR.pth
Goal: mixed vocals
This may seem the same as the previous use case as most of the two-stem instrumental models also output a vocals stem, but it is not.
It is definitely still worth trying the instrumental models above, especially the latest supported architecture (Roformer), e.g. model_bs_roformer_ep_317_sdr_12.9755.ckpt.
However, several models have been trained specifically to focus on outputting clear vocals, so if your focus is on vocals, these are worth trying:
Kim_Vocal_2.onnx
kuielab_a_vocals.onnx
4_HP-Vocal-UVR.pth
Goal: mixed instrumental music with backing vocals (BV)
This is my primary use case, as I make karaoke videos daily and I enjoy singing karaoke with backing vocals retained.
For this purpose, I currently recommend one of these models:
mel_band_roformer_karaoke_aufr33_viperx_sdr_10.1956.ckpt (full spectrum, very crisp, but frequently retains too much of the lead vocals)
UVR_MDXNET_KARA_2.onnx (great at retaining only backing vocals, but has a frequency cap so the output sounds a little muddy)
6_HP-Karaoke-UVR.pth (I haven't used this much, but a couple of quick tests gave pretty good results)
However, a two-step approach will often provide better results.
First, separate the mixed vocals from the instrumental (e.g. using model_bs_roformer_ep_317_sdr_12.9755.ckpt or Kim_Vocal_2.onnx), then separate the backing vocals from the lead vocals using a karaoke / backing vocals model, such as one of the karaoke models above, or one specifically designed for that purpose, such as:
UVR-BVE-4B_SN-44100-1.pth
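The two-step approach above could be scripted roughly as follows. This is only a sketch under assumptions: the helper names (pick_stem, two_step_backing_vocals, separate_with_audio_separator) are made up for illustration, and it assumes the output filenames contain the stem name in parentheses, e.g. "(Vocals)", and that separate() returns paths that can be passed straight back in as input. Check both against what your installed version actually writes. Mixing the backing vocals back into the instrumental is not covered here.

```python
from typing import Callable, Sequence

# A separation function takes (model_filename, audio_path) and returns output paths.
SeparateFn = Callable[[str, str], Sequence[str]]


def pick_stem(output_files: Sequence[str], stem: str) -> str:
    """Return the first output whose filename contains '(<stem>)', case-insensitively."""
    marker = f"({stem})".lower()
    for path in output_files:
        if marker in path.lower():
            return path
    raise LookupError(f"no output file looks like the {stem!r} stem: {list(output_files)}")


def two_step_backing_vocals(
    audio_path: str,
    separate: SeparateFn,
    vocals_model: str = "model_bs_roformer_ep_317_sdr_12.9755.ckpt",
    backing_model: str = "UVR-BVE-4B_SN-44100-1.pth",
) -> dict:
    """Step 1: split vocals from instrumental. Step 2: split the vocals stem again."""
    first_pass = separate(vocals_model, audio_path)
    mixed_vocals = pick_stem(first_pass, "Vocals")
    second_pass = separate(backing_model, mixed_vocals)
    return {
        "instrumental": pick_stem(first_pass, "Instrumental"),
        # The stem names produced by the second model are not assumed here,
        # so all of its outputs are returned for you to inspect.
        "vocal_split": list(second_pass),
    }


def separate_with_audio_separator(model_filename: str, audio_path: str) -> Sequence[str]:
    """Run one model through audio-separator's Python API."""
    # Imported lazily so the helpers above can be used without the package installed.
    from audio_separator.separator import Separator

    separator = Separator()
    separator.load_model(model_filename=model_filename)
    return separator.separate(audio_path)
```

Usage would then be something like two_step_backing_vocals("song.wav", separate_with_audio_separator), with "song.wav" standing in for your own file.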
Goal: individual instrument stems
This is useful for musicians looking to sample specific parts of other music, folks practicing a particular part (e.g. a bass riff or guitar solo), or just curious folks looking to learn and experience music in another way.
For this purpose, I currently recommend one of these models:
htdemucs_6s.yaml (this produces 6 stems: Drums, Bass, Guitar, Piano, Other, Vocals)
kuielab_a_bass.onnx
kuielab_a_drums.onnx
kuielab_a_other.onnx
Goal: remove background noise from a recording
Noise removal is not an area I have much experience in, but there are some models designed for this purpose.
I've tested a few and had some interesting results, but your experience may vary depending on the kind of noise etc.
denoise_mel_band_roformer_aufr33_sdr_27.9959.ckpt
denoise_mel_band_roformer_aufr33_aggr_sdr_27.9768.ckpt
UVR-DeNoise.pth
UVR-DeNoise-Lite.pth
Goal: remove reverb and/or echo from a recording
Reverb / echo removal is not an area I have much experience in, but there are some models designed for this purpose.
I've tested a few and had some impressive results, but your experience may vary depending on the input audio.
deverb_bs_roformer_8_384dim_10depth.ckpt
Reverb_HQ_By_FoxJoy.onnx
UVR-DeEcho-DeReverb.pth
These are specifically labeled as removing echo; whether they also remove reverb or if there's a lot of overlap is unclear to me:
UVR-De-Echo-Normal.pth
UVR-De-Echo-Aggressive.pth
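If you script your separations, the recommendations above can be collected into a small lookup table so it is easy to loop over the candidates for a given goal. This is just a convenience sketch: the model filenames are copied from the lists above, but the goal keys and the candidates_for helper are made up for illustration and are not part of audio-separator.

```python
# Model filenames copied from the recommendations above (as of October 2024).
# The dictionary keys are hypothetical labels chosen for this sketch.
RECOMMENDED_MODELS = {
    "instrumental": [
        "model_bs_roformer_ep_317_sdr_12.9755.ckpt",
        "MDX23C-8KFFT-InstVoc_HQ_2.ckpt",
        "UVR-MDX-NET-Inst_HQ_4.onnx",
        "2_HP-UVR.pth",
    ],
    "vocals": [
        "Kim_Vocal_2.onnx",
        "kuielab_a_vocals.onnx",
        "4_HP-Vocal-UVR.pth",
    ],
    "karaoke": [
        "mel_band_roformer_karaoke_aufr33_viperx_sdr_10.1956.ckpt",
        "UVR_MDXNET_KARA_2.onnx",
        "6_HP-Karaoke-UVR.pth",
    ],
    "stems": [
        "htdemucs_6s.yaml",
        "kuielab_a_bass.onnx",
        "kuielab_a_drums.onnx",
        "kuielab_a_other.onnx",
    ],
    "denoise": [
        "denoise_mel_band_roformer_aufr33_sdr_27.9959.ckpt",
        "denoise_mel_band_roformer_aufr33_aggr_sdr_27.9768.ckpt",
        "UVR-DeNoise.pth",
        "UVR-DeNoise-Lite.pth",
    ],
    "dereverb": [
        "deverb_bs_roformer_8_384dim_10depth.ckpt",
        "Reverb_HQ_By_FoxJoy.onnx",
        "UVR-DeEcho-DeReverb.pth",
    ],
    "deecho": [
        "UVR-De-Echo-Normal.pth",
        "UVR-De-Echo-Aggressive.pth",
    ],
}


def candidates_for(goal: str) -> list[str]:
    """Return a copy of the candidate model filenames for a goal, in listed order."""
    try:
        return list(RECOMMENDED_MODELS[goal])
    except KeyError:
        known = ", ".join(sorted(RECOMMENDED_MODELS))
        raise ValueError(f"unknown goal {goal!r}; expected one of: {known}") from None
```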
What input audio you are using
Different styles of input audio will get better results with different models, and it's very hard for me to give any consistent rule to follow here.
Just try a few different models (ideally from a few different architectures e.g. Roformer, MDXC, VR, MDX, Demucs) and see what sounds best to your ears!
In general though, for the last few months, the Roformer model model_bs_roformer_ep_317_sdr_12.9755.ckpt has been my go-to for a clean, full-spectrum separation for most input audio.
I sometimes get better results from MDX23C-8KFFT-InstVoc_HQ_2.ckpt, and for some niche songs (e.g. much older audio) I think the VR architecture models, e.g. 2_HP-UVR.pth, actually deliver better results.
Your speed vs. quality vs. hardware trade-off
Everyone wants separation to be fast. But fast is subjective, and processing speed depends hugely on your environment, hardware, and model choice.
If you've chosen a model which delivers great quality results to your ears, but it takes too long to process for your use case, you might be able to "throw money at it" by running the inference on a machine with a powerful Nvidia GPU with CUDA.
That's what I do for one of my karaoke use cases: audio-separator runs on a Runpod serverless endpoint with an A100 GPU.
If you're just doing instrumental/vocals separation, you also have a lot of models to choose from across 5 different architectures, and each of those has dramatically different performance.
For example, on my current laptop (a 2023 MacBook Pro with M3 Max CPU/GPU), instrumental/vocals separation of the same 3-minute pop song takes wildly different amounts of time with different models:
For my purpose (making karaoke videos), the better quality separation offered by the Roformer and MDX23C models is worth the extra inference time. For another person or use case, the 2_HP-UVR model might be the better choice, as it provides "good enough" separation with much faster inference.
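To see how this trade-off plays out on your own hardware, you could time a few models against the same input file. Below is a minimal sketch, not a rigorous benchmark: time_models and run_audio_separator are hypothetical helper names, and "song.wav" is a placeholder for your own file. Keep in mind that the first run of a model may include downloading it, so run each model once beforehand if you want to measure inference only.

```python
import time
from typing import Callable, Iterable


def time_models(models: Iterable[str], run: Callable[[str], object]) -> dict:
    """Call run(model) once per model and return wall-clock seconds, fastest first."""
    timings = {}
    for model in models:
        start = time.perf_counter()
        run(model)
        timings[model] = time.perf_counter() - start
    # Dicts preserve insertion order, so this yields a fastest-to-slowest mapping.
    return dict(sorted(timings.items(), key=lambda item: item[1]))


def run_audio_separator(model_filename: str, audio_path: str = "song.wav"):
    """Separate one file with one model; 'song.wav' is a placeholder path."""
    # Imported lazily so time_models can be tried without the package installed.
    from audio_separator.separator import Separator

    separator = Separator()
    separator.load_model(model_filename=model_filename)
    return separator.separate(audio_path)


if __name__ == "__main__":
    candidates = [
        "model_bs_roformer_ep_317_sdr_12.9755.ckpt",
        "MDX23C-8KFFT-InstVoc_HQ_2.ckpt",
        "2_HP-UVR.pth",
    ]
    for model, seconds in time_models(candidates, run_audio_separator).items():
        print(f"{seconds:8.1f}s  {model}")
```

Note that this timing includes model loading as well as inference, which is usually what matters if you process one song at a time.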
Hope this helps!
If you disagree with anything I've written here or have better advice to offer folks, please comment below and tag me!