"Which is the best model?" / "Which model should I use?" - answer from the owner of this repo #133

beveradb · 2024-10-13T20:41:04Z

beveradb
Oct 13, 2024
Maintainer

"Which is the best model for X?"

This is a difficult question to answer, and I'm a little tired of repeating myself as this gets asked a lot, so I'm making this discussion thread as a way for me to share and update my public opinion on this whenever it changes.
I'd also like it to be a place for people to learn, share their experiences and update the community 😄
These are my current recommendations as of October 2024; I'll try to update this whenever we gain support for new models.

In short: there is no single model which is always the best option! It depends on a variety of inputs.

Below, I will explain each of these decision making inputs separately, so you can make an informed decision for yourself.

What you are trying to achieve

This may seem obvious, but people are using audio-separator for a wide variety of purposes.

Let's look at a few desired outputs, and I'll give examples of models to try.

Input audio: mixed music with vocals (e.g. commercial music heard on the radio).

Goal: mixed instrumental music without any vocals

This is a common use case, and there are many models to choose from which serve this purpose.
Which one provides the best results for you will depend on the other inputs below (input audio, performance/hardware requirements).

For this purpose, I currently recommend one of these models:

model_bs_roformer_ep_317_sdr_12.9755.ckpt
MDX23C-8KFFT-InstVoc_HQ_2.ckpt
UVR-MDX-NET-Inst_HQ_4.onnx
2_HP-UVR.pth

Goal: mixed vocals

This may seem the same as the previous use case as most of the two-stem instrumental models also output a vocals stem, but it is not.
It is definitely still worth trying the instrumental models above, especially the latest supported architecture (Roformer) e.g. model_bs_roformer_ep_317_sdr_12.9755.ckpt.

However, several models have been trained specifically to focus on outputting clear vocals, so if your focus is on vocals, these are worth trying:

Kim_Vocal_2.onnx
kuielab_a_vocals.onnx
4_HP-Vocal-UVR.pth

Goal: mixed instrumental music with backing vocals (BV)

This is my primary use case, as I make karaoke videos daily and I enjoy singing karaoke with backing vocals retained.

For this purpose, I currently recommend one of these models:

mel_band_roformer_karaoke_aufr33_viperx_sdr_10.1956.ckpt (full spectrum, very crisp, but frequently retains too much of the lead vocals as well)
UVR_MDXNET_KARA_2.onnx (great at retaining only backing vocals, but has a frequency cap so output sounds a little muddy)
6_HP-Karaoke-UVR.pth (I haven't used this much, but a couple of quick tests gave pretty good results)

Another option, which will often provide better results, is a two-step approach.
First, separate the mixed vocals from the instrumental (e.g. using model_bs_roformer_ep_317_sdr_12.9755.ckpt or Kim_Vocal_2.onnx). Then separate the backing vocals from the lead vocals using a karaoke / backing vocals model: either one of the karaoke models above, or one specifically designed for that purpose, such as:

UVR-BVE-4B_SN-44100-1.pth
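
The two-step approach above could look roughly like this with audio-separator's Python API (`Separator`, `load_model`, `separate`). Treat it as a sketch under assumptions rather than a tested recipe: `pick_stem` and `separate_backing_vocals` are hypothetical helper names, and I'm assuming output filenames contain the stem name in parentheses (e.g. `(Vocals)`) and may be returned relative to the output directory, so check what your version actually writes.

```python
import os


def pick_stem(output_files, stem):
    """Return the first output file whose name contains '(<stem>)'.

    Assumption: output filenames include the stem name in parentheses,
    e.g. 'song_(Vocals)_model.flac'. Verify against your own outputs.
    """
    marker = f"({stem})".lower()
    for path in output_files:
        if marker in os.path.basename(path).lower():
            return path
    raise ValueError(f"no output file found for stem {stem!r}")


def separate_backing_vocals(input_path, output_dir="output"):
    """Step 1: split vocals from instrumental. Step 2: split the vocals stem."""
    # Imported lazily so pick_stem can be used without the package installed.
    from audio_separator.separator import Separator

    separator = Separator(output_dir=output_dir)

    # Step 1: mixed vocals vs. instrumental.
    separator.load_model(model_filename="model_bs_roformer_ep_317_sdr_12.9755.ckpt")
    first_pass = separator.separate(input_path)
    vocals = pick_stem(first_pass, "Vocals")
    # Assumption: returned names may be relative to output_dir.
    if not os.path.exists(vocals):
        vocals = os.path.join(output_dir, vocals)

    # Step 2: lead vs. backing vocals, run on the vocals stem only.
    separator.load_model(model_filename="UVR-BVE-4B_SN-44100-1.pth")
    return separator.separate(vocals)
```

The instrumental from step 1 can then be mixed with the backing vocals from step 2 in whatever audio tool you prefer.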

Goal: individual instrument stems

This is useful for musicians looking to sample specific parts of other music, folks practicing a particular part (e.g. a bass riff or guitar solo), or just curious folks looking to learn and experience music in another way.

For this purpose, I currently recommend one of these models:

htdemucs_6s.yaml (this produces 6 stems: Drums, Bass, Guitar, Piano, Other, Vocals)
kuielab_a_bass.onnx
kuielab_a_drums.onnx
kuielab_a_other.onnx

Goal: remove background noise from a recording

Noise removal is not an area I have much experience in, but there are some models designed for this purpose.
I've tested a few and had some interesting results, but your experience may vary depending on the kind of noise etc.

denoise_mel_band_roformer_aufr33_sdr_27.9959.ckpt
denoise_mel_band_roformer_aufr33_aggr_sdr_27.9768.ckpt
UVR-DeNoise.pth
UVR-DeNoise-Lite.pth

Goal: remove reverb and/or echo from a recording

Reverb / echo removal is not an area I have much experience in, but there are some models designed for this purpose.
I've tested a few and had some impressive results, but your experience may vary depending on the input audio.

deverb_bs_roformer_8_384dim_10depth.ckpt
Reverb_HQ_By_FoxJoy.onnx
UVR-DeEcho-DeReverb.pth

These are specifically labeled as removing echo; whether they also remove reverb or if there's a lot of overlap is unclear to me:

UVR-De-Echo-Normal.pth
UVR-De-Echo-Aggressive.pth

What input audio you are using

Different styles of input audio will get better results with different models, and it's very hard for me to give any consistent rule to follow here.

Just try a few different models (ideally from a few different architectures e.g. Roformer, MDXC, VR, MDX, Demucs) and see what sounds best to your ears!

In general though, for the last few months, the Roformer model model_bs_roformer_ep_317_sdr_12.9755.ckpt has been my go-to for a clean, full-spectrum separation for most input audio.

I sometimes get better results from MDX23C-8KFFT-InstVoc_HQ_2.ckpt, and in some niche songs e.g. much older audio I think the VR architecture models e.g. 2_HP-UVR.pth actually deliver better results.
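
If you want to compare a few models on the same file, something like this could generate one command per model. It's only a sketch: `build_commands` and `as_shell` are hypothetical helper names, and the only CLI flag used is `-m`, the same one used in the timing examples in this post.

```python
import shlex

# One candidate per architecture family mentioned in this post.
CANDIDATE_MODELS = [
    "model_bs_roformer_ep_317_sdr_12.9755.ckpt",  # Roformer
    "MDX23C-8KFFT-InstVoc_HQ_2.ckpt",             # MDX23C
    "UVR-MDX-NET-Inst_HQ_4.onnx",                 # MDX-Net
    "2_HP-UVR.pth",                               # VR
]


def build_commands(input_path, models=CANDIDATE_MODELS):
    """Return one argument list per model, ready for subprocess.run."""
    return [["audio-separator", "-m", model, input_path] for model in models]


def as_shell(command):
    """Render an argument list as a copy-pasteable shell command."""
    return shlex.join(command)


if __name__ == "__main__":
    for command in build_commands("test.flac"):
        print(as_shell(command))
```

Each argument list can be passed to `subprocess.run(command, check=True)`; after that, the comparison is down to your ears.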

Your speed vs. quality vs. hardware trade-off

Everyone wants separation to be fast. But fast is subjective, and processing speed depends hugely on your environment, hardware, and model choice.

If you've chosen a model which delivers great quality results to your ears, but it takes too long to process for your use case, you might be able to "throw money at it" by running the inference on a machine with a powerful Nvidia GPU with CUDA.
That's what I do for one of my karaoke use cases - audio-separator runs on a Runpod serverless endpoint with an A100 GPU.

If you're just doing instrumental/vocals separation, you also have a lot of models to choose from across 5 different architectures, and they differ dramatically in performance.

For example, on my current laptop (a 2023 MacBook Pro with M3 Max CPU/GPU), instrumental/vocals separation of the same 3-minute pop song takes wildly different amounts of time depending on the model:

audio-separator -d -m 2_HP-UVR.pth test.flac: Separation duration: 00:00:19
audio-separator -d -m UVR-MDX-NET-Inst_HQ_4.onnx test.flac: Separation duration: 00:00:36
audio-separator -d -m model_bs_roformer_ep_317_sdr_12.9755.ckpt test.flac: Separation duration: 00:01:49
audio-separator -d -m MDX23C-8KFFT-InstVoc_HQ_2.ckpt test.flac: Separation duration: 00:02:37

For my purpose (making karaoke videos), the better quality separation offered by the Roformer and MDX23C models is worth the extra inference time - but for another person or use case, the 2_HP-UVR model might be the better choice, as it provides "good enough" separation with much faster inference.
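
To put the timings above in relative terms, here is a small sketch that converts the reported durations to seconds and compares each model with the fastest one. The helper names are made up for illustration; the durations are the ones from my laptop listed above, so your numbers will differ.

```python
def duration_to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Durations copied from the "Separation duration" lines above.
TIMINGS = {
    "2_HP-UVR.pth": "00:00:19",
    "UVR-MDX-NET-Inst_HQ_4.onnx": "00:00:36",
    "model_bs_roformer_ep_317_sdr_12.9755.ckpt": "00:01:49",
    "MDX23C-8KFFT-InstVoc_HQ_2.ckpt": "00:02:37",
}


def slowdown_vs_fastest(timings):
    """Return each model's duration as a multiple of the fastest model's."""
    seconds = {model: duration_to_seconds(t) for model, t in timings.items()}
    fastest = min(seconds.values())
    return {model: round(s / fastest, 1) for model, s in seconds.items()}


if __name__ == "__main__":
    for model, factor in slowdown_vs_fastest(TIMINGS).items():
        print(f"{factor:>4}x  {model}")
```

On these numbers, the MDX-Net model takes about 1.9x as long as the VR model, the Roformer model about 5.7x, and the MDX23C model about 8.3x.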

Hope this helps!

If you disagree with anything I've written here or have better advice to offer folks, please comment below and tag me!

OneSeven · 2024-10-17T08:48:38Z

OneSeven
Oct 17, 2024

@beveradb
The UVR-MDX-NET-Inst_HQ_4.onnx model can also output great vocals through the UVR GUI, but not through audio-separator. There is a huge difference between the two.

beveradb Oct 17, 2024
Maintainer Author

I understand that, but I don't have enough free (unpaid, hobby) time or motivation to investigate and figure out why, especially since other models do a better job for the use cases above.

That's why I suggested you could investigate it yourself (in the issue you closed).

If you or another engineer would like to investigate why there is some (reduced-volume) instrumental audio left over in the vocal stem output by models such as UVR-MDX-NET-Inst_HQ_3.onnx or UVR-MDX-NET-Inst_HQ_4.onnx, I would encourage you to open a specific issue about that, and I am more than happy to schedule a Zoom/Google Meet call to talk through the code, help you get up to speed and pair program on it :)

beveradb · 2024-12-18T05:59:24Z

beveradb
Dec 18, 2024
Maintainer Author

Update: in the latest version of audio-separator (version 0.28.1 or newer) there is now more information about supported models provided by the -l CLI parameter.

It's documented here but basically, you can now filter and sort by SDR score for different stems.

e.g. if you want to find the best vocal models:

audio-separator -l --list_filter=vocals --list_limit=5
------------------------------------------------------------------------------------------------------------------------------------------------------------
Model Filename                             Arch  Output Stems (SDR)                   Friendly Name
------------------------------------------------------------------------------------------------------------------------------------------------------------
model_bs_roformer_ep_317_sdr_12.9755.ckpt  MDXC  vocals* (12.9), instrumental (17.0)  Roformer Model: BS-Roformer-Viperx-1297
model_bs_roformer_ep_368_sdr_12.9628.ckpt  MDXC  vocals* (12.9), instrumental (17.0)  Roformer Model: BS-Roformer-Viperx-1296
vocals_mel_band_roformer.ckpt              MDXC  vocals* (12.6), other                Roformer Model: MelBand Roformer | Vocals by Kimberley Jensen
melband_roformer_big_beta4.ckpt            MDXC  vocals* (12.5), other                Roformer Model: MelBand Roformer Kim | Big Beta 4 FT by unwa
mel_band_roformer_kim_ft_unwa.ckpt         MDXC  vocals* (12.4), other                Roformer Model: MelBand Roformer Kim | FT by unwa

or the best instrumental models:

audio-separator -l --list_filter=instrumental --list_limit=5
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Model Filename                                            Arch  Output Stems (SDR)                   Friendly Name
----------------------------------------------------------------------------------------------------------------------------------------------------------------
model_bs_roformer_ep_317_sdr_12.9755.ckpt                 MDXC  vocals* (12.9), instrumental (17.0)  Roformer Model: BS-Roformer-Viperx-1297
model_bs_roformer_ep_368_sdr_12.9628.ckpt                 MDXC  vocals* (12.9), instrumental (17.0)  Roformer Model: BS-Roformer-Viperx-1296
MDX23C-8KFFT-InstVoc_HQ_2.ckpt                            MDXC  vocals (12.2), instrumental (16.3)   MDX23C Model VIP: MDX23C-InstVoc HQ 2
MDX23C-8KFFT-InstVoc_HQ.ckpt                              MDXC  vocals (12.2), instrumental (16.2)   MDX23C Model: MDX23C-InstVoc HQ
mel_band_roformer_karaoke_aufr33_viperx_sdr_10.1956.ckpt  MDXC  vocals* (9.0), instrumental (16.0)   Roformer Model: Mel-Roformer-Karaoke-Aufr33-Viperx

or find out what options you have to isolate a specific stem, e.g. drums or bass:

audio-separator -l --list_filter=drums
-----------------------------------------------------------------------------------------------------------------------------------
Model Filename        Arch    Output Stems (SDR)                                            Friendly Name
-----------------------------------------------------------------------------------------------------------------------------------
htdemucs_ft.yaml      Demucs  vocals (10.8), drums (10.1), bass (11.9), other               Demucs v4: htdemucs_ft
hdemucs_mmi.yaml      Demucs  vocals (10.3), drums (9.7), bass (12.0), other                Demucs v4: hdemucs_mmi
htdemucs.yaml         Demucs  vocals (10.0), drums (9.4), bass (11.3), other                Demucs v4: htdemucs
htdemucs_6s.yaml      Demucs  vocals (9.7), drums (8.5), bass (10.0), guitar, piano, other  Demucs v4: htdemucs_6s
kuielab_a_drums.onnx  MDX     drums* (7.7), no drums                                        MDX-Net Model: kuielab_a_drums
kuielab_b_drums.onnx  MDX     drums* (7.1), no drums                                        MDX-Net Model: kuielab_b_drums

The unfiltered list (audio-separator -l) lists all stems provided by all models, and those with an asterisk are the "target stem" for that model. You can use this to find more niche stems which can be separated by specific models, e.g.:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Model Filename                                                      Arch    Output Stems (SDR)                                            Friendly Name
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
17_HP-Wind_Inst-UVR.pth                                             VR      no woodwinds*, woodwinds                                      VR Arch Single Model v5: 17_HP-Wind_Inst-UVR
UVR-De-Echo-Aggressive.pth                                          VR      no echo*, echo                                                VR Arch Single Model v5: UVR-De-Echo-Normal by FoxJoy
UVR-DeEcho-DeReverb.pth                                             VR      no reverb*, reverb                                            VR Arch Single Model v5: UVR-DeEcho-DeReverb by FoxJoy
UVR-DeNoise-Lite.pth                                                VR      noise*, no noise                                              VR Arch Single Model v5: UVR-DeNoise-Lite by FoxJoy
UVR-DeNoise.pth                                                     VR      noise*, no noise                                              VR Arch Single Model v5: UVR-DeNoise by FoxJoy
UVR-De-Reverb-aufr33-jarredou.pth                                   VR      dry*, no dry                                                  VR Arch Single Model v4: UVR-De-Reverb by aufr33-jarredou
Reverb_HQ_By_FoxJoy.onnx                                            MDX     reverb*, no reverb                                            MDX-Net Model: Reverb HQ By FoxJoy
UVR-MDX-NET_Crowd_HQ_1.onnx                                         MDX     no crowd*, crowd                                              MDX-Net Model: UVR-MDX-NET Crowd HQ 1 By Aufr33
kuielab_b_other.onnx                                                MDX     other*, no other                                              MDX-Net Model: kuielab_b_other
MDX23C-De-Reverb-aufr33-jarredou.ckpt                               MDXC    dry, no dry                                                   Roformer Model: Mel-Roformer-Karaoke-Aufr33-Viperx
denoise_mel_band_roformer_aufr33_sdr_27.9959.ckpt                   MDXC    dry*, other                                                   Roformer Model: Mel-Roformer-Denoise-Aufr33-Aggr
mel_band_roformer_crowd_aufr33_viperx_sdr_8.7144.ckpt               MDXC    crowd*, other                                                 Roformer Model: Mel-Roformer-Crowd-Aufr33-Viperx
deverb_bs_roformer_8_384dim_10depth.ckpt                            MDXC    noreverb*, reverb                                             Roformer Model: MelBand Roformer Kim | Inst V1 (E) by Unwa
dereverb_mel_band_roformer_anvuew_sdr_19.1729.ckpt                  MDXC    noreverb*, reverb                                             Roformer Model: MelBand Roformer | De-Reverb-Echo by Sucial
dereverb-echo_mel_band_roformer_sdr_13.4843_v2.ckpt                 MDXC    dry*, no dry                                                  Roformer Model: MelBand Roformer Kim | Big Beta 5e FT by unwa
model_chorus_bs_roformer_ep_267_sdr_24.1275.ckpt                    MDXC    male, female                                                  Roformer Model: BS Roformer | Chorus Male-Female by Sucial
aspiration_mel_band_roformer_sdr_18.9845.ckpt                       MDXC    aspiration, other                                             Roformer Model: MelBand Roformer | Aspiration Less Aggressive by Sucial
mel_band_roformer_bleed_suppressor_v1.ckpt                          MDXC    instrumental*, bleed                                          Roformer Model: MelBand Roformer | Bleed Suppressor V1 by unwa-97chris

The SDR scores are calculated using the museval tool with the MUSDB18 dataset, using this script which I've allowed to run for numerous days, processing and evaluating every track for every model 🤯
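
For anyone unfamiliar with SDR (signal-to-distortion ratio): higher is better, and it is measured in decibels. The sketch below shows only the basic textbook definition, 10 * log10(reference energy / error energy), on plain lists of samples. It is not what museval runs; museval's BSS Eval implementation is more involved, so don't expect this to reproduce the scores in the tables above. `sdr_db` is a hypothetical name.

```python
import math


def sdr_db(reference, estimate):
    """Basic signal-to-distortion ratio in dB between two equal-length signals."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    if signal_energy == 0:
        return -math.inf  # silent reference, non-silent estimate
    return 10 * math.log10(signal_energy / error_energy)


if __name__ == "__main__":
    reference = [1.0, 0.0, -1.0, 0.0]
    estimate = [0.9, 0.0, -0.9, 0.0]  # every sample is 10% too quiet
    print(f"{sdr_db(reference, estimate):.1f} dB")
```

In this toy example the error energy is 1/100 of the reference energy, which works out to about 20 dB.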

Hope this helps! 🙇
