2023.08: MVSEP-MDX23, the strongest free and open-source vocal separation solution to date. Time to throw away your Spleeter.

Who this article is for:
Cover singers who love singing non-Chinese songs but struggle to find high-quality karaoke/instrumental versions
Fan creators who need clean dry vocal stems for derivative works
Hobbyists who transcribe music by ear
What this article will teach you:
How music & voice separation algorithms have evolved
How to follow the cutting edge of source separation and chase production-grade, near-perfect stems (some hands-on ability required)
How to deploy MVSEP-MDX23
Also, every piece of software covered here is a free and/or open-source academic result, out of respect for certain commercial products that fence this field off behind a paywall: I respect your work, but I will not pay for it.
How the field evolved (skip this if the academic background does not interest you)
In music, reverse-engineering a finished mix back into its parts has always been a hot topic. Because of the physics of how sound waves combine, recovering the sources without the project files is extremely hard. Over the years, source separation has gone through roughly these major stages:
Center channel extraction
The center channel extraction method was proposed by Vincent et al. in the 2007 paper "Harmonic and Inharmonic Nonnegative Matrix Factorization for Polyphonic Pitch Transcription". It extracts the center channel (the centered voice) and the remaining sources from the audio, and is based on Nonnegative Matrix Factorization (NMF), decomposing the signal into representations of several sources.
This is one of the classic vocal-removal tools built into Audition and other common audio editors. It rests on an interesting premise: the vast majority of songs record the vocal with a mono microphone and then spread it across the L and R channels. In most cases the vocal's waveform is therefore identical in both channels, and center channel extraction only acts on audio that is identical in L and R.
The problem is that far more than just the vocal is identical in both channels, and the method is very sensitive to stereo-imaging effects. These two issues mean that:
the extracted vocal carries a lot of other sounds, the remaining backing track sounds hollow in the middle, and mono instruments sometimes get removed together with the vocal;
it breaks down on material that uses panning effects.
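For intuition, here is a minimal sketch of the classic mid/side (phase-cancellation) trick this technique relies on; the file name is a placeholder and the script assumes a stereo WAV readable with the soundfile package:

```python
# Minimal sketch of classic center-channel (phase-cancellation) extraction.
# Assumes a stereo file; "song.wav" is a placeholder path.
import numpy as np
import soundfile as sf

audio, sr = sf.read("song.wav")       # shape: (num_samples, 2) for stereo
left, right = audio[:, 0], audio[:, 1]

side = (left - right) / 2             # cancels anything identical in L and R (often the lead vocal)
mid = (left + right) / 2              # keeps the centered content (vocal plus any centered instruments)

# "Instrumental": side signal duplicated to both channels -> the hollow-sounding backing track
sf.write("instrumental_guess.wav", np.stack([side, side], axis=1), sr)
# "Vocal": mid signal, still contaminated by bass/kick and other centered sources
sf.write("vocal_guess.wav", np.stack([mid, mid], axis=1), sr)
```

Anything panned off-center survives the subtraction, which is exactly why the result is never clean.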

Spleeter
Spleeter is an open-source deep-learning algorithm for audio source separation (music demixing) developed by the Deezer research team. In the paper "Spleeter: a fast and efficient music source separation tool with pre-trained models", the team describes it as a performance-oriented separation algorithm that ships with pre-trained models and works out of the box, which is one reason it took off.
The other reason is that Deezer, as a commercial company, moved quickly to partner with other products after publishing the work, bringing Spleeter into well-known professional audio software such as iZotope RX, SpectralLayers, Acoustica, VirtualDJ and NeuralMix, which greatly raised its profile.
As an early AI source separation algorithm, Spleeter was a qualitative leap over center channel extraction, and for the first time it let the general public run a 4-stem separation model on an ordinary computer. It is still used as the baseline in many source separation competitions.
We present and release a new tool for music source separation with pre-trained models called Spleeter. Spleeter was designed with ease of use, separation performance, and speed in mind. Spleeter is based on Tensorflow (Abadi, 2015) and makes it possible to:
split music audio files into several stems with a single command line using pre-trained models. A music audio file can be separated into 2 stems (vocals and accompaniments), 4 stems (vocals, drums, bass, and other) or 5 stems (vocals, drums, bass, piano and other).
train source separation models or fine-tune pre-trained ones with Tensorflow (provided you have a dataset of isolated sources).
The performance of the pre-trained models are very close to the published state-of-the-art and is one of the best performing 4 stems separation model on the common musdb18 benchmark (Rafii, Liutkus, Stöter, Mimilakis, & Bittner, 2017) to be publicly released. Spleeter is also very fast as it can separate a mix audio file into 4 stems 100 times faster than real-time (we note, though, that the model cannot be applied in real-time as it needs buffering) on a single Graphics Processing Unit (GPU) using the pre-trained 4-stems model.
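As a quick taste of the "out of the box" claim, Spleeter can be driven from Python through its documented Separator API; a minimal sketch, assuming `pip install spleeter` and a placeholder file name:

```python
# Minimal sketch of Spleeter's Python API using the pre-trained 4-stem model.
# Assumes `pip install spleeter`; "song.mp3" is a placeholder file name.
from spleeter.separator import Separator

separator = Separator("spleeter:4stems")          # vocals, drums, bass, other
separator.separate_to_file("song.mp3", "output")  # writes output/song/{vocals,drums,bass,other}.wav
```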

Demucs
Demucs comes from the Facebook Research team. It arrived after Spleeter and before MDX-Net, and has gone through four major versions, each with a substantially reworked architecture. Output quality jumped sharply from v3 onwards, at one point leading the field; v4 is currently the strongest open-source single model for instrument separation, and the v1/v2 networks are used as a component of MDX-Net.
The theory behind Demucs v1 & v2 comes from "Music Source Separation in the Waveform Domain":
Contrarily to many audio synthesis tasks where the best performances are achieved by models that directly generate the waveform, the state-of-the-art in source separation for music is to compute masks on the magnitude spectrum. In this paper, we compare two waveform domain architectures. We first adapt Conv-Tasnet, initially developed for speech source separation, to the task of music source separation. While Conv-Tasnet beats many existing spectrogram-domain methods, it suffers from significant artifacts, as shown by human evaluations. We propose instead Demucs, a novel waveform-to-waveform model, with a U-Net structure and bidirectional LSTM. Experiments on the MusDB dataset show that, with proper data augmentation, Demucs beats all existing state-of-the-art architectures, including Conv-Tasnet, with 6.3 SDR on average, (and up to 6.8 with 150 extra training songs, even surpassing the IRM oracle for the bass source). Using recent development in model quantization, Demucs can be compressed down to 120MB without any loss of accuracy. We also provide human evaluations, showing that Demucs benefit from a large advantage in terms of the naturalness of the audio. However, it suffers from some bleeding, especially between the vocals and other source.
The theory behind Demucs v3 comes from "Hybrid Spectrogram and Waveform Source Separation":
Source separation models either work on the spectrogram or waveform domain. In this work, we show how to perform end-to-end hybrid source separation, letting the model decide which domain is best suited for each source, and even combining both. The proposed hybrid version of the Demucs architecture (Défossez et al., 2019) won the Music Demixing Challenge 2021 organized by Sony.
This architecture also comes with additional improvements, such as compressed residual branches, local attention or singular value regularization. Overall, a 1.4 dB improvement of the Signal-To-Distortion (SDR) was observed across all sources as measured on the MusDB HQ dataset (Rafii et al., 2019), an improvement confirmed by human subjective evaluation, with an overall quality rated at 2.83 out of 5 (2.36 for the non hybrid Demucs), and absence of contamination at 3.04 (against 2.37 for the non hybrid Demucs and 2.44 for the second ranking model submitted at the competition).
The theory behind Demucs v4 comes from "Hybrid Transformers for Music Source Separation":
A natural question arising in Music Source Separation (MSS) is whether long range contextual information is useful, or whether local acoustic features are sufficient. In other fields, attention based Transformers have shown their ability to integrate information over long sequences. In this work, we introduce Hybrid Transformer Demucs (HT Demucs), an hybrid temporal/spectral bi-U-Net based on Hybrid Demucs, where the innermost layers are replaced by a cross-domain Transformer Encoder, using self-attention within one domain, and cross-attention across domains. While it performs poorly when trained only on MUSDB [3], we show that it outperforms Hybrid Demucs (trained on the same data) by 0.45 dB of SDR when using 800 extra training songs. Using sparse attention kernels to extend its receptive field, and per source fine-tuning, we achieve state-of-the-art results on MUSDB with extra training data, with 9.20 dB of SDR.
Right now, Demucs v4 is the best single model for instrument-oriented separation, but because it uses a large Transformer it is noticeably slow at inference, its VRAM usage is very high, and in practice a GPU is required.
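If you want to try htdemucs_ft outside of UVR, the demucs package can be called from Python directly; a minimal sketch, assuming `pip install demucs` and a placeholder file name:

```python
# Minimal sketch: running Demucs v4 (htdemucs_ft) from Python.
# Assumes `pip install demucs`; "song.mp3" is a placeholder file name.
import demucs.separate

demucs.separate.main([
    "-n", "htdemucs_ft",      # the fine-tuned Hybrid Transformer Demucs model
    "--two-stems", "vocals",  # vocals vs. everything else; drop this flag for the full 4-stem split
    "song.mp3",
])
# Stems are written under ./separated/htdemucs_ft/song/
```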


MDX-net
Several state-of-the-art approaches have shown that stacking many layers with many skip connections improves SDR. These deep, complex architectures perform well but usually need a lot of compute and time to train and evaluate. Minseok Kim et al. therefore proposed KUIELab-MDX-Net, a two-stream music demixing network that strikes a good balance between performance and required resources. The model has a time-frequency branch and a time-domain branch, each separating the stems on its own, and the results of the two streams are blended to produce the final estimate. KUIELab-MDX-Net took second place on leaderboard A and third place on leaderboard B of the Music Demixing Challenge at ISMIR 2021.
Whereas Spleeter only processes the frequency domain, MDX-Net works in both the time and frequency domains, and its elaborate U-Net design pushed separation quality up another big step.
It is currently the most popular AI source separation architecture in the research community, and the open-source scene has produced a large number of high-quality pre-trained models with different specializations. To this day, pre-trained models such as MDX-NET-Voc_FT and Kim Vocal 2 still sit near the top of the MVSEP leaderboard. The Kim Vocal series can even be combined with other models to separate the lead vocal from the harmonies, and some models can pull the dry vocal out of a reverberant one.



DLC: "團(tuán)子AI"
A well-known paid commercial product. Its stance is roughly: "your SDR metric runs counter to what human ears perceive, so I'll do my own thing instead of chasing leaderboard scores; my latest model has the best perceived quality in the industry". But it charges money, publishes no papers, releases no source code, and only describes its methods in very broad strokes on its blog.
Here is the Audio Separation Discord community's assessment of 團(tuán)子AI:
The combination of 3 different aggression settings (mostly the most aggressive in busy mix parts) gives the best results for Childish Gambino - Algorithm vs our top ensemble settings so far. But it's still far from ideal (and [not only] the most aggressive one makes instruments very muffled [but vocals are better cancelled too], although our separation makes it even a bit worse in more busy mix fragment).
As for drums - better than GSEP, worse than Demucs 4 ft 32, although a bit better hihat. Not too easy track and already shows some differences between just GSEP and Demucs when the latter has more muffled hi-hats, but better snare, and it rather happens a lot of times
Also, it automatically picks the first fragment for preview when vocal appears, so it is difficult to write something like AS Tool for that (probably manipulations by manual mixing of fake vocals would be needed).
Very promising results, not gonna lie.
They wrote once somewhere about limited previews for stem mode (for more than 2 mode) and free credits, but haven't encountered it yet.
They're accused by aufr33 to use some of UVR models for 2 stems without crediting the source (and taking money for that).
Mature solutions
Now that we have covered the history, we know that the best-performing models today are Demucs v4 and the MDX-Net derivatives, and that they favor different stems. It depends on your needs: in most cases, if all you want is the vocal, MDX-NET Voc FT works best; if you need the instruments separated (for transcription work, for example), look at Demucs v4's htdemucs_ft model instead.
I will not go over how to use UVR here; there are plenty of tutorials for that elsewhere.
Here is where a few of the popular models come from, so that players who have just opened UVR and feel lost can quickly pick a suitable one:
Demucs V4
htdemucs: first version of Hybrid Transformer Demucs. Trained on MusDB + 800 songs. Default model.
htdemucs_ft: fine-tuned version of htdemucs, separation will take 4 times more time but might be a bit better. Same training set as htdemucs.
htdemucs_6s: 6 sources version of htdemucs, with piano and guitar being added as sources. Note that the piano source is not working great at the moment.
hdemucs_mmi: Hybrid Demucs v3, retrained on MusDB + 800 songs.
mdx: trained only on MusDB HQ, winning model on track A at the MDX challenge.
mdx_extra: trained with extra training data (including MusDB test set), ranked 2nd on the track B of the MDX challenge.
mdx_q, mdx_extra_q: quantized version of the previous models. Smaller download and storage but quality can be slightly worse.
Demucs V3
mdx: trained only on MusDB HQ, winning model on track A at the MDX challenge.
mdx_extra: trained with extra training data (including MusDB test set), ranked 2nd on the track B of the MDX challenge.
mdx_q, mdx_extra_q: quantized version of the previous models. Smaller download and storage but quality can be slightly worse. mdx_extra_q is the default model used.
MDX-Net
What you need to know about the MDX-UVR models is that they come in instrumental (Inst) and vocal models. Instrumental models always leave some instrumental residue in the vocal, and vice versa: vocal models are more likely to leave vocal residue in the instrumental. On some songs it can still pay to break this rule; it depends on the track. In general, if vocal residue is what you are fighting, an instrumental model should give a better instrumental. Also, MDX-UVR models sometimes pick up MIDI-like sound effects that cannot be recovered.
Kim Vocal 1 & 2: MDX competition models released by Kimberley Jensen, fine-tuned for vocals; they took 3rd place in the MDX'23 track of the Sound Demixing Challenge 2023. Narrow-band models; for production use they are only suitable for vocals.
Kim Inst: likewise released by Kimberley Jensen, aimed at instrumentals. Compared with inst3/464 it gives cleaner results and better SDR, but sometimes also more noise. It is a cutoff model that removes everything above 17.7 kHz, so it is only useful for chasing leaderboard scores, not for production.

Inst HQ 3: a full-band instrumental separation model whose level of detail is currently top-tier, but it has trouble with strings. mdx_extra from Demucs 3/4 does better there, and the 6s model can sometimes compensate for those lost instruments in an ensemble, although HQ3 still adds extra detail compared with them.
HQ 3 also struggles with some wind instruments; it handles flute and trumpet worse than other models.
VOC FT: the best single vocal separation model so far, near the top of the MVSEP leaderboard, but narrow-band.
Inst HQ_1 (450) / HQ_2 (498) (full band): high-quality models usable in most cases. The latter has slightly better SDR and possibly a little less vocal residue. Not as little as inst3 or Kim ft, but a good starting point.
Inst 3: narrow-band model; results are a bit muddier but in some cases also a bit more balanced.
Inst Main: leaves more vocal residue than Inst 3.
The models with "Karaoke" in the name: vocal models that remove only the lead vocal and keep the harmonies; the best one at the moment is UVR-MDX-NET Karaoke 2.
UVR-MDX-NET 1, UVR-MDX-NET 2, UVR-MDX-NET 3: models trained by the UVR team for vocal separation; model 1 scored an SDR of 9.703, while 2 and 3 are reduced-parameter versions. All three have a 14.7 kHz cutoff.
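These MDX-UVR checkpoints are ONNX files, so if you are curious what a given model actually expects, you can inspect it with onnxruntime before wiring it into anything; a small sketch (the file name is a placeholder for whichever model you downloaded through UVR, and input names/shapes differ per model):

```python
# Sketch: peeking at a downloaded UVR MDX-Net ONNX checkpoint with onnxruntime.
# "UVR-MDX-NET-Voc_FT.onnx" is a placeholder path.
import onnxruntime as ort

session = ort.InferenceSession("UVR-MDX-NET-Voc_FT.onnx", providers=["CPUExecutionProvider"])
for tensor in session.get_inputs():
    print("input :", tensor.name, tensor.shape, tensor.type)
for tensor in session.get_outputs():
    print("output:", tensor.name, tensor.shape, tensor.type)
```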
Where is the cutting edge?
The answer is the Music Demixing (MDX) Challenge, an annual source separation competition led by Sony. On its leaderboard you can see the most advanced algorithms in the field and their authors.

The 2023 MDX Challenge has now finished Round C. First place went to ByteDance's SAMI (model not released), with a score far ahead of everyone else. Second place, ZFTurbo, is the author of the MVSEP-MDX23 covered in this article. Third place is Kimberley Jensen, whose models were introduced above.

So where do you find algorithms you can actually put to use? A huge number of people, on first touching UVR5, ask which model is the best. I have gone through most of the UVR tutorials on the Chinese internet, including all kinds of Bilibili videos and columns, and they all just hand you a screenful of settings and tell you to copy them, without any real benchmark data showing that their algorithms and parameters beat anyone else's; and for most users' hardware, verifying each combination yourself is very time-consuming.
The same question comes up in overseas communities, and the UVR5 developers have actually answered it in an issue:
The answer to the most asked question: What is the model which provides the best results? [Read this, very important info inside!] #344
Hello everyone.
I would like to address a question I've repeatedly seen published both in this forum and other ones as well. Given the amount of available modules which have now been integrated into UVR, obviously a lot of people are confused as which one may provide the best results. The question I see a lot, therefore, is the following:
"What is the best module which provides the best results? What setting should I use with it?" and its variations.
Before I give you the answer, let me introduce to the following website: mvsep.com -- This is a website where you can upload a song of your choice and utilize all of the Stem Separation AI modules currently available to have it processed. I encourage you to check it out, it's an amazing tool. Keep in mind that due to high traffic, it is likely you will have to wait in a queue for your songs to be processed.
The developers over at Mvsep launched a very interesting initiative months ago, called "Quality Checker". As I mentioned before, there are plenty of modules available and Mvsep thought about a method to establish which of them offers the best results. This is done by downloading a standard database and have a given module process it, then uploading the results onto their site.
The results and corresponding metrics are published on their website. You can check them here: [link] -- This is called the "Leaderboard".
So, back to the question: Which module provides the best results? Well, you guessed it... The answer is provided by the Leaderboard itself. As you can see, there is no single module which offers the best results, but rather it is recommended to use a combination of modules. UVR has a function integrated within it called "Ensemble", which does exactly that: It processes a given song by utilizing one or more modules of your choice.
Now, back to the Leaderboard. At the time I'm writing this, the following combination provides the highest results:
MDX-Net: kim vocal model fine tuned (old) + UVR-MDX-NET_Main_427 + Demucs: v4 | htdemucs_ft - Ensemble Algorithm: Avg/Avg - Shifts: 10 - Overlap: 0.25
You notice they have used three different modules here (Kim vocal, MDX Net Main 427, and the latest fine-tuned demucs v4). If you hover your mouse to the "?" in the page corresponding to the combo, it also provides you with the UVR settings which were used to create the combo.
So, there you have it. You should check the Leaderboard page often to see which combo is getting the highest score, and then simply replicate it with UVR. Keep in mind that modules are constantly modified and/or trained, so it is likely the Leaderboard will change quite often.
Furthermore, you can provide your own methodology (combo) and results by visiting the Quality Checker page like I wrote above, download the database, and apply your own chosen modules, then uploading the final results. I strongly encourage everyone to do so: the more tests, the more results.
As a final note, I want to thank @Anjok07 for his amazing job on UVR, which has now turned into a fantastic, and best tool at the world's disposal to create stems. Thanks a lot for all of your hard work!
2022.12.26
In short, MVSEP is the real-world leaderboard for source separation, and its scores keep updating; by the time I am publishing this article, the score set by MVSEP-MDX23 has already been edged out slightly by other ensemble algorithms.
The MVSEP leaderboard is split into Bass, Drums, Other, Vocals and Instrum; check the board for whichever stem you want better results on. In most cases Vocals and Instrum are all you need to watch.

As you can see, the Vocals board is still dominated by SAMI by a huge margin (the actual first place is just the original dataset uploaded directly to get a score), followed by MVSep-based ensemble models and the subject of this article. Every entry can be opened for details, and most submitters describe the parameters of their setup:

Ensembles are the current endgame
Looking at the leaderboard, apart from Demucs v4 you can barely find any single model in the first two pages. As the GitHub issue says, the way to get the best results today is to blend different models by weight, so apart from SAMI, every top entry is a multi-model ensemble: run the model best suited to each stem, sometimes several models on the same stem, and then mix the outputs by weight, as sketched below.
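A toy illustration of that weighted blending, not any particular leaderboard recipe; the estimates and weights are placeholders:

```python
# Toy sketch of the weighted-ensemble idea used by the top leaderboard entries:
# several models each estimate the same stem, and the estimates are averaged by weight.
import numpy as np

def ensemble(estimates: list[np.ndarray], weights: list[float]) -> np.ndarray:
    """Weighted average of per-model stem estimates, all shaped [samples, channels]."""
    w = np.asarray(weights, dtype=np.float64)
    stacked = np.stack(estimates, axis=0)              # [n_models, samples, channels]
    return np.tensordot(w / w.sum(), stacked, axes=1)  # back to [samples, channels]

# e.g. blend a Demucs-style estimate with an MDX-style estimate, trusting the latter more:
# vocals = ensemble([vocals_demucs, vocals_mdx], [1.0, 2.0])
```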
MVSEP-MDX23 is the model open-sourced by ZFTurbo, currently second place in this year's MDX Challenge; the GitHub repo is ZFTurbo/MVSEP-MDX23-music-separation-model. Its particular model-fusion approach makes it incompatible with UVR5, but the project provides a Colab link and a double-click Windows release:

Unfortunately, the MVSEP-MDX23 that entered the competition was optimized entirely for vocals; in a GitHub issue ZFTurbo acknowledged that although the model performs extremely well, it cuts the high frequencies of the input during processing and simply hands them back, untouched, to the instrumental.

To address this, jarredou/MVSEP-MDX23-Colab_v2 forked the project and improved it. The score set by MVSEP-MDX23-Colab fork v2.2 currently ranks 7th for Vocals and 6th for Instrumental on the MVSEP leaderboard, the best open-source result we can get right now. The MVSep Ensemble entries ranked above it all rely on the unreleased MDX23C model, which is paid content on the MVSep site: you need a paid membership to save the wav files:


Back to MVSEP-MDX23: if you want to run this algorithm locally for source separation, make sure you have at least 11 GB of VRAM, otherwise it will not run.
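A quick way to check whether your GPU clears that bar (assumes PyTorch is installed, which the project itself requires):

```python
# Check available VRAM before attempting a local run.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA GPU detected; use the Colab notebook instead.")
```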

If your setup is not enough for a local run, go straight to jarredou/MVSEP-MDX23-Colab_v2, open their Colab link (you may need a proxy), upload the files you want to separate to the instance's storage, and simply run Runtime - Run All:

You can copy the various weight parameters straight from MVSEP leaderboard entries, and lower the chunk size until it no longer errors out.

If your hardware is good enough to run it locally, deployment follows the standard GitHub three-step: clone, set up the environment, run.
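In notebook form (the project's own notebook shells out with "!"), the setup roughly looks like the cell below; the repo URL is the fork named above, and the requirements file name should be double-checked against its README:

```python
# Notebook-style setup cells; repo URL as on GitHub, requirements file name per the README.
!git clone https://github.com/jarredou/MVSEP-MDX23-Colab_v2
%cd MVSEP-MDX23-Colab_v2
!pip install -r requirements.txt
```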
Because a local environment is not quite the same as Colab, the Jupyter extension in VS Code cannot render Colab's interactive form annotations. That does not matter: just edit the values directly, and point input and output_folder at directories on your own machine.
If you copy a result from the MVSep leaderboard, the parameters would look roughly like the sketch below:
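A hedged sketch of the run cell: --input_audio and --output_folder are the flags documented in the upstream repo's README, while the fork's notebook wraps this behind an input-folder form plus extra knobs (per-model weights, chunk size, overlaps), whose exact names you should take from the notebook itself or from `python inference.py --help`:

```python
# Hypothetical run cell. Only --input_audio / --output_folder are taken from the upstream
# README; the local paths are placeholders, and any weight/chunk-size options should be
# copied verbatim from the fork's notebook or the leaderboard entry you are reproducing.
input_audio = "D:/separation/input/song.wav"
output_folder = "D:/separation/output"

!python inference.py --input_audio "{input_audio}" --output_folder "{output_folder}"
```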
Put your audio files in the input directory, hit run, and wait.
Because the notebook uses the "!" style to call the Python script through the command line, you will not see any output while it runs; everything pops out together once it finishes. Switching to %run would be easy, but for a ready-made bundle it is fine to run it this way.

As you can see, v2.2 currently blends nine models in total to produce its output: Demucs_ft, MDXv3 demo, UVR-MDX-VOC-FT, UVR-MDX-VOC-FT Fullband SRS, UVR-MDX-HQ3-Instr, htdemucs_ft, htdemucs, htdemucs_6s and htdemucs_mmi. The ultimate Frankenstein ensemble.
When it finishes, go look for the results in the output directory.

Below are the results of using MVSEP-MDX23-Colab_v2 to separate a Japanese rock track and an electronic track. The vocal separation is essentially bulletproof, unaffected by heavy mixing and effect chains. Thanks to the htdemucs models and the fork author's tuning, the drum and bass separation is also good enough for transcription and even semi-professional production.

To test vocal separation on heavily mixed material, I went looking at cover songs by various singers on Bilibili. Most of them drown their covers in very heavy effects to mask potential flaws, especially on the recently popular "Golden Hour"; some of those mixes gave me a headache.
To raise the difficulty, I picked @早稻嘰's cover of the god-tier classic "海色" (Umiiro), a very intense, dense J-Rock song. When separating this kind of track, full-range electric guitars interfere with the vocal at every moment, and the long effect tails on the cover vocal make it even harder.

The result: MVSEP-MDX23 separated this cover almost perfectly. And after listening to the extracted backing track, the instrumental she used has an obviously hollow, scooped character, so it was probably extracted with older techniques, or a heavy smiley-face EQ was applied afterwards.
After some patient searching I found the official Instrumental Version bundled with the release and made this comparison video. Her cover is two semitones above the original key. Impressive!
For cover singers, it is worth chasing the best possible backing-track quality when producing a release. AI used to fall short of perfect, but in 2023, if a good tool (algorithm) exists, I think it should be used.
Of course, as I said in the video: before reaching for any technical tricks, always check whether the released EP comes with an instrumental version. No matter how advanced the technology gets, it will never beat an export from the original project.