[Speech Recognition] Chinese/English language identification with combined MFCC and LPC features and an SVM (support vector machine): MATLAB source code

2021-08-21 00:16 Author: Matlab工程師

I. Introduction

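MFCC (Mel-frequency cepstral coefficients): the mel frequency scale was proposed on the basis of human auditory perception, and it has a nonlinear relationship with frequency in Hz. MFCCs are spectral features computed from the Hz spectrum by exploiting this relationship. They are used mainly to extract features from speech data and to reduce dimensionality. For example, a frame of 512 samples (512 dimensions) can be reduced by MFCC extraction to its roughly 40 most important dimensions (a typical figure).
MFCC extraction usually goes through the following steps: pre-emphasis, framing, windowing, fast Fourier transform (FFT), mel filterbank, and discrete cosine transform (DCT). The FFT and the mel filterbank matter most, because they perform most of the dimensionality reduction.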
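1. Pre-emphasis
The sampled digital speech signal s(n) is passed through a high-pass filter
$$H(z) = 1 - a z^{-1},$$
where a is usually about 0.95. The pre-emphasized signal is
$$y(n) = s(n) - a\,s(n-1).$$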

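Pre-emphasis boosts the high-frequency part of the signal so that the spectrum becomes flatter and can be computed with the same signal-to-noise ratio across the whole band from low to high frequencies. It also compensates for the high-frequency attenuation introduced by the vocal cords and lips during speech production, and it emphasizes the high-frequency formants.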
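As a rough illustration of this effect, the sketch below applies the first-order difference y(n) = s(n) - a*s(n-1) with MATLAB's built-in `filter` function. The constant test signal is a made-up stand-in for a purely low-frequency component, not data from this project, and s(-1) is assumed to be 0.

```matlab
% Pre-emphasis sketch: y(n) = s(n) - a*s(n-1), with s(-1) taken as 0.
a = 0.95;                      % coefficient quoted in the text (the script below uses 0.97)
s = ones(5, 1);                % hypothetical constant (lowest-frequency) signal
y = filter([1 -a], 1, s);      % FIR high-pass: numerator [1 -a], denominator 1
disp(y');                      % first sample passes through, the rest shrink to 1 - a
```

A constant input is attenuated to 1 - a = 0.05 after the first sample, which is the low-frequency suppression that makes the high-frequency part relatively stronger.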

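2. Framing
To make speech easier to analyze, the signal is cut into short segments called frames, each grouping N samples into one observation unit. N is usually 256 or 512, covering roughly 20 to 30 ms. To avoid abrupt changes between neighbouring frames, adjacent frames overlap by M samples, where M is usually about 1/2 or 1/3 of N. Speech recognition typically uses a sampling rate of 8 kHz or 16 kHz; at 8 kHz, a 256-sample frame lasts 256/8000 × 1000 = 32 ms.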
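The framing arithmetic can be checked with a small sketch. The random signal, the choice M = N/2, and the variable names are assumptions made for illustration only.

```matlab
% Framing sketch: split a signal into overlapping frames of N samples.
fs  = 8000;                 % sampling rate (Hz)
N   = 256;                  % frame length in samples
M   = N/2;                  % overlap in samples (assumed: half a frame)
hop = N - M;                % shift between the starts of adjacent frames
x   = randn(fs, 1);         % hypothetical 1-second signal

numFrames = floor((length(x) - N) / hop) + 1;   % only complete frames are kept
frames = zeros(N, numFrames);
for i = 1:numFrames
    first = (i-1)*hop + 1;
    frames(:, i) = x(first:first+N-1);          % one frame per column
end

frameMs = N / fs * 1000;    % frame duration in milliseconds
```

With these values each frame lasts 32 ms, and the last M samples of one frame are the first M samples of the next.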

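3. Windowing
Speech changes continually over long spans and has no fixed characteristics that could be analyzed as a whole, so each frame is multiplied by a window function whose value outside the window is zero. This reduces the discontinuities that framing can introduce at the two ends of each frame. Common windows are the rectangular, Hamming, and Hann windows; the Hamming window is the usual choice because of its frequency-domain characteristics.

Each frame is multiplied by a Hamming window to improve the continuity at its left and right ends. Let the framed signal be S(n), n = 0, 1, …, N-1, where N is the frame size. The windowed signal is S'(n) = S(n) × W(n), where
$$W(n, a) = (1 - a) - a \cos\left(\frac{2\pi n}{N - 1}\right), \quad 0 \le n \le N - 1.$$
Different values of a give different Hamming windows; a is usually 0.46.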
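A minimal sketch of the window, assuming the usual form W(n) = (1 - a) - a*cos(2*pi*n/(N-1)) with a = 0.46, which is also what the `hamming` handle in the source code below computes. The all-ones frame is a hypothetical input chosen so that the window shape is visible directly.

```matlab
% Hamming window sketch, written out explicitly (no toolbox function needed).
N = 256;                               % frame size
a = 0.46;                              % window parameter from the text
n = (0:N-1)';                          % sample index within the frame
w = (1 - a) - a*cos(2*pi*n/(N-1));     % window values

frame    = ones(N, 1);                 % hypothetical frame
windowed = frame .* w;                 % tapered frame: small at both ends, near 1 in the middle
```

The end points are scaled to 1 - 2a = 0.08, so the jump at the frame boundaries is greatly reduced.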

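4. Fast Fourier transform
The characteristics of a signal are usually hard to see from its variation in the time domain, so the signal is normally converted to an energy distribution in the frequency domain, where different energy distributions represent the characteristics of different speech sounds. After the Hamming window is applied, each frame therefore goes through an FFT to obtain its spectrum, and the squared magnitude of the spectrum gives the power spectrum. The DFT of the speech signal is
$$X_a(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi n k / N}, \quad 0 \le k \le N - 1,$$

where x(n) is the input speech signal and N is the number of points of the Fourier transform.

The Nyquist frequency needs a short introduction here. It is half the sampling rate of a discrete-time system, and is named after Harry Nyquist and the Nyquist-Shannon sampling theorem. The theorem states that aliasing is avoided as long as the Nyquist frequency is higher than the highest frequency (or bandwidth) of the sampled signal. Speech systems commonly sample at 16 kHz, which gives a Nyquist frequency of 8 kHz. This is well above the 300 Hz to 3400 Hz band (the classic telephone band) in which most of the speech content discussed here lies, so the condition is met. For a real-valued frame the FFT output is conjugate-symmetric, so only the bins up to the Nyquist frequency need to be kept: with 512 samples per frame and a 512-point transform, N/2+1 = 257 bins remain, representing the frequency components from 0 Hz to half the sampling rate. The FFT thus moves the signal from the time domain to the frequency domain, and discarding the redundant mirrored half also reduces the dimensionality.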
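The 257-bin figure can be reproduced with a short sketch. The test tone placed exactly on bin 32 is an assumption made so that the spectral peak is easy to verify; real speech frames would of course be windowed first.

```matlab
% One-sided power spectrum of a single 512-sample frame.
fs = 16000;                        % sampling rate (Hz)
N  = 512;                          % frame length and FFT size
n  = (0:N-1)';
k0 = 32;                           % hypothetical tone on bin 32, i.e. 32*fs/N = 1000 Hz
x  = cos(2*pi*k0*n/N);

X = fft(x, N);                     % N complex bins, conjugate-symmetric for real x
X = X(1:N/2+1);                    % keep bins 0..N/2 (0 Hz up to fs/2)
P = abs(X).^2;                     % power spectrum
f = (0:N/2)' * fs / N;             % frequency of each kept bin (Hz)

[~, idx] = max(P);
peakHz = f(idx);                   % frequency of the strongest bin
```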

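5. Mel filterbank
Because the ear's sensitivity varies nonlinearly with frequency, the spectrum is divided according to that sensitivity by a bank of mel filters. On the mel scale the center frequencies of the filters are equally spaced, but in Hz they are not. This follows from the formula that converts frequency to mel frequency:
$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

The logarithm in the formula is base 10 (lg).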
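To make the nonlinearity concrete, the sketch below assumes the common conversion Mel(f) = 2595*log10(1 + f/700) and its inverse. The helper names `hz2mel` and `mel2hz` mirror those in the `trifbank` help text further down; the six sample points are arbitrary.

```matlab
% Hz <-> mel conversion and the effect of equal spacing on the mel scale.
hz2mel = @(f) 2595*log10(1 + f/700);
mel2hz = @(m) 700*(10.^(m/2595) - 1);

melMax = hz2mel(8000);                 % mel value of 8 kHz, about 2840.02
melPts = linspace(0, melMax, 6);       % six equally spaced points on the mel scale
hzPts  = mel2hz(melPts);               % the same points expressed in Hz
gaps   = diff(hzPts);                  % spacing in Hz widens as frequency rises
```

The form 1127*log(1 + f/700) with the natural logarithm, used in the `trifbank` example, is the same mapping up to rounding of the constant.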

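The power spectrum is passed through a bank of M triangular filters on the mel scale (the number of filters is close to the number of critical bands), with center frequencies f(m), m = 1, 2, …, M. M is usually 22 to 26. The spacing between adjacent f(m) narrows as m decreases and widens as m increases.

[Figure: triangular mel filterbank]

The frequency response of filter m is
$$H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases}$$

Here k is the index of an FFT bin (0 to 256 in the earlier example), and f(m) is likewise expressed as a bin index. It is obtained as follows: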

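1. Choose the lowest frequency of the speech signal (usually 0 Hz), the highest frequency (usually half the sampling rate), and the number of mel filters.

2. Convert the lowest and highest frequencies to mel frequencies.

3. Compute the distance between the center frequencies of two adjacent mel filters: (highest mel frequency - lowest mel frequency) / (number of filters + 1).

4. Convert each center mel frequency back to Hz.

5. Compute the FFT bin index that corresponds to each frequency.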

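For example, assume a sampling rate of 16 kHz, a lowest frequency of 0 Hz, 26 filters, and a frame size of 512, so the Fourier transform also has 512 points. Substituting into the mel conversion formula gives a lowest mel frequency of 0 and a highest mel frequency of 2840.02. The spacing is (2840.02 - 0)/(26 + 1) = 105.19, which gives the mel points [0, 105.19, 210.38, …, 2840.02]: the 26 center frequencies plus the two band edges. These are converted back to Hz (by applying the formula; the values are not listed here), and finally each frequency is mapped to an FFT bin index as frequency / sampling rate × (number of FFT points + 1), rounded down. The resulting bin indices are [0, 2, 4, 7, 10, 13, 16, …, 256], that is, f(0), f(1), …, f(27).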
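The worked example can be reproduced in a few lines. This is only a sketch under the same assumptions as the text (16 kHz, 26 filters, 512-point FFT), with the result rounded down via `floor`; the helper names are illustrative.

```matlab
% Bin indices f(0)..f(27) for the worked example.
fs      = 16000;                       % sampling rate (Hz)
nfft    = 512;                         % number of FFT points
numFilt = 26;                          % number of mel filters

hz2mel = @(f) 2595*log10(1 + f/700);
mel2hz = @(m) 700*(10.^(m/2595) - 1);

melPts = linspace(hz2mel(0), hz2mel(fs/2), numFilt + 2);  % 28 equally spaced mel points
hzPts  = mel2hz(melPts);                                  % back to Hz
bins   = floor((nfft + 1) * hzPts / fs);                  % 0-based FFT bin indices

disp(bins(1:7));                       % first seven indices
```

The first seven indices come out as 0, 2, 4, 7, 10, 13, 16 and the last one as 256, matching the list in the text.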
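With these in hand, the output of each filter is computed as
$$s(m) = \ln\left(\sum_{k=0}^{N-1} |X_a(k)|^2 H_m(k)\right), \quad 1 \le m \le M,$$
where M is the number of filters and N is the number of FFT bins (257 in the example above). After this step each frame is represented by as many values as there are filters, which reduces the dimensionality further (to 26 in this example).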
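The following sketch builds the triangular filters from the bin indices of the example and applies them to one power spectrum. It assumes unit-peak triangles (the same shape that `trifbank` below implements via Eq. 6.141) and a flat, made-up power spectrum; variable names are illustrative.

```matlab
% Triangular mel filterbank applied to one frame's power spectrum.
fs = 16000; nfft = 512; numFilt = 26;
hz2mel = @(f) 2595*log10(1 + f/700);
mel2hz = @(m) 700*(10.^(m/2595) - 1);
hzPts  = mel2hz(linspace(0, hz2mel(fs/2), numFilt + 2));
bins   = floor((nfft + 1) * hzPts / fs);     % f(0)..f(27), 0-based bin indices

K = nfft/2 + 1;                              % 257 bins per frame
H = zeros(numFilt, K);                       % one filter per row
for m = 1:numFilt
    lo = bins(m); mid = bins(m+1); hi = bins(m+2);
    for k = lo:mid                           % rising slope, reaches 1 at the center
        H(m, k+1) = (k - lo) / (mid - lo);   % +1 because MATLAB indices start at 1
    end
    for k = mid:hi                           % falling slope, back to 0 at the upper edge
        H(m, k+1) = (hi - k) / (hi - mid);
    end
end

P    = ones(K, 1);                           % hypothetical flat power spectrum
logE = log(H * P);                           % 26 log filterbank energies
```

Each 257-point power spectrum is reduced to 26 numbers, one per filter.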

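6. Discrete cosine transform
The discrete cosine transform is widely used in signal and image processing for lossy compression, because it has a strong "energy compaction" property: for most natural signals (including sound and images), the energy is concentrated in the low-order DCT coefficients. In practice this step is one more dimensionality reduction for each frame. The formula is
$$C(n) = \sum_{m=1}^{M} s(m) \cos\left(\frac{\pi n (m - 0.5)}{M}\right), \quad n = 1, 2, \ldots, L.$$
Substituting the log energy of each filter into the DCT yields the L-th order mel-scale cepstrum parameters. L is the order of the MFCC coefficients, usually 12 to 16, and M is the number of triangular filters.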
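A small sketch of this step, assuming the form C(n) = sum over m = 1..M of s(m)*cos(pi*n*(m - 0.5)/M) for n = 1..L. The two input vectors are invented test cases, chosen because their transforms are easy to reason about.

```matlab
% DCT of log filterbank energies, written as a matrix product.
M = 26;                                  % number of filters
L = 13;                                  % number of cepstral coefficients kept
m = 1:M;                                 % filter index (row vector)
n = (1:L)';                              % cepstral index (column vector)
D = cos(pi * n * (m - 0.5) / M);         % L-by-M basis; n*(m-0.5) is an outer product

sFlat = ones(M, 1);                      % hypothetical flat log energies
cFlat = D * sFlat;                       % all coefficients for n >= 1 vanish

sTilt = (1:M)';                          % hypothetical smoothly tilted log energies
cTilt = D * sTilt;                       % largest magnitude at n = 1
```

A smooth spectral envelope ends up mostly in the lowest-order coefficients, which is why keeping only L of them loses little.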

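7. Extracting dynamic (delta) parameters
The standard MFCC cepstral parameters only reflect the static characteristics of speech; its dynamic characteristics can be described by the difference spectrum of these static features. Experiments have shown that combining dynamic and static features improves recognition performance. The delta parameters can be computed as
$$d_t = \begin{cases} C_{t+1} - C_t, & t < K \\ \dfrac{\sum_{k=1}^{K} k\,(C_{t+k} - C_{t-k})}{2 \sum_{k=1}^{K} k^2}, & \text{otherwise} \\ C_t - C_{t-1}, & t \ge Q - K \end{cases}$$
where d_t is the t-th first-order delta, C_t is the t-th cepstral coefficient, Q is the order of the cepstral coefficients, and K is the time span of the first derivative, which can be 1 or 2. Applying the same formula to the result gives the second-order delta parameters.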
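A minimal sketch of the delta computation, assuming the common regression form with 2*sum(k^2) in the denominator. Edge frames are handled here by repeating the first and last frame, which is one common convention and an assumption of this sketch; the coefficient matrix is made up so that the expected slope is known.

```matlab
% First-order delta features over time (rows: coefficients, columns: frames).
K = 2;                                    % time span of the difference
C = [1:10; 2*(1:10)];                     % hypothetical coefficients, linear in time
[numCoef, T] = size(C);

% Pad K frames on each side by repeating the edge frames (assumed convention).
Cpad = [repmat(C(:,1), 1, K), C, repmat(C(:,end), 1, K)];

D = zeros(numCoef, T);
for k = 1:K
    % Frame t sits at column t+K of Cpad.
    D = D + k * (Cpad(:, (1:T)+K+k) - Cpad(:, (1:T)+K-k));
end
D = D / (2 * sum((1:K).^2));              % regression normalization
```

For interior frames the result equals the slope of each coefficient (1 and 2 here). Running the same code on D gives second-order deltas.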
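The full MFCC feature vector therefore consists of N MFCC parameters (N/3 MFCC coefficients + N/3 first-order delta parameters + N/3 second-order delta parameters) plus the frame energy (this last item can be replaced as needed).
The frame energy is the loudness (energy) of one frame. It is also an important speech feature and is very easy to compute. The log energy of a frame (defined as the sum of squared samples in the frame, then the base-10 logarithm, then multiplied by 10) is therefore usually appended, so that the basic feature set of each frame gains one dimension: one log energy plus the cepstral parameters. This also explains the 40 dimensions mentioned at the start: if 13 coefficients are kept after the DCT, adding first- and second-order deltas gives 39 dimensions, and the frame energy brings the total to 40. This can be adjusted to suit the application.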
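The 40-dimensional layout can be sketched as follows. The frames and the zero-filled coefficient matrices are placeholders invented for illustration; in a real pipeline they would come from the steps described earlier.

```matlab
% Assemble a 40-dimensional feature vector per frame: 13 + 13 + 13 + 1.
numFrames = 4;
frames = 0.1 * ones(512, numFrames);       % hypothetical frames, one per column

logE = 10 * log10(sum(frames.^2, 1));      % log energy: sum of squares, log10, times 10

c  = zeros(13, numFrames);                 % placeholder static MFCCs
d1 = zeros(13, numFrames);                 % placeholder first-order deltas
d2 = zeros(13, numFrames);                 % placeholder second-order deltas

feat = [c; d1; d2; logE];                  % 40-by-numFrames feature matrix
```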

II. Source code

clc; clear;

% --- Train the SVM ---
load traindata Myfeature              % training feature matrix, one row per utterance
A1 = zeros(1,30);                     % label 0 for the first 30 rows (reported as English below)
A2 = ones(1,30);                      % label 1 for the last 30 rows (reported as Chinese below)
Group = [A1, A2];
TrainData = Myfeature;
% Note: svmtrain/svmclassify were removed in MATLAB R2018a;
% newer releases provide fitcsvm/predict instead.
SVMStruct = svmtrain(TrainData, Group);

% --- Feature extraction parameters ---
N = 5.3;            % length of speech analyzed (s)
Tw = 25;            % analysis frame duration (ms)
Ts = 10;            % analysis frame shift (ms)
alpha = 0.97;       % preemphasis coefficient
R = [ 300 3700 ];   % frequency range to consider
M = 20;             % number of filterbank channels
C = 13;             % number of cepstral coefficients
L = 22;             % cepstral sine lifter parameter
fs = 16000;
hamming = @(N)(0.54-0.46*cos(2*pi*[0:N-1].'/(N-1)));

% --- Load the utterance to classify ---
[filename, pathname] = uigetfile({'*.*';'*.flac'; '*.wav'; '*.mp3'; }, 'Select a speech file');
if isequal(filename, 0)               % dialog cancelled, no file selected
    return;
end
[speech,fs] = audioread(fullfile(pathname, filename));
[voice,fs] = extractvoice_simple(speech,-30, -20,0.2);
voicex = voice(1:round(N*16000));     % keep the first N seconds (assumes 16 kHz audio)

% --- MFCC and LPC features ---
[ mfccs, FBEs, frames ] = ...
    mfcc( voicex, fs, Tw, Ts, alpha, hamming, R, M, C, L );
ceps_mfccx = mfccs(:);
[cep,ER] = lpces(voicex,17,256,256);
ceps_lpc = cep(2:17,:);               % LPC
%[lpc,ER]=lpces(voice,12,256,256);
%ceps_lpcc=lpc2lpcc(cep);%LPCC
ceps_lpcx = ceps_lpc(:);
ceps = [ceps_mfccx(1000:2000);ceps_lpcx(1:2000)];   % 1001 MFCC values + 2000 LPC values
TestData = ceps';

% --- Classify ---
languagex = svmclassify(SVMStruct,TestData);
if languagex == 1
    language='Chinese'
else
    language='English'
end

% t=[1:2000];
% figure
% scatter(t,ceps_lpcx(1:2000),50,'r');
% xlabel('sample point');
% ylabel('LPC');
% title('LPC features');
% hold on
% [filename, pathname] = uigetfile({'*.*';'*.flac'; '*.wav'; '*.mp3'; }, 'Select a speech file');
% % no file selected
% if filename == 0
%     return;
% end
% [speech,fs] = audioread([pathname, filename]);
% [voice,fs]=extractvoice_simple(speech,-30, -20,0.2);
% voicex=voice(1:N*16000);
% [ mfccs, FBEs, frames ] = ...
%     mfcc( voicex, fs, Tw, Ts, alpha, hamming, R, M, C, L );
%  ceps_mfccx=mfccs(:);
%  [cep,ER]=lpces(voicex,17,256,256); ceps_lpc=cep(2:17,:);%LPC
%

% --- trifbank.m (triangular filterbank helper, assumed to live in its own file) ---
function [ H, f, c ] = trifbank( M, K, R, fs, h2w, w2h )
% TRIFBANK Triangular filterbank.
%
%   [H,F,C]=TRIFBANK(M,K,R,FS,H2W,W2H) returns matrix of M triangular filters
%   (one per row), each K coefficients long along with a K coefficient long
%   frequency vector F and M+2 coefficient long cutoff frequency vector C.
%   The triangular filters are between limits given in R (Hz) and are
%   uniformly spaced on a warped scale defined by forward (H2W) and backward
%   (W2H) warping functions.
%
%   Inputs
%           M is the number of filters, i.e., number of rows of H
%
%           K is the length of frequency response of each filter
%             i.e., number of columns of H
%
%           R is a two element vector that specifies frequency limits (Hz),
%             i.e., R = [ low_frequency high_frequency ];
%
%           FS is the sampling frequency (Hz)
%
%           H2W is a Hertz scale to warped scale function handle
%
%           W2H is a warped scale to Hertz scale function handle
%
%   Outputs
%           H is a M by K triangular filterbank matrix (one filter per row)
%
%           F is a frequency vector (Hz) of 1xK dimension
%
%           C is a vector of filter cutoff frequencies (Hz),
%             note that C(2:end-1) also represents filter center frequencies,
%             and the dimension of C is 1x(M+2)
%
%   Example
%           fs = 16000;               % sampling frequency (Hz)
%           nfft = 2^12;              % fft size (number of frequency bins)
%           K = nfft/2+1;             % length of each filter
%           M = 23;                   % number of filters
%
%           hz2mel = @(hz)(1127*log(1+hz/700)); % Hertz to mel warping function
%           mel2hz = @(mel)(700*exp(mel/1127)-700); % mel to Hertz warping function
%
%           % Design mel filterbank of M filters each K coefficients long,
%           % filters are uniformly spaced on the mel scale between 0 and Fs/2 Hz
%           [ H1, freq ] = trifbank( M, K, [0 fs/2], fs, hz2mel, mel2hz );
%
%           % Design mel filterbank of M filters each K coefficients long,
%           % filters are uniformly spaced on the mel scale between 300 and 3750 Hz
%           [ H2, freq ] = trifbank( M, K, [300 3750], fs, hz2mel, mel2hz );
%
%           % Design mel filterbank of 18 filters each K coefficients long,
%           % filters are uniformly spaced on the Hertz scale between 4 and 6 kHz
%           [ H3, freq ] = trifbank( 18, K, [4 6]*1E3, fs, @(h)(h), @(h)(h) );
%
%            hfig = figure('Position', [25 100 800 600], 'PaperPositionMode', ...
%                              'auto', 'Visible', 'on', 'color', 'w'); hold on;
%           subplot( 3,1,1 );
%           plot( freq, H1 );
%           xlabel( 'Frequency (Hz)' ); ylabel( 'Weight' ); set( gca, 'box', 'off' );
%
%           subplot( 3,1,2 );
%           plot( freq, H2 );
%           xlabel( 'Frequency (Hz)' ); ylabel( 'Weight' ); set( gca, 'box', 'off' );
%
%           subplot( 3,1,3 );
%           plot( freq, H3 );
%           xlabel( 'Frequency (Hz)' ); ylabel( 'Weight' ); set( gca, 'box', 'off' );
%
%   Reference
%           [1] Huang, X., Acero, A., Hon, H., 2001. Spoken Language Processing:
%               A guide to theory, algorithm, and system development.
%               Prentice Hall, Upper Saddle River, NJ, USA (pp. 314-315).

%   Author  Kamil Wojcicki, UTD, June 2011

    if( nargin~= 6 ), help trifbank; return; end; % very lite input validation

    f_min = 0;          % filter coefficients start at this frequency (Hz)
    f_low = R(1);       % lower cutoff frequency (Hz) for the filterbank
    f_high = R(2);      % upper cutoff frequency (Hz) for the filterbank
    f_max = 0.5*fs;     % filter coefficients end at this frequency (Hz)
    f = linspace( f_min, f_max, K ); % frequency range (Hz), size 1xK
    fw = h2w( f );

    % filter cutoff frequencies (Hz) for all filters, size 1x(M+2)
    c = w2h( h2w(f_low)+[0:M+1]*((h2w(f_high)-h2w(f_low))/(M+1)) );
    cw = h2w( c );

    H = zeros( M, K );                  % zero otherwise
    for m = 1:M

        % implements Eq. (6.140) on page 314 of [1]
        % k = f>=c(m)&f<=c(m+1); % up-slope
        % H(m,k) = 2*(f(k)-c(m)) / ((c(m+2)-c(m))*(c(m+1)-c(m)));
        % k = f>=c(m+1)&f<=c(m+2); % down-slope
        % H(m,k) = 2*(c(m+2)-f(k)) / ((c(m+2)-c(m))*(c(m+2)-c(m+1)));

        % implements Eq. (6.141) on page 315 of [1]
        k = f>=c(m)&f<=c(m+1); % up-slope
        H(m,k) = (f(k)-c(m))/(c(m+1)-c(m));
        k = f>=c(m+1)&f<=c(m+2); % down-slope
        H(m,k) = (c(m+2)-f(k))/(c(m+2)-c(m+1));

    end