北太天元學(xué)習(xí)44-文本分析之詞干提取(porter詞干提取算法)
不好意思,我是從網(wǎng)上學(xué)習(xí)的如何處理英文的文本,如果是中文的文本,我們處理起來(lái)有方便的地方,那就是我們實(shí)際上是不需要詞干提取工作,但是中文相比于英文需要一個(gè)額外的工作是分詞。 他山之石,可以攻玉。
我也先簡(jiǎn)單介紹一下詞干提取(stemming)和詞形還原(lemmatization)。
由于語(yǔ)法原因,文檔將使用單詞的不同形式,如organize、organizes 和 organizing。此外,還有一些派生相關(guān)的單詞家族具有相似的含義,如democracy, democratic, and democratization。在許多情況下,搜索其中一個(gè)單詞以返回包含這個(gè)單詞其他變形形式或者衍生形式的詞,這個(gè)讓人覺(jué)得還是很智能還有用的。
詞干提取通常指的是一種粗糙的啟發(fā)式過(guò)程,在大多數(shù)情況下,為了正確地實(shí)現(xiàn)這一目標(biāo),它會(huì)去除詞的后綴?!霸~形還原”? 不是簡(jiǎn)單地將前后綴去掉,而是會(huì)根據(jù)詞典將單詞進(jìn)行轉(zhuǎn)換。比如「drove」會(huì)轉(zhuǎn)換為「drive」。不同于詞干提?。╯temming),詞形還原后的單詞不一定會(huì)出現(xiàn)在單詞中,比如單詞“ate”詞形還原后的單詞為“eat”。
Porter's stemming 算法可以見(jiàn)Porter20世紀(jì)80年代的論文,其主要關(guān)注點(diǎn)是刪除單詞的共同結(jié)尾,以便將它們解析為通用形式。他原來(lái)的代碼是c語(yǔ)言寫的,我把他移植到北太天元上。
可以把下面的代碼全部copy到名字為 porterStemmer.m 的文件里,然后在北太天元使用。
%北太天元的 詞干提取(Stem) 算法
% 輸入?yún)?shù): inString? 需要是一個(gè)char mat, 需要是小些字符,
%?????????? 在調(diào)用這個(gè)函數(shù)之前,先把字符串變成小寫,不然
%?????????? 不會(huì)執(zhí)行詞干提取操作。
% 輸出參數(shù): stem 是一個(gè)char mat, 是inString的詞干
% 例如
%? cars_stem = porterStemmer('cars')
%? cleaning_stem = porterStemmer('cleaning')
?? ?function stem = porterStemmer(inString)
/*
實(shí)現(xiàn)Porter-Stemming算法,如下論文所示:
Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14,
no. 3, pp 130-137
原始的代碼是C語(yǔ)言寫的,可以在下面的鏈接處下載:
http://www.tartarus.org/~martin/PorterStemmer/c.txt
詞干提取算法, 例如cleaning的詞干是clean, cars 的詞干是car。
*/
if (nargin ~=1)
?? ?help porterStemmer;
?? ?return;
end
if( ~ischar(inString) )
??? ?error("porterStemmer(inString)的輸入?yún)?shù)必須是單引號(hào)字符串");
end?? ?
global j;
b = inString;
k = length(b);
k0 = 1;
j = k;
/*
使用下面if語(yǔ)句,長(zhǎng)度為1或2的字符串不會(huì)執(zhí)行詞干提取操作。
如果想去掉這個(gè)限制,請(qǐng)刪除此條件以匹配你的需求
*/
stem = b;
if k > 2
??? % 每個(gè)步驟的輸出顯示被注釋掉。
??? %disp(sprintf('Word to stem: %s', b));
??? x = step1ab(b, k, k0);
??? %disp(sprintf('Steps 1A and B yield: %s', x{1}));
??? x = step1c(x{1}, x{2}, k0);
??? %disp(sprintf('Step 1C yields: %s', x{1}));
??? x = step2(x{1}, x{2}, k0);
??? %disp(sprintf('Step 2 yields: %s', x{1}));
??? x = step3(x{1}, x{2}, k0);
??? %disp(sprintf('Step 3 yields: %s', x{1}));
??? x = step4(x{1}, x{2}, k0);
??? %disp(sprintf('Step 4 yields: %s', x{1}));
??? x = step5(x{1}, x{2}, k0);
??? %disp(sprintf('Step 5 yields: %s', x{1}));
??? stem = x{1};
end
% cons(j) 返回TRUE 當(dāng)且僅當(dāng) b[j] 是個(gè)輔音
function c = cons(i, b, k0)
c = true;
switch(b(i))
??? case {'a', 'e', 'i', 'o', 'u'}
??????? c = false;
??? case 'y'??? %y是個(gè)半元音,需要單獨(dú)判斷
??????? if i == k0
??????????? c = true;
??????? else
??????????? c = ~cons(i - 1, b, k0);
??????? end
end
% mseq() 測(cè)量k0和j之間的輔音序列的數(shù)量。如果c是輔音序列,
% v是元音序列,并且<c>表示任意多個(gè)的元音序列
% <v>表示任意多個(gè)輔音序列
%????? <c><v>?????? gives 0
%????? <c>vc<v>???? gives 1
%????? <c>vcvc<v>?? gives 2
%????? <c>vcvcvc<v> gives 3
%????? ....
function n = measure(b, k0)
global j;
n = 0;
i = k0;
while true
??? if i > j
??????? return
??? end
??? if ~cons(i, b, k0)
??????? break;
??? end
??? i = i + 1;
end
i = i + 1;
while true
??? while true
??????? if i > j
??????????? return
??????? end
??????? if cons(i, b, k0)
??????????? break;
??????? end
??????? i = i + 1;
??? end
??? i = i + 1;
??? n = n + 1;
??? while true
??????? if i > j
??????????? return
??????? end
??????? if ~cons(i, b, k0)
??????????? break;
??????? end
??????? i = i + 1;
??? end
??? i = i + 1;
end
% vowelinstem() 返回 TRUE 當(dāng)切僅當(dāng) k0,...j 包含元音
function vis = vowelinstem(b, k0)
global j;
for i = k0:j,
??? if ~cons(i, b, k0)
??????? vis = true;
??????? return
??? end
end
vis = false;
%doublec(i) is TRUE <=> i,(i-1) contain a double consonant.
function dc = doublec(i, b, k0)
if i < k0+1
??? dc = false;
??? return
end
if b(i) ~= b(i-1)
??? dc = false;
??? return
end
dc = cons(i, b, k0);
% cvc(j) is TRUE <=> j-2,j-1,j has the form consonant - vowel - consonant
% and also if the second c is not w,x or y. this is used when trying to
% restore an e at the end of a short word. e.g.
%
%????? cav(e), lov(e), hop(e), crim(e), but
%????? snow, box, tray.
function c1 = cvc(i, b, k0)
if ((i < (k0+2)) || ~cons(i, b, k0) || cons(i-1, b, k0) || ~cons(i-2, b, k0))
??? c1 = false;
else
??? if (b(i) == 'w' || b(i) == 'x' || b(i) == 'y')
??????? c1 = false;
??????? return
??? end
??? c1 = true;
end
% ends(s) is TRUE <=> k0,...k ends with the string s.
function s = ends(str, b, k)
global j;
if (str(length(str)) ~= b(k))
??? s = false;
??? return
end % tiny speed-up
if (length(str) > k)
??? s = false;
??? return
end
if strcmp(b(k-length(str)+1:k), str)
??? s = true;
??? j = k - length(str);
??? return
else
??? s = false;
end
% setto(s) sets (j+1),...k to the characters in the string s, readjusting
% k accordingly.
function so = setto(s, b, k)
global j;
for i = j+1:(j+length(s))
??? b(i) = s(i-j);
end
if k > j+length(s)
??? b((j+length(s)+1):k) = '';
end
k = length(b);
so = {b, k};
% rs(s) is used further down.
% [Note: possible null/value for r if rs is called]
function r = rs(str, b, k, k0)
r = {b, k};
if measure(b, k0) > 0
??? r = setto(str, b, k);
end
% step1ab() gets rid of plurals and -ed or -ing. e.g.
%?????? caresses? ->? caress
%?????? ponies??? ->? poni
%?????? ties????? ->? ti
%?????? caress??? ->? caress
%?????? cats????? ->? cat
%?????? feed????? ->? feed
%?????? agreed??? ->? agree
%?????? disabled? ->? disable
%?????? matting?? ->? mat
%?????? mating??? ->? mate
%?????? meeting?? ->? meet
%?????? milling?? ->? mill
%?????? messing?? ->? mess
%?????? meetings? ->? meet
function s1ab = step1ab(b, k, k0)
global j;
if b(k) == 's'
??? if ends('sses', b, k)
??????? k = k-2;
??? elseif ends('ies', b, k)
??????? retVal = setto('i', b, k);
??????? b = retVal{1};
??????? k = retVal{2};
??? elseif (b(k-1) ~= 's')
??????? k = k-1;
??? end
end
if ends('eed', b, k)
??? if measure(b, k0) > 0;
??????? k = k-1;
??? end
elseif (ends('ed', b, k) || ends('ing', b, k)) && vowelinstem(b, k0)
??? k = j;
?? ?% 這句話正常情況下應(yīng)該計(jì)算正確的,但是這里的j是在vowelingstem(b,k0)和ends('ing',b,k)
??? % 作為全局變量而被修改的,因此k=j的時(shí)候并沒(méi)有使用修改之后的值。
??? % 因此,我用了下面的字符串str, 然后使用eval函數(shù)重新執(zhí)行了一遍,可以在北太天元上得到
??? % 正確結(jié)果了。
??? str = [ 'k =', num2str(j)]
??? eval(str)
??? retVal = {b, k};
??? if ends('at', b, k)
??????? retVal = setto('ate', b(k0:k), k);
??? elseif ends('bl', b, k)
??????? retVal = setto('ble', b(k0:k), k);
??? elseif ends('iz', b, k)
??????? retVal = setto('ize', b(k0:k), k);
??? elseif doublec(k, b, k0)
??????? retVal = {b, k-1};
??????? if b(retVal{2}) == 'l' || b(retVal{2}) == 's' || ...
??????????????? b(retVal{2}) == 'z'
??????????? retVal = {retVal{1}, retVal{2}+1};
??????? end
??? elseif measure(b, k0) == 1 && cvc(k, b, k0)
??????? retVal = setto('e', b(k0:k), k);
??? end
??? k = retVal{2};
??? b = retVal{1}(k0:k);
end
j = k;
s1ab = {b(k0:k), k};
%? step1c() 當(dāng)詞干中有另一個(gè)元音時(shí),將詞尾y變?yōu)閕。
%? playing --> play --> plai 反而是個(gè)錯(cuò)誤,原來(lái)的c代碼也是
% 有這個(gè)毛病,為了保持原汁原味,這里也沒(méi)有修改這個(gè)毛病。
function s1c = step1c(b, k, k0)
global j;
if ends('y', b, k) && vowelinstem(b, k0)
??? b(k) = 'i';
end
j = k;
s1c = {b, k};
% step2() maps double suffices to single ones. so -ization ( = -ize plus
% -ation) maps to -ize etc. note that the string before the suffix must give
% m() > 0.
function s2 = step2(b, k, k0)
global j;
s2 = {b, k};
switch b(k-1)
??? case {'a'}
??????? if ends('ational', b, k); s2 = rs('ate', b, k, k0);
??????? elseif ends('tional', b, k); s2 = rs('tion', b, k, k0); end;
??? case {'c'}
??????? if ends('enci', b, k); s2 = rs('ence', b, k, k0);
??????? elseif ends('anci', b, k); s2 = rs('ance', b, k, k0); end;
??? case {'e'}
??????? if ends('izer', b, k); s2 = rs('ize', b, k, k0); end;
??? case {'l'}
??????? if ends('bli', b, k); s2 = rs('ble', b, k, k0);
??????? elseif ends('alli', b, k); s2 = rs('al', b, k, k0);
??????? elseif ends('entli', b, k); s2 = rs('ent', b, k, k0);
??????? elseif ends('eli', b, k); s2 = rs('e', b, k, k0);
??????? elseif ends('ousli', b, k); s2 = rs('ous', b, k, k0); end;
??? case {'o'}
??????? if ends('ization', b, k); s2 = rs('ize', b, k, k0);
??????? elseif ends('ation', b, k); s2 = rs('ate', b, k, k0);
??????? elseif ends('ator', b, k); s2 = rs('ate', b, k, k0); end;
??? case {'s'}
??????? if ends('alism', b, k); s2 = rs('al', b, k, k0);
??????? elseif ends('iveness', b, k); s2 = rs('ive', b, k, k0);
??????? elseif ends('fulness', b, k); s2 = rs('ful', b, k, k0);
??????? elseif ends('ousness', b, k); s2 = rs('ous', b, k, k0); end;
??? case {'t'}
??????? if ends('aliti', b, k); s2 = rs('al', b, k, k0);
??????? elseif ends('iviti', b, k); s2 = rs('ive', b, k, k0);
??????? elseif ends('biliti', b, k); s2 = rs('ble', b, k, k0); end;
??? case {'g'}
??????? if ends('logi', b, k); s2 = rs('log', b, k, k0); end;
end
j = s2{2};
% step3() deals with -ic-, -full, -ness etc. similar strategy to step2.
function s3 = step3(b, k, k0)
global j;
s3 = {b, k};
switch b(k)
??? case {'e'}
??????? if ends('icate', b, k); s3 = rs('ic', b, k, k0);
??????? elseif ends('ative', b, k); s3 = rs('', b, k, k0);
??????? elseif ends('alize', b, k); s3 = rs('al', b, k, k0); end;
??? case {'i'}
??????? if ends('iciti', b, k); s3 = rs('ic', b, k, k0); end;
??? case {'l'}
??????? if ends('ical', b, k); s3 = rs('ic', b, k, k0);
??????? elseif ends('ful', b, k); s3 = rs('', b, k, k0); end;
??? case {'s'}
??????? if ends('ness', b, k); s3 = rs('', b, k, k0); end;
end
j = s3{2};
% step4() 脫去 -ant, -ence 等, 如果這些是在<c>vcvc<v>的上下文中.
function s4 = step4(b, k, k0)
global j;
switch b(k-1)
??? case {'a'}
??????? if ends('al', b, k); end;
??? case {'c'}
??????? if ends('ance', b, k)
??????? elseif ends('ence', b, k); end;
??? case {'e'}
??????? if ends('er', b, k); end;
??? case {'i'}
??????? if ends('ic', b, k); end;
??? case {'l'}
??????? if ends('able', b, k)
??????? elseif ends('ible', b, k); end;
??? case {'n'}
??????? if ends('ant', b, k)
??????? elseif ends('ement', b, k)
??????? elseif ends('ment', b, k)
??????? elseif ends('ent', b, k); end;
??? case {'o'}
??????? if ends('ion', b, k)
??????????? if j == 0
??????????? elseif ~(strcmp(b(j),'s') || strcmp(b(j),'t'))
??????????????? j = k;
??????????? end
??????? elseif ends('ou', b, k); end;
??? case {'s'}
??????? if ends('ism', b, k); end;
??? case {'t'}
??????? if ends('ate', b, k)
??????? elseif ends('iti', b, k); end;
??? case {'u'}
??????? if ends('ous', b, k); end;
??? case {'v'}
??????? if ends('ive', b, k); end;
??? case {'z'}
??????? if ends('ize', b, k); end;
end
if measure(b, k0) > 1
??? s4 = {b(k0:j), j};
else
??? s4 = {b(k0:k), k};
end
% step5() 如果m()>1,則刪除最后一個(gè)-e,并且把-ll更改為-l。
function s5 = step5(b, k, k0)
global j;
j = k;
if b(k) == 'e'
??? a = measure(b, k0);
??? if (a > 1) || ((a == 1) && ~cvc(k-1, b, k0))
??????? k = k-1;
??? end
end
if (b(k) == 'l') && doublec(k, b, k0) && (measure(b, k0) > 1)
??? k = k-1;
end
s5 = {b(k0:k), k};