Chi-square binning (ChiMerge)

2020-10-21 19:49 Author: python风控模型

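Today's topic is the chi-square binning algorithm ChiMerge. First, a short introduction to two concepts that come up alongside it: the chi-square distribution and the chi-square test.

I. Chi-square distribution

The chi-square distribution (χ2-distribution) is a probability distribution commonly used in probability and statistics and one of the most widely applied distributions in statistical inference. It appears frequently in hypothesis testing and in the calculation of confidence intervals.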

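The chi-square distribution is defined as follows:

If k independent random variables Z1, Z2, ..., Zk each follow the standard normal distribution N(0,1), then the sum of their squares

$$X = \sum_{i=1}^{k} Z_i^2$$

follows the chi-square distribution with k degrees of freedom, written

$$X \sim \chi^2(k)$$

or

$$X \sim \chi^2_k$$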
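As a quick numerical check of this definition, the sketch below draws k standard normal variables many times, sums their squares, and compares the result with `scipy.stats.chi2`. The value of k, the seed, the sample size, and the variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2

k = 3                             # degrees of freedom (illustrative choice)
rng = np.random.default_rng(0)    # fixed seed so the run is repeatable

# Each row holds k independent N(0, 1) draws; the sum of squares of a row
# is one sample from the chi-square distribution with k degrees of freedom.
z = rng.standard_normal(size=(100_000, k))
x = (z ** 2).sum(axis=1)

sample_mean = x.mean()                 # the mean of a chi-square(k) variable is k
sample_median = np.median(x)
theory_median = chi2.ppf(0.5, df=k)    # median of the chi-square(k) distribution

print(sample_mean, sample_median, theory_median)
```

The sample mean should land close to k and the sample median close to the theoretical median.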

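II. Chi-square test

The χ2 test is a hypothesis test based on the χ2 distribution. It is mainly used to test the independence of categorical variables.

The basic idea is to use sample data to infer whether the distribution of the population differs significantly from an expected distribution, or whether two categorical variables are associated or independent.

The null hypothesis is usually that the observed frequencies do not differ from the expected frequencies, or that the two variables are independent of each other.

In practice, we assume the null hypothesis holds and compute the chi-square value, which expresses how far the observed values deviate from the theoretical values.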

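The chi-square value is computed as:

$$\chi^2 = \sum \frac{(A - E)^2}{E}$$

where A is the observed frequency and E is the expected frequency. The chi-square value measures how far the observed values are from the theoretical values, which is the core idea of the chi-square test.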

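The chi-square value carries two pieces of information:

1. The absolute size of the deviation between the observed and theoretical values.

2. The size of that deviation relative to the theoretical value.

Under the null hypothesis, the chi-square value computed above approximately follows a chi-square distribution. From the chi-square distribution, the chi-square statistic, and the degrees of freedom, we can determine the probability p of obtaining the current statistic or a more extreme one when the null hypothesis holds. If p is very small, the observed values deviate strongly from the theoretical values and the null hypothesis should be rejected. Otherwise the null hypothesis cannot be rejected.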

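III. A chi-square test example

A hospital treated patients with a certain disease using two different therapies, A and B. The results are shown in Table 1. Is there a difference between the two therapies?

Table 1. Comparison of the efficacy of two therapies for ovarian cancer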

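The expected frequency of each cell can be computed as (row total × column total) / overall total.

Row 1, column 1: 43×53/87=26.2

Row 1, column 2: 43×34/87=16.8

Row 2, column 1: 44×53/87=26.8

Row 2, column 2: 44×34/87=17.2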

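First set up the null hypothesis: therapies A and B do not differ. Applying the chi-square formula to the observed and expected frequencies

gives a chi-square value of 10.01.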
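The calculation can be reproduced in a few lines. The cell counts of Table 1 are not shown in the text, so the observed counts below are an assumption: they are chosen to match the stated row totals (43, 44) and column totals (53, 34), and to reproduce the reported 10.01 when the expected frequencies are rounded to one decimal as in the text. Which column is the effective outcome is not known.

```python
import numpy as np

# Assumed observed counts (rows: therapy A, therapy B; columns: the two outcomes)
observed = np.array([[19, 24],
                     [34, 10]], dtype=float)

row_totals = observed.sum(axis=1)    # 43, 44
col_totals = observed.sum(axis=0)    # 53, 34
n = observed.sum()                   # 87

# Expected frequency of each cell: row total * column total / overall total
expected = np.outer(row_totals, col_totals) / n

# Chi-square value with unrounded expected frequencies
chi2_exact = ((observed - expected) ** 2 / expected).sum()

# Chi-square value with expected frequencies rounded to one decimal, as in the text
expected_rounded = np.round(expected, 1)
chi2_rounded = ((observed - expected_rounded) ** 2 / expected_rounded).sum()

print(expected_rounded)
print(chi2_exact, chi2_rounded)
```

With unrounded expected frequencies the statistic comes out at about 10.00 rather than 10.01. The difference is rounding only and does not affect the conclusion.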

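With the chi-square value in hand, the next step is to look up the chi-square distribution table to judge the p-value and decide whether to reject the null hypothesis.

First, the degrees of freedom: k = (number of rows - 1) × (number of columns - 1). Here k = 1. Next we need the table of critical values of the chi-square distribution, which can be generated with the following code: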

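# Python financial risk-control scorecard model and data analysis micro-course: http://dwz.date/b9vv
import numpy as np
import pandas as pd
from scipy.stats import chi2

# Upper-tail probabilities p; each one becomes a column of the table
percents = [0.95, 0.90, 0.5, 0.1, 0.05, 0.025, 0.01, 0.005]
# Row i holds the critical values for i degrees of freedom (1 to 29)
df = pd.DataFrame(np.array([chi2.isf(percents, df=i) for i in range(1, 30)]))
df.columns = percents
df.index = df.index + 1
# The option key must be written in full; the bare key 'precision' is ambiguous in recent pandas
pd.set_option('display.precision', 3)
print(df)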

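From the table, the critical chi-square value for 1 degree of freedom at p = 0.05 is 3.841. In this example the chi-square value 10.01 > 3.841, so p < 0.05 and the null hypothesis can be rejected at the 0.05 significance level. In other words, the data indicate that the two therapies differ.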
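The table lookup can also be done directly with `scipy.stats.chi2`. A minimal sketch, using the statistic and degrees of freedom from this example (the variable names are illustrative):

```python
from scipy.stats import chi2

chi2_value = 10.01   # statistic from the example
dof = 1              # (2 - 1) * (2 - 1)
alpha = 0.05         # significance level

critical = chi2.isf(alpha, dof)        # upper-tail critical value
p_value = chi2.sf(chi2_value, dof)     # P(X >= chi2_value) under the null hypothesis

reject_null = p_value < alpha
print(round(critical, 3), p_value, reject_null)
```

Comparing the p-value with alpha and comparing the statistic with the critical value are equivalent decisions.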

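IV. The ChiMerge binning algorithm

The ChiMerge chi-square binning algorithm was proposed by Kerber in 1992.

It consists of two phases: an initialization phase and a bottom-up merging phase.

1. Initialization phase:

Sort the samples by attribute value (a non-continuous feature must first be converted to a number, for example its bad rate, and then sorted), then put each distinct attribute value in a group of its own.

2. Merging phase:

(1) Compute the chi-square value for every pair of adjacent groups.

(2) Merge the pair of adjacent groups with the smallest chi-square value into one group.

(3) Repeat (1) and (2) until every computed chi-square value is at least the preset threshold, or the number of groups satisfies a given condition (for example, at least 5 and at most 8 groups).

Note that I have come across implementations whose merging phase does not really compute the chi-square value of the adjacent pair. The correct calculation uses only the samples in those two groups when computing expected frequencies, whereas these implementations compute the expected frequencies of the two adjacent groups from the whole sample.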

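The figure below shows the groups of the sepal-length attribute of the well-known Iris dataset, together with the chi-square values of adjacent groups. The leftmost column is the attribute value, the three middle columns are the class frequencies, and the rightmost column is the chi-square value. This binning was obtained with a chi-square threshold of 1.4. The group with the smallest chi-square value is [6.7, 7.0), at 1.5, which is already above the threshold.

If the threshold is raised further, for example to 4.6, the bins above continue to merge, and the final binning is shown in the figure below:

Besides the threshold, chi-square binning can take further constraints, such as the number of bins, the minimum share of samples per bin, or the bad rate.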

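Python implementation of chi-square binning

The previous article introduced the basic ideas and method of chi-square binning. It was all conceptual and gave no concrete code. This article presents my implementation of the ChiMerge algorithm.

Computing the chi-square value

The function that computes the chi-square value takes a frequency table as a numpy array. For a pandas dataset, pd.crosstab produces this table directly. For example, the frequency table of the variable "total number of accounts" against the target variable "bad customer or not" is shown in the figure below:

Each row holds the frequencies of one interval (group). The first row in the figure above, for example, says that the group with a total number of accounts in [2,3) contains 3 good customers and 1 bad customer.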
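A minimal sketch of building such a frequency table with `pd.crosstab`. The toy data and the column names `total_acc` and `y` are assumptions for illustration; the first group is made to match the row described above (3 good, 1 bad).

```python
import pandas as pd

# Hypothetical toy data: total_acc is the variable to bin,
# y is 1 for a bad customer and 0 for a good customer.
data = pd.DataFrame({
    "total_acc": [2, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "y":         [0, 0, 0, 1, 0, 1, 1, 0, 0, 1],
})

# Rows: distinct values of total_acc in ascending order; columns: the classes of y
freq_table = pd.crosstab(data["total_acc"], data["y"])

freq = freq_table.to_numpy()             # numpy array for the chi-square function
cutoffs = freq_table.index.to_numpy()    # starting value of each initial group

print(freq_table)
```

Keeping the index values alongside the array preserves the starting value of every group, which the binning step needs later.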

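Convert the frequency table to a numpy array, then call the function to compute the chi-square value. Writing $A_{i,j}$ for the observed frequency in row i and column j, the logic is:

1) Compute the total of row i: $R_i = \sum_j A_{i,j}$.

2) Compute the total of column j: $C_j = \sum_i A_{i,j}$.

3) Compute the overall frequency N: $N = \sum_{i,j} A_{i,j}$.

4) Compute the expected frequency of cell i,j: $E_{i,j} = R_i C_j / N$.

5) Compute the chi-square contribution of each cell: $(A_{i,j} - E_{i,j})^2 / E_{i,j}$.

6) The expected frequency $E_{i,j}$ may be 0, in which case the result of the previous step is meaningless; it must be discarded and not counted in the final result.

7) Add up the contributions of all cells to obtain the chi-square value.

The code is as follows: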

'author:xiaodongxu&monica'
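A minimal sketch of steps 1) to 7), not the authors' original listing. The function name `chi2_value` is an assumption.

```python
import numpy as np

def chi2_value(freq):
    """Chi-square value of a frequency table (rows: groups, columns: classes)."""
    freq = np.asarray(freq, dtype=float)
    row_totals = freq.sum(axis=1, keepdims=True)   # step 1: total of each row i
    col_totals = freq.sum(axis=0, keepdims=True)   # step 2: total of each column j
    n = freq.sum()                                 # step 3: overall frequency N
    expected = row_totals * col_totals / n         # step 4: expected frequency E_ij
    # Steps 5 and 6: per-cell contribution; cells with E_ij == 0 stay at zero
    cells = np.zeros_like(freq)
    mask = expected > 0
    cells[mask] = (freq[mask] - expected[mask]) ** 2 / expected[mask]
    return cells.sum()                             # step 7: sum over all cells

# Two adjacent groups with (good, bad) counts (3, 1) and (1, 2)
example = chi2_value([[3, 1], [1, 2]])
print(example)
```

Masking the zero-expectation cells avoids a division by zero when, for example, neither of the two groups contains a bad customer.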

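The ChiMerge binning algorithm

The chi-square binning function can control the final number of bins through a maximum number of groups and a chi-square threshold.

If the call sets neither a maximum number of groups nor a threshold, the function automatically sets the threshold at the 95% confidence level.

The binning logic is:

1) Initially, every distinct value of the variable forms its own group, and the frequencies are counted.

2) Scan the groups in ascending order of their starting values, taking each pair of adjacent groups together to compute a chi-square value.

If the chi-square value just computed is smaller than the smallest value seen so far, record the current position and update the smallest value.

3) After a full scan, if the current number of groups exceeds the maximum number of groups, or the smallest chi-square value is below the threshold, merge the frequencies and the intervals of the two groups with the smallest chi-square value, then go back to step 2. Otherwise stop merging and output the cut points of the current groups.

The code is as follows: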

'author:xiaodongxu&monica'
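A minimal sketch of the logic described above, not the authors' original listing. The names `pair_chi2`, `chimerge`, `max_groups`, and `threshold` are assumptions, and the default threshold takes the degrees of freedom to be the number of classes minus 1.

```python
import numpy as np
from scipy.stats import chi2

def pair_chi2(freq):
    """Chi-square value of a small frequency table; zero-expectation cells are skipped."""
    freq = np.asarray(freq, dtype=float)
    expected = freq.sum(axis=1, keepdims=True) * freq.sum(axis=0, keepdims=True) / freq.sum()
    mask = expected > 0
    return ((freq[mask] - expected[mask]) ** 2 / expected[mask]).sum()

def chimerge(freq, cutoffs, max_groups=None, threshold=None):
    """Merge adjacent rows of freq bottom-up and return (cutoffs, freq).

    freq: one row per initial group (sorted by value), one column per class.
    cutoffs: starting value of each initial group.
    """
    freq = np.array(freq, dtype=float)   # copy, so the caller's table is not modified
    cutoffs = list(cutoffs)
    if max_groups is None and threshold is None:
        # Default: 95% confidence level
        threshold = chi2.isf(0.05, df=freq.shape[1] - 1)
    while len(cutoffs) > 1:
        # Step 2: scan adjacent pairs and remember the smallest chi-square value
        min_chi, min_idx = None, None
        for i in range(len(cutoffs) - 1):
            chi = pair_chi2(freq[i:i + 2])
            if min_chi is None or chi < min_chi:
                min_chi, min_idx = chi, i
        # Step 3: merge if there are too many groups or the closest pair is below the threshold
        too_many = max_groups is not None and len(cutoffs) > max_groups
        too_similar = threshold is not None and min_chi < threshold
        if not (too_many or too_similar):
            break
        freq[min_idx] += freq[min_idx + 1]
        freq = np.delete(freq, min_idx + 1, axis=0)
        del cutoffs[min_idx + 1]
    return cutoffs, freq

# Four initial groups with (good, bad) counts; the first two and the last two look alike
table = [[10, 0], [9, 1], [1, 9], [0, 10]]
final_cutoffs, final_freq = chimerge(table, [1, 2, 3, 4])
print(final_cutoffs)
print(final_freq)
```

This version rescans every pair after each merge, which keeps the code short. Recomputing only the two chi-square values next to the merged group would be faster on large tables.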

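Mapping variable values to groups

Once chi-square binning is finished, we have the starting value of each group's interval. For any given value x of the variable, the following function returns its group.

The code is as follows: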

'author:xiaodongxu&monica'
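A minimal sketch of such a lookup, not the authors' original listing. The name `value_to_group` is an assumption, as is the choice to label each group by its starting value and to raise an error for values below the first bin.

```python
from bisect import bisect_right

def value_to_group(x, cutoffs):
    """Return the starting value of the interval that contains x.

    cutoffs: sorted starting values of the groups,
    e.g. [2, 5, 9] for the intervals [2, 5), [5, 9) and [9, ...).
    """
    if x < cutoffs[0]:
        raise ValueError(f"{x} is below the first bin start {cutoffs[0]}")
    # bisect_right counts the starting values that are <= x;
    # the last of them is the start of the interval containing x.
    return cutoffs[bisect_right(cutoffs, x) - 1]

print(value_to_group(4.9, [2, 5, 9]))
print(value_to_group(5, [2, 5, 9]))
```

Binary search keeps the lookup fast even with many bins. The last interval is treated as open-ended here, so unusually large values fall into the final group.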

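Note that if a value x to be converted lies outside the binning range, it is very likely an outlier. The function above should not be expected to handle that case; use a dedicated outlier-handling procedure instead.

A usage example in scorecard modeling

Here is how chi-square binning is used in scorecard modeling. First, a look at the dataset.

Besides the y variable there are 3 variables: loan amount (loan_amnt, numeric), total number of accounts (total_acc, numeric), and address state (addr_state, categorical).

Bin the total number of accounts, total_acc:

Transform the variable according to the binning result to derive a new group variable:

total_acc has now been turned into a new categorical variable, total_acc_chi2_group, which can be processed further with WOE encoding and then fed into the model.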
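The derivation step can be sketched with `pd.cut`. The data values and cut points below are hypothetical; in practice the cut points come from the binning step. Labelling each group by its starting value is also an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical data and cut points
data = pd.DataFrame({"total_acc": [2, 3, 7, 12, 25, 40]})
cutoffs = [2, 8, 20]    # starting values of the groups: [2, 8), [8, 20), [20, inf)

# Close the last interval at infinity so every value at or above 20 falls in it
edges = cutoffs + [np.inf]
data["total_acc_chi2_group"] = pd.cut(
    data["total_acc"], bins=edges, right=False, labels=cutoffs
)

print(data)
```

Values below the first cut point would come out as missing, which makes out-of-range inputs easy to spot before WOE encoding.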

A practical Python script for chi-square binning

Supervised binning of a single variable in a data frame


import pandas as pd
import numpy as np

data = pd.read_csv('sample_data.csv', sep="\t", na_values=['', '?'])
temp = data[['x', 'y']]


def pair_chi2(a, b):
    '''Chi-square value of two adjacent intervals; a and b are rows [upper bound, positives, negatives].'''
    p1, n1, p2, n2 = float(a[1]), float(a[2]), float(b[1]), float(b[2])
    return (p1 * n2 - n1 * p2) ** 2 * (p1 + n1 + p2 + n2) / \
        ((p1 + n1) * (p2 + n2) * (p1 + p2) * (n1 + n2))


# Chi-square binning with a configurable chi-square threshold and number of bins.
# Merging stops once every chi-square value is at least the threshold and there are at most bin bins.
def ChiMerge(df, variable, flag, confidenceVal=3.841, bin=10, sample=None):
    '''
    Requires import pandas as pd and import numpy as np.
    df: data frame containing only the variable to bin and the sample flag (positive = 1, negative = 0)
    variable: name of the variable to bin (string)
    flag: name of the flag column (string)
    confidenceVal: chi-square threshold (default 3.841, the 95% critical value for 1 degree of freedom)
    bin: maximum number of bins
    sample: number of rows to sample (default: no sampling); many observations make the run slow
    '''
    # Optional sampling
    if sample is not None:
        df = df.sample(n=sample)

    # Prepare the data: count positives and negatives for each distinct value of the variable
    total_num = df.groupby([variable])[flag].count()  # number of rows per value
    total_num = pd.DataFrame({'total_num': total_num})
    positive_class = df.groupby([variable])[flag].sum()  # number of positives per value
    positive_class = pd.DataFrame({'positive_class': positive_class})
    regroup = pd.merge(total_num, positive_class, left_index=True, right_index=True,
                       how='inner')  # combine total_num and positive_class
    regroup.reset_index(inplace=True)
    regroup['negative_class'] = regroup['total_num'] - regroup['positive_class']  # negatives per value
    regroup = regroup.drop('total_num', axis=1)
    np_regroup = np.array(regroup)  # convert to numpy for speed; columns: value, positives, negatives
    print('Data loaded, starting preprocessing')

    # Merge consecutive intervals that both lack positives or both lack negatives
    # (otherwise the chi-square formula would divide by zero)
    i = 0
    while i <= np_regroup.shape[0] - 2:
        if ((np_regroup[i, 1] == 0 and np_regroup[i + 1, 1] == 0) or
                (np_regroup[i, 2] == 0 and np_regroup[i + 1, 2] == 0)):
            np_regroup[i, 1] = np_regroup[i, 1] + np_regroup[i + 1, 1]  # positives
            np_regroup[i, 2] = np_regroup[i, 2] + np_regroup[i + 1, 2]  # negatives
            np_regroup[i, 0] = np_regroup[i + 1, 0]
            np_regroup = np.delete(np_regroup, i + 1, 0)
            i = i - 1
        i = i + 1

    # Chi-square value of every pair of adjacent intervals
    chi_table = np.array([pair_chi2(np_regroup[i], np_regroup[i + 1])
                          for i in range(np_regroup.shape[0] - 1)])
    print('Preprocessing done, starting the chi-square merge')

    # Core of the algorithm: repeatedly merge the two intervals with the smallest chi-square value.
    # The length check prevents min() from failing once a single interval is left.
    while len(chi_table) > 0:
        if len(chi_table) <= (bin - 1) and min(chi_table) >= confidenceVal:
            break
        chi_min_index = int(np.argmin(chi_table))  # position of the smallest chi-square value
        np_regroup[chi_min_index, 1] = np_regroup[chi_min_index, 1] + np_regroup[chi_min_index + 1, 1]
        np_regroup[chi_min_index, 2] = np_regroup[chi_min_index, 2] + np_regroup[chi_min_index + 1, 2]
        np_regroup[chi_min_index, 0] = np_regroup[chi_min_index + 1, 0]
        np_regroup = np.delete(np_regroup, chi_min_index + 1, 0)

        # Recompute the chi-square value against the previous interval, if there is one
        # (without this check, index -1 would overwrite the last entry of chi_table)
        if chi_min_index > 0:
            chi_table[chi_min_index - 1] = pair_chi2(np_regroup[chi_min_index - 1], np_regroup[chi_min_index])
        if chi_min_index == np_regroup.shape[0] - 1:
            # The merged interval is now the last one: just drop the stale value
            chi_table = np.delete(chi_table, chi_min_index, axis=0)
        else:
            # Recompute the chi-square value against the next interval, then drop the stale value
            chi_table[chi_min_index] = pair_chi2(np_regroup[chi_min_index], np_regroup[chi_min_index + 1])
            chi_table = np.delete(chi_table, chi_min_index + 1, axis=0)
    print('Chi-square merge done, saving the result')

    # Save the result as a data frame
    result_data = pd.DataFrame()
    result_data['variable'] = [variable] * np_regroup.shape[0]  # column 1: variable name
    list_temp = []
    for i in np.arange(np_regroup.shape[0]):
        if i == 0:
            x = '0' + ',' + str(np_regroup[i, 0])  # lower bound 0 assumes non-negative values
        elif i == np_regroup.shape[0] - 1:
            x = str(np_regroup[i - 1, 0]) + '+'
        else:
            x = str(np_regroup[i - 1, 0]) + ',' + str(np_regroup[i, 0])
        list_temp.append(x)
    result_data['interval'] = list_temp  # column 2: interval
    result_data['flag_0'] = np_regroup[:, 2]  # column 3: number of negatives
    result_data['flag_1'] = np_regroup[:, 1]  # column 4: number of positives
    return result_data


# Example call
bins = ChiMerge(temp, 'x', 'y', confidenceVal=3.841, bin=10, sample=None)
print(bins)
