Python Machine Learning: Breast Cancer Cell Mining (Part 5)

Python machine learning, breast cancer cell mining: http://dwz.date/bwey

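Model validation
Building a classifier is not the end of the job. Once the model is trained, it still has to be checked with cross-validation, AUC, Gini, KS, and a gain table.
KS measures how well the model separates good customers from bad ones. If some score band separates them strongly, KS will be greater than 0.2.
AUC measures the overall quality of the classifier. The more sensitive the classifier, the larger the AUC; as a rule of thumb, an AUC above 0.7 indicates a reasonably accurate classifier.
A gain table shows the model's payoff and how well it ranks cases.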
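The KS and gain-table checks listed above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the helper names `ks_statistic` and `gain_table`, the choice of five equal-size bins, and the toy labels and scores are all assumptions made for the example. KS is taken here as the largest gap between the true positive rate and the false positive rate.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve


def ks_statistic(y_true, y_scores):
    """KS taken as the largest gap between TPR and FPR over all thresholds."""
    fpr, tpr, _ = roc_curve(y_true, y_scores)
    return float(np.max(tpr - fpr))


def gain_table(y_true, y_scores, n_bins=5):
    """Rank rows by score (highest first), cut into equal-size bins, summarise each bin."""
    df = pd.DataFrame({"y": np.asarray(y_true), "score": np.asarray(y_scores)})
    df = df.sort_values("score", ascending=False).reset_index(drop=True)
    # Bin 0 holds the highest-scoring rows.
    df["bin"] = np.arange(len(df)) * n_bins // len(df)
    table = df.groupby("bin").agg(count=("y", "size"), positives=("y", "sum"))
    table["positive_rate"] = table["positives"] / table["count"]
    # Share of all positives captured up to and including this bin.
    table["cum_capture"] = table["positives"].cumsum() / table["positives"].sum()
    return table


# Toy data, made up for illustration only.
y_true = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
y_scores = [0.95, 0.9, 0.8, 0.75, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1]

ks = ks_statistic(y_true, y_scores)
table = gain_table(y_true, y_scores, n_bins=5)
print("KS:", round(ks, 3))
print(table)
```

A model that ranks well concentrates positives in the first bins, so `cum_capture` climbs quickly at the top of the table.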
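For model validation, the data should be split into train (training), test (testing), and oot (out-of-time) sets.
Train and test come from the same time period and are usually split 70/30: 70% for train and 30% for test.
The oot set covers a period after train and test and is used to check the model on future data.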
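A train/test/oot split of this kind might look like the sketch below. The column names `apply_date`, `x`, and `y`, the synthetic rows, and the cutoff date are all hypothetical; the breast cancer dataset used later in this post has no time column, so there is no oot set in that example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 daily records with a made-up feature and label.
df = pd.DataFrame({
    "apply_date": pd.date_range("2018-01-01", periods=100, freq="D"),
    "x": range(100),
    "y": [i % 2 for i in range(100)],
})

# Assumed cutoff: everything on or after this date is held out as oot.
cutoff = pd.Timestamp("2018-03-22")
in_time = df[df["apply_date"] < cutoff]
oot = df[df["apply_date"] >= cutoff]

# 70/30 split inside the in-time period, stratified on the label.
train, test = train_test_split(
    in_time, test_size=0.3, stratify=in_time["y"], random_state=42)

print(len(train), len(test), len(oot))
```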

The figure below visualizes model validation.
It covers four metrics: ROC, lift chart, KS, and PSI.

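To keep this short, only the ROC/AUC check is covered in detail.
There are two ways to compute the AUC score. The first takes the target variable y_true and the predicted scores or probabilities y_scores, and computes AUC with roc_auc_score(y_true, y_scores).
The second first obtains fpr and tpr, then computes AUC with auc(fpr, tpr).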

Plotting ROC in Excel

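ROC assumes that a higher score means a higher positive rate. In risk-control models, however, some scores work the other way: the lower the score, the higher the probability of a bad customer (the Miguan, or "honeypot", score is one example). The ROC curve then comes out inverted, and the positive label has to be flipped with pos_label=0.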

Because a lower score means a higher probability of a bad customer, the ROC curve as drawn is inverted and needs to be corrected.
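The inversion can be seen on a small made-up example. The labels and scores below are assumptions for illustration (1 = bad customer, lower score = higher risk). Passing pos_label=0 treats the good customers as the positive class; negating the scores is another way to get the same corrected AUC.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy data: 1 = bad customer; bad customers tend to have LOW scores.
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0])
scores = np.array([300, 350, 420, 450, 500, 610, 700, 780])

# Raw scores with bad customers as the positive class: the curve is inverted.
fpr, tpr, _ = roc_curve(y_true, scores, pos_label=1)
auc_raw = auc(fpr, tpr)

# Fix 1: flip the positive label, as described above.
fpr_p, tpr_p, _ = roc_curve(y_true, scores, pos_label=0)
auc_pos0 = auc(fpr_p, tpr_p)

# Fix 2: keep pos_label=1 but negate the scores so that higher means riskier.
fpr_n, tpr_n, _ = roc_curve(y_true, -scores, pos_label=1)
auc_neg = auc(fpr_n, tpr_n)

print("raw:", auc_raw, "pos_label=0:", auc_pos0, "negated scores:", auc_neg)
```

An AUC well below 0.5 on raw scores is usually a sign of this kind of direction mismatch rather than of a model that has learned nothing.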

AUC/ROC validation code
# -*- coding: utf-8 -*-
"""
Created on Thu Apr 12 22:31:31 2018
WeChat official account: pythonEducation

@author: 231469242@qq.com
"""
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()

# Stratified split so that train and test keep the same class ratio
X_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print("accuracy on the training subset:{:.3f}".format(knn.score(X_train, y_train)))
print("accuracy on the test subset:{:.3f}".format(knn.score(x_test, y_test)))

# AUC validation, using the test set.
# Column 1 of predict_proba is the probability of class 1. In
# load_breast_cancer, class 1 is "benign" and class 0 is "malignant".
y_scores = knn.predict_proba(x_test)[:, 1]
y_true = y_test

# Method 1: compute AUC directly from y_true and the predicted probabilities
AUC = roc_auc_score(y_true, y_scores)
print("AUC:", AUC)

# Method 2: get fpr and tpr from roc_curve, then compute AUC with auc(fpr, tpr)
fpr, tpr, thresholds = roc_curve(y_true, y_scores, pos_label=1)
AUC1 = auc(fpr, tpr)

#print("fpr:",fpr)
#print("tpr:",tpr)
#print("thresholds:",thresholds)
print("AUC1:", AUC1)

# Rule-of-thumb interpretation of the AUC value
if AUC >= 0.7:
    print("good classifier")
elif AUC > 0.6:
    print("not very good classifier")
elif AUC > 0.5:
    print("useless classifier")
else:
    print("bad classifier,with sorting problems")

# Plot the ROC curve
# Diagonal reference line (random classifier)
plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Diagonal line')
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % AUC)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc="lower right")
plt.show()
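The script above scores a single train/test split. Cross-validation, mentioned at the start as one of the checks, gives a less split-dependent estimate. The sketch below is one possible way to do it, assuming 5 stratified folds and the same default KNeighborsClassifier; the fold count and random seed are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()

# 5 stratified folds (an assumed setting), shuffled with a fixed seed.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# scoring="roc_auc" evaluates each held-out fold with AUC instead of accuracy.
scores = cross_val_score(
    KNeighborsClassifier(), cancer.data, cancer.target, cv=cv, scoring="roc_auc")

print("AUC per fold:", scores)
print("mean AUC: {:.3f}".format(scores.mean()))
```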
Model validation is a broad topic, and this is only a brief introduction. You are welcome to take my Python machine learning and bioinformatics course series: http://dwz.date/b9vw
