Implementing Stepwise Regression in Python

The basic idea of stepwise regression is to introduce variables into the model one at a time. Each time an explanatory variable is added, an F-test is performed, and t-tests are run on the variables already selected; when a previously introduced variable is no longer significant because of later additions, it is removed. This ensures that, before each new variable enters, the regression equation contains only significant variables. The process repeats until no significant variable remains to be added and no insignificant variable remains to be removed, so that the final set of explanatory variables is optimal.
Following this idea, stepwise regression can be used to screen out and remove the variables that cause multicollinearity. The procedure is: first regress the response on each candidate explanatory variable separately; then, starting from the regression whose explanatory variable contributes most to the response, introduce the remaining variables step by step. After stepwise regression, the variables kept in the model are both important and free of serious multicollinearity.
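As a side note, one quick way to see whether stepwise selection actually reduced multicollinearity is to compare variance inflation factors before and after selection. A minimal sketch with statsmodels (the helper vif_table is ours, not from the original text):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X):
    # X: DataFrame of candidate explanatory variables.
    # VIF values well above 10 are a common rule of thumb for serious multicollinearity.
    Xc = sm.add_constant(X)
    return pd.Series([variance_inflation_factor(Xc.values, i)
                      for i in range(1, Xc.shape[1])], index=X.columns)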
Better algorithms now exist for getting around the collinearity problem, so there is no need to put stepwise-regression code into commercial scripts. In practice, variables are correlated with one another, so deleting variables along stepwise-regression lines is not the best way to resolve collinearity. For better algorithms, see the course 《python風(fēng)控建模實(shí)戰(zhàn)lendingClub》: http://dwz.date/b626
As an academic discussion, this article introduces the principle of stepwise regression along with Python code. When a linear regression model trained by ordinary least squares overfits, least squares itself has no way to stop the learning process. Forward stepwise regression was introduced to control this overfitting; it is an improvement, or adjustment, of least squares. Its basic idea is to introduce variables into the model from few to many, adding one at a time until there are no variables left to add, and finally to select the model by comparing the errors computed on a held-out sample.
The implementation code is as follows:
# Import the packages and functions we need
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
from math import sqrt
import matplotlib.pyplot as plt

# Read in the wine dataset
wine_data = pd.read_csv('wine.csv')
wine_data.head()
The table structure of wine_data is shown in the figure below (figure omitted).

# Look at the summary statistics of the wine dataset
wine_data.describe()

The summary statistics of some attributes of wine_data (figure omitted):
# Define a function that pulls the specified columns out of the input dataset to
# form training and test sets (from 1 column up to all 11 columns):
def xattrSelect(x, idxSet):
    xOut = []
    for row in x:
        xOut.append([row[i] for i in idxSet])
    return(xOut)
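For example, xattrSelect([[1, 2, 3], [4, 5, 6]], [0, 2]) returns [[1, 3], [4, 6]]: each row keeps only the columns listed in idxSet, in the given order.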
xList = []                                  # list that will hold the attribute rows
labels = [float(label) for label in wine_data.iloc[:, -1].tolist()]  # extract the labels of wine_data into a list
names = wine_data.columns.tolist()          # extract the names of all attributes of wine_data into a list
for i in range(len(wine_data)):
    xList.append(wine_data.iloc[i, 0:-1])   # each element of xList is a row of wine_data without the label column
# Split the original dataset into a training set (2/3) and a test set (1/3):
indices = range(len(xList))
xListTest = [xList[i] for i in indices if i % 3 == 0]
xListTrain = [xList[i] for i in indices if i % 3 != 0]
labelsTest = [labels[i] for i in indices if i % 3 == 0]
labelsTrain = [labels[i] for i in indices if i % 3 != 0]
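The same 2/3-1/3 split can also be done with scikit-learn's train_test_split; a sketch (random rather than the deterministic every-third-row split above, so the numbers below will differ slightly):

from sklearn.model_selection import train_test_split

xListTrain2, xListTest2, labelsTrain2, labelsTest2 = train_test_split(
    xList, labels, test_size=1/3, random_state=0)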
attributeList = []            # list that will hold the indices of the selected attributes
index = range(len(xList[1]))  # index drives the outer for loop below
indexSet = set(index)         # set of the indices of all attributes in names
oosError = []                 # smallest RMSE found on each pass of the inner for loop
for i in index:
    attSet = set(attributeList)
    attTrySet = indexSet - attSet       # indices of the attributes not yet in attributeList
    attTry = [ii for ii in attTrySet]   # list of the candidate attribute indices in attTrySet
    errorList = []
    attTemp = []

    for iTry in attTry:
        attTemp = [] + attributeList
        attTemp.append(iTry)

        # Use xattrSelect to pull the chosen columns out of xListTrain and xListTest
        # to form temporary training and test sets
        xTrainTemp = xattrSelect(xListTrain, attTemp)
        xTestTemp = xattrSelect(xListTest, attTemp)

        # Convert the training and test sets to array objects
        xTrain = np.array(xTrainTemp)
        yTrain = np.array(labelsTrain)
        xTest = np.array(xTestTemp)
        yTest = np.array(labelsTest)

        # Train a linear regression model with scikit-learn
        wineQModel = linear_model.LinearRegression()
        wineQModel.fit(xTrain, yTrain)

        # Compute the RMSE on the test set via the vector 2-norm
        rmsError = np.linalg.norm((yTest - wineQModel.predict(xTest)), 2) / sqrt(len(yTest))
        errorList.append(rmsError)
        attTemp = []

    iBest = np.argmin(errorList)           # position of the smallest value in errorList
    attributeList.append(attTry[iBest])    # use iBest to append the corresponding attribute index to attributeList
    oosError.append(errorList[iBest])      # record the smallest RMSE of this pass in oosError
print("Out of sample error versus attribute set size")
print(oosError)
print("\n" + "Best attribute indices")
print(attributeList)
namesList = [names[i] for i in attributeList]
print("\n" + "Best attribute names")
print(namesList)
The output of the above code is:

Out of sample error versus attribute set size
[0.7234259255146811, 0.6860993152860264, 0.6734365033445415, 0.6677033213922949, 0.6622558568543597, 0.6590004754171893, 0.6572717206161052, 0.6570905806225685, 0.6569993096464123, 0.6575818940063175, 0.6573909869028242]

Best attribute indices
[10, 1, 9, 4, 6, 8, 5, 3, 2, 7, 0]

Best attribute names
['alcohol', 'volatile acidity', 'sulphates', 'chlorides', 'total sulfur dioxide', 'pH', 'free sulfur dioxide', 'residual sugar', 'citric acid', 'density', 'fixed acidity']
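For comparison (a sketch, not part of the original code): scikit-learn >= 0.24 ships SequentialFeatureSelector, which performs the same kind of forward selection but scores candidate subsets by cross-validation rather than by a single fixed held-out split, so the selected attributes may differ slightly; n_features_to_select=9 mirrors the best subset size found above.

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=9,
                                direction='forward',
                                scoring='neg_root_mean_squared_error')
sfs.fit(np.array(xList), np.array(labels))
print(sfs.get_support(indices=True))  # indices of the selected attributes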
# Plot the test-set RMSE of the linear regression models against the number of attributes used
x = range(len(oosError))
plt.plot(x, oosError, 'k')
plt.xlabel('Number of Attributes')
plt.ylabel('Error (RMS)')
plt.show()
# Plot a histogram of the test-set errors of the model built from the best number of attributes
indexBest = oosError.index(min(oosError))
attributesBest = attributeList[:(indexBest + 1)]  # the attribute indices of the best subset (9 attributes here)
# Use xattrSelect to take the best-subset columns out of xListTrain and xListTest
xTrainTemp = xattrSelect(xListTrain, attributesBest)
xTestTemp = xattrSelect(xListTest, attributesBest)
xTrain = np.array(xTrainTemp)   # convert xTrain to an array object
xTest = np.array(xTestTemp)     # convert xTest to an array object

# Train the model and plot the histogram (yTrain and yTest were set in the selection loop above)
wineQModel = linear_model.LinearRegression()
wineQModel.fit(xTrain, yTrain)
errorVector = yTest - wineQModel.predict(xTest)
plt.hist(errorVector)
plt.xlabel("Bin Boundaries")
plt.ylabel("Counts")
plt.show()
# Scatter plot of actual versus predicted wine taste scores
plt.scatter(wineQModel.predict(xTest), yTest, s=100, alpha=0.10)
plt.xlabel('Predicted Taste Score')
plt.ylabel('Actual Taste Score')
plt.show()
The plots drawn by the code above (figures omitted): the RMSE versus the number of attributes, the histogram of the errors, and the scatter plot of actual versus predicted values.
Stepwise regression exercise and code
python信用评分卡建模(附代码): https://ke.qq.com/course/3063615?tuin=dcbf0ba
from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2, see the sketch below
import pandas as pd
import numpy as np
import statsmodels.api as sm

data = load_boston()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

def stepwise_selection(X, y,
                       initial_list=[],
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    """ Perform a forward-backward feature selection
    based on p-value from statsmodels.api.OLS
    Arguments:
        X - pandas.DataFrame with candidate features
        y - list-like with the target
        initial_list - list of features to start with (column names of X)
        threshold_in - include a feature if its p-value < threshold_in
        threshold_out - exclude a feature if its p-value > threshold_out
        verbose - whether to print the sequence of inclusions and exclusions
    Returns: list of selected features
    Always set threshold_in < threshold_out to avoid infinite looping.
    See https://en.wikipedia.org/wiki/Stepwise_regression for the details
    """
    included = list(initial_list)

    while True:
        changed = False
        # forward step
        excluded = list(set(X.columns) - set(included))
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included + [new_column]]))).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed = True
            if verbose:
                print('Add  {:30} with p-value {:.6}'.format(best_feature, best_pval))
        # backward step
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
        # use all coefs except intercept
        pvalues = model.pvalues.iloc[1:]
        worst_pval = pvalues.max()  # null if pvalues is empty
        if worst_pval > threshold_out:
            changed = True
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
            if verbose:
                print('Drop {:30} with p-value {:.6}'.format(worst_feature, worst_pval))
        if not changed:
            break
    return included

result = stepwise_selection(X, y)

print('resulting features:')
print(result)
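Since load_boston was removed in scikit-learn 1.2, the three data-loading lines above need a replacement on current versions; a sketch using the California housing data instead:

from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target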
逐步回歸的基本思想是將變量逐個(gè)引入模型,每引入一個(gè)解釋變量后都要進(jìn)行F檢驗(yàn),并對(duì)已經(jīng)選入的解釋變量逐個(gè)進(jìn)行t檢驗(yàn),當(dāng)原來引入的解釋變量由于后面解釋變量的引入變得不再顯著時(shí),則將其刪除。以確保每次引入新的變量之前回歸方程中只包含顯著性變量。這是一個(gè)反復(fù)的過程,直到既沒有顯著的解釋變量選入回歸方程,也沒有不顯著的解釋變量從回歸方程中剔除為止。以保證最后所得到的解釋變量集是最優(yōu)的。
本例的逐步回歸則有所變化,沒有對(duì)已經(jīng)引入的變量進(jìn)行t檢驗(yàn),只判斷變量是否引入和變量是否剔除,“雙重檢驗(yàn)”逐步回歸,簡稱逐步回歸。例子的鏈接:(原鏈接已經(jīng)失效),4項(xiàng)自變量,1項(xiàng)因變量。下文不再進(jìn)行數(shù)學(xué)推理,進(jìn)對(duì)計(jì)算過程進(jìn)行說明,對(duì)數(shù)學(xué)理論不明白的可以參考《現(xiàn)代中長期水文預(yù)報(bào)方法及其應(yīng)用》湯成友,官學(xué)文,張世明著;論文《逐步回歸模型在大壩預(yù)測中的應(yīng)用》王曉蕾等;
逐步回歸的計(jì)算步驟:
計(jì)算第零步增廣矩陣。第零步增廣矩陣是由預(yù)測因子和預(yù)測對(duì)象兩兩之間的相關(guān)系數(shù)構(gòu)成的。
引進(jìn)因子。在增廣矩陣的基礎(chǔ)上,計(jì)算每個(gè)因子的方差貢獻(xiàn),挑選出沒有進(jìn)入方程的因子中方差貢獻(xiàn)最大者對(duì)應(yīng)的因子,計(jì)算該因子的方差比,查F分布表確定該因子是否引入方程。
剔除因子。計(jì)算此時(shí)方程中已經(jīng)引入的因子的方差貢獻(xiàn),挑選出方差貢獻(xiàn)最小的因子,計(jì)算該因子的方差比,查F分布表確定該因子是否從方程中剔除。
矩陣變換。將第零步矩陣按照引入方程的因子序號(hào)進(jìn)行矩陣變換,變換后的矩陣再次進(jìn)行引進(jìn)因子和剔除因子的步驟,直到無因子可以引進(jìn),也無因子可以剔除為止,終止逐步回歸分析計(jì)算。
a. The code below reads the data and defines the subroutines that compute the correlation coefficients and build the step-zero augmented matrix.
Note: pandas reads the csv into a DataFrame; here it is converted to a numpy ndarray (n-dimensional array) for the computation.
import numpy as np
import pandas as pd

# Read the data
# pandas reads the csv into a DataFrame object
data = pd.read_csv('sn.csv')
# Convert the DataFrame to an array; the last column is the predictand
data = data.values.copy()
# print(data)

# Compute the correlation coefficient between two series
def get_regre_coef(X, Y):
    S_xy = 0
    S_xx = 0
    S_yy = 0
    # Means of the predictor and the predictand
    X_mean = np.mean(X)
    Y_mean = np.mean(Y)
    for i in range(len(X)):
        S_xy += (X[i] - X_mean) * (Y[i] - Y_mean)
        S_xx += pow(X[i] - X_mean, 2)
        S_yy += pow(Y[i] - Y_mean, 2)
    return S_xy / pow(S_xx * S_yy, 0.5)

# Build the original (step-zero) augmented matrix
def get_original_matrix():
    # data.shape is a tuple giving the number of rows and columns
    col = data.shape[1]
    r = np.ones((col, col))  # np.ones takes a tuple as its argument
    for i in range(col):
        for j in range(col):
            r[i, j] = get_regre_coef(data[:, i], data[:, j])
    return r
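As a quick sanity check (not in the original text), the handwritten coefficient should agree with numpy's built-in corrcoef:

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.9])
print(get_regre_coef(x, y))     # handwritten version
print(np.corrcoef(x, y)[0, 1])  # numpy's built-in; the two should match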
b. The second part computes the variance contributions and the variance ratio.
def get_vari_contri(r):
    col = data.shape[1]
    # Matrix to store the variance contribution values
    v = np.ones((1, col - 1))
    for i in range(col - 1):
        v[0, i] = pow(r[i, col - 1], 2) / r[i, i]
    return v

# Decide whether a factor should enter the equation.
# Arguments: r is the augmented matrix, v the variance contributions,
# k the index of the factor with the largest variance contribution,
# p the number of factors currently in the equation
def select_factor(r, v, k, p):
    row = data.shape[0]      # sample size
    col = data.shape[1] - 1  # number of predictors
    # Compute the variance ratio
    f = (row - p - 2) * v[0, k - 1] / (r[col, col] - v[0, k - 1])
    return f
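In formula form, select_factor computes the variance ratio F = (n - p - 2) · V_k / (r_yy - V_k), where n is the sample size, p the number of factors already in the equation, V_k the variance contribution of factor k, and r_yy the bottom-right element of the current augmented matrix, i.e. the remaining variance of the predictand.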
c. The third part calls the functions defined above to compute the variance contributions.

# Compute the step-zero augmented matrix
r = get_original_matrix()
# print(r)
# Compute the variance contributions
v = get_vari_contri(r)
print(v)
# Compute the variance ratio
The computed variance contributions (figure omitted).
No subroutine is written here to pick out the largest variance contribution automatically, because in other computations the concrete physical meaning of the variables is still needed, so the choice of variables cannot be decided purely by computation. Here the fourth variable has the largest variance contribution.
# Compute the variance ratio
# print(data.shape[0])
f = select_factor(r, v, 4, 0)
print(f)
####### output ##########
22.79852020138227
Compute the variance ratio of the fourth predictor (pasted into the code above) and compare it with the F distribution table value 3.280: since 22.8 > 3.28, the fourth predictor is introduced. (The first three passes skip the factor-removal calculation.)
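Instead of a printed F table, the critical value can be looked up with scipy; a sketch (the significance level alpha and the degrees of freedom are assumptions to be matched to the table value 3.280 used in this text):

from scipy import stats

def f_critical(alpha, n, p):
    # one factor is tested at a time: dfn = 1, dfd = n - p - 2 as in select_factor
    return stats.f.ppf(1 - alpha, dfn=1, dfd=n - p - 2)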
d. The fourth part performs the matrix transformation.
# Stepwise regression analysis and computation
# Use the transformation formulas to compute each element of the new augmented matrix
def convert_matrix(r, k):
    col = data.shape[1]
    k = k - 1  # count rows from zero
    r1 = np.ones((col, col))  # np.ones takes a tuple as its argument
    for i in range(col):
        for j in range(col):
            if (i == k and j != k):    # elements of row k outside column k
                r1[i, j] = r[k, j] / r[k, k]
            elif (i != k and j != k):  # elements outside row k and column k
                r1[i, j] = r[i, j] - r[i, k] * r[k, j] / r[k, k]
            elif (i != k and j == k):  # elements of column k outside row k
                r1[i, j] = -r[i, k] / r[k, k]
            else:                      # the pivot element itself
                r1[i, j] = 1 / r[k, k]
    return r1
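Written out, convert_matrix applies the standard sweep transformation on pivot k:

r'_kj = r_kj / r_kk                (j ≠ k)
r'_ij = r_ij - r_ik * r_kj / r_kk  (i ≠ k, j ≠ k)
r'_ik = -r_ik / r_kk               (i ≠ k)
r'_kk = 1 / r_kk

A useful property is that sweeping the same pivot twice restores the original matrix, which is what lets the removal step later take a factor back out of the equation.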
e. After the matrix transformation, the introduction and removal steps above are repeated in a loop.
Compute the variance contributions of each factor again (figure omitted).
Ranking the three factors not yet in the equation, the first factor has the largest variance contribution; compute its F test value, which exceeds the critical value, so the first predictor enters the equation.
# Matrix transformation: compute the step-one matrix
r = convert_matrix(r, 4)
# print(r)
# Compute the step-one variance contributions
v = get_vari_contri(r)
# print(v)
f = select_factor(r, v, 1, 1)
print(f)
######### output #####
108.22390933074443
Transform the matrix and compute the variance contributions again (figure omitted).
Of the factors not yet in the equation, 2 and 3, factor 2 has the larger variance contribution; its F test value is 5.026 > 3.28, so predictor 2 is introduced.
f = select_factor(r, v, 2, 2)
print(f)
########## output #########
5.025864648951804
Continue with the matrix transformation and compute the variance contributions (figure omitted).
At this step factor removal has to be considered. From the variance contributions, the factor already in the equation with the smallest contribution is factor 4. Compute the introduction F test value of factor 3, 0.0183, and the removal F test value of factor 4, 1.863; both are below 3.28 (from the F distribution table), so factor 3 cannot be introduced and factor 4 must be removed, leaving 2 factors in the equation.
# Decide whether a factor should be removed from the equation.
# Arguments: r is the augmented matrix, v the variance contributions,
# k the index of the factor with the smallest variance contribution,
# t the number of factors currently in the equation
def delete_factor(r, v, k, t):
    row = data.shape[0]      # sample size
    col = data.shape[1] - 1  # number of predictors
    # Compute the variance ratio
    f = (row - t - 1) * v[0, k - 1] / r[col, col]
    return f

# Introduction test value of factor 3: 0.018233473487350636
f = select_factor(r, v, 3, 3)
print(f)
# Removal test value of factor 4: 1.863262422188088
f = delete_factor(r, v, 4, 3)
print(f)
Transform the matrix once more and compute the variance contributions (figure omitted).
Of the factors already in the equation (factors 1 and 2), factor 1 has the smallest variance contribution; of the factors not yet in the equation, factor 4 has the largest. Compute the introduction F test value of the latter and the removal F test value of the former.
# Introduction test value of factor 4: 1.8632624221880876, below 3.28, so it cannot be introduced
f = select_factor(r, v, 4, 2)
print(f)
# Removal test value of factor 1: 146.52265486251397, above 3.28, so it cannot be removed
f = delete_factor(r, v, 1, 2)
print(f)
No variable can be removed and none can be introduced, so the stepwise regression stops here. The factors in the equation are predictors 1 and 2. Using the multiple regression from the previous post, regress the predictand on the factors that entered the equation and output the fitted coefficients: in order, the constant term, the coefficient of the first factor, and the coefficient of the second factor.
# Factors 1 and 2 enter the equation
# Run a multiple regression on the predictors that entered the equation
# regs = LinearRegression()
X = data[:, 0:2]
Y = data[:, 4]
X = np.mat(np.c_[np.ones(X.shape[0]), X])  # prepend a column of ones for the constant term
Y = np.mat(Y)                              # convert the array to a matrix
# print(X)
B = np.linalg.inv(X.T * X) * (X.T) * (Y.T)  # normal equations: B = (X'X)^(-1) X'Y
print(B.T)  # output the coefficients: the first entry is the constant term, the rest are regression coefficients
### output ##
# [[52.57734888  1.46830574  0.66225049]]
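The same fit can be obtained with numpy's least-squares solver, which is numerically preferable to forming the normal equations explicitly; a sketch:

B2, residuals, rank, sv = np.linalg.lstsq(np.asarray(X), np.asarray(Y).ravel(), rcond=None)
print(B2)  # should agree with B.T above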
References:
《Python机器学习——预测分析核心算法》, Michael Bowles
https://baike.baidu.com/item/%E9%80%90%E6%AD%A5%E5%9B%9E%E5%BD%92/585832?fr=aladdin
https://blog.csdn.net/qq_41080850/article/details/86506408
https://blog.csdn.net/qq_41080850/article/details/86764534
https://blog.csdn.net/Will_Zhan/article/details/83311049
To learn more, you are welcome to sign up for the course on Python financial risk-control scorecard models and data analysis: http://dwz.date/b9vv