Claim Frequency Prediction with Generalized Linear Models in R: Overdispersion, Exposure, and Tree Visualization
Original link: http://tecdat.cn/?p=13963

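In actuarial science and insurance ratemaking, taking exposure into account can be a nightmare. Results that should be simple become more complicated to compute, just because we have to account for the fact that exposure is a heterogeneous variable.

Exposure in insurance ratemaking can be seen as a censored-data problem (in my dataset, exposure is always less than 1, because the observations are contracts, not policyholders). The variable of interest is unobserved, because we have to price an insurance contract for a coverage period of one full year. So we have to model the annual frequency of insurance claims.
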
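In our dataset, we consider the ratio of the total number of claims to the total exposure. For instance, if we consider a Poisson process with annual intensity $\lambda$, where $Y_i$ is the number of claims and $E_i$ the exposure of contract $i$, the likelihood is
$$\mathcal{L}(\lambda; Y, E) = \prod_{i=1}^{n} \frac{e^{-\lambda E_i} (\lambda E_i)^{Y_i}}{Y_i!}$$
i.e.
$$\log \mathcal{L}(\lambda; Y, E) = -\lambda \sum_{i=1}^{n} E_i + \sum_{i=1}^{n} Y_i \log(\lambda E_i) - \sum_{i=1}^{n} \log(Y_i!)$$
Setting the derivative with respect to $\lambda$ to zero gives
$$\hat{\lambda} = \frac{\sum_{i=1}^{n} Y_i}{\sum_{i=1}^{n} E_i} = \sum_{i=1}^{n} \omega_i \frac{Y_i}{E_i}, \qquad \omega_i = \frac{E_i}{\sum_{j=1}^{n} E_j}$$
So we have an estimate of the expected value, and it is a natural one: the exposure-weighted average of the annualized counts.
Now we need to estimate the variance, or more precisely the conditional variance. The estimator used in the code that follows is
$$\hat{V} = \frac{\sum_{i=1}^{n} (Y_i - \hat{\lambda} E_i)^2}{\sum_{i=1}^{n} E_i}$$
This can be used to check whether the Poisson assumption is valid for modelling frequency. Consider the following dataset,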
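> nombre=rbind(nombre1,nombre2)
> baseFREQ = merge(contrat,nombre)
Here we have two variables of interest, namely the exposure of each contract,
> E <- baseFREQ$exposition
and the (observed) number of claims (during that period)
> Y <- baseFREQ$nbre
Without covariates, we can compute the average (annual) number of claims per contract, and the associated variance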
> (mean=weighted.mean(Y/E,E))
[1] 0.07279295
> (variance=sum((Y-mean*E)^2)/sum(E))
[1] 0.08778567
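The two estimators above can be checked on simulated data. The following is a minimal sketch, assuming a hypothetical portfolio with a true annual frequency of 0.07 (the variables `n`, `lambda_hat`, `ratio` and `variance_hat` are illustrative names, not part of the original dataset). It shows that the exposure-weighted mean of `Y/E` is the same quantity as total claims over total exposure, and that the variance estimate stays close to the mean when the data really are Poisson.

```r
# Simulated portfolio (hypothetical data, not the article's dataset)
set.seed(1)
n <- 10000
E <- runif(n, 0.05, 1)            # exposure in years, always below 1
Y <- rpois(n, lambda = 0.07 * E)  # claim counts, true annual frequency 0.07

# Exposure-weighted average of the annualized counts Y/E ...
lambda_hat <- weighted.mean(Y / E, E)
# ... which is algebraically total claims over total exposure
ratio <- sum(Y) / sum(E)

# Empirical variance on the same annual scale
variance_hat <- sum((Y - lambda_hat * E)^2) / sum(E)

print(c(mean = lambda_hat, ratio = ratio, variance = variance_hat))
```

On real data, a variance clearly above the mean is the first hint of overdispersion.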
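It looks like the variance is (slightly) larger than the mean (we will see in a few weeks how to test this more formally). A covariate can be added, for instance the population density of the area where the policyholder lives,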
Density, zone 11 average = 0.07962411  variance = 0.08711477
Density, zone 21 average = 0.05294927  variance = 0.07378567
Density, zone 22 average = 0.09330982  variance = 0.09582698
Density, zone 23 average = 0.06918033  variance = 0.07641805
Density, zone 24 average = 0.06004009  variance = 0.06293811
Density, zone 25 average = 0.06577788  variance = 0.06726093
Density, zone 26 average = 0.0688496   variance = 0.07126078
Density, zone 31 average = 0.07725273  variance = 0.09067
Density, zone 41 average = 0.03649222  variance = 0.03914317
Density, zone 42 average = 0.08333333  variance = 0.1004027
Density, zone 43 average = 0.07304602  variance = 0.07209618
Density, zone 52 average = 0.06893741  variance = 0.07178091
Density, zone 53 average = 0.07725661  variance = 0.07811935
Density, zone 54 average = 0.07816105  variance = 0.08947993
Density, zone 72 average = 0.08579731  variance = 0.09693305
Density, zone 73 average = 0.04943033  variance = 0.04835521
Density, zone 74 average = 0.1188611   variance = 0.1221675
Density, zone 82 average = 0.09345635  variance = 0.09917425
Density, zone 83 average = 0.04299708  variance = 0.05259835
Density, zone 91 average = 0.07468126  variance = 0.3045718
Density, zone 93 average = 0.08197912  variance = 0.09350102
Density, zone 94 average = 0.03140971  variance = 0.04672329
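The article does not show how the per-zone figures were produced. One possible way, sketched here on simulated data with three hypothetical zones and made-up frequencies, is to apply the same two estimators within each level of the factor. The names `meani`, `variancei` and `Ei` match those used in the plotting code, but how the author actually built them is an assumption.

```r
# Simulated data with a zone factor (hypothetical zones and frequencies)
set.seed(2)
n <- 20000
zone <- factor(sample(c("11", "21", "22"), n, replace = TRUE))
E <- runif(n, 0.05, 1)
rate <- c("11" = 0.08, "21" = 0.05, "22" = 0.09)[as.character(zone)]
Y <- rpois(n, rate * E)

# Total exposure, annualized mean and variance within each zone
Ei <- tapply(E, zone, sum)
meani <- tapply(Y, zone, sum) / Ei
variancei <- tapply((Y - meani[as.character(zone)] * E)^2, zone, sum) / Ei

for (z in levels(zone)) {
  cat("Density, zone", z, "average =", meani[z], " variance =", variancei[z], "\n")
}
```

Dividing by the group exposure rather than by the number of contracts keeps both quantities on an annual scale, so they can be compared directly with the Poisson benchmark of equal mean and variance.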
This information can be visualized
> plot(meani,variancei,cex=sqrt(Ei),col="grey",pch=19,
+ xlab="Empirical average",ylab="Empirical variance")
> points(meani,variancei,cex=sqrt(Ei))
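
The size of each circle is related to the size of the group (its area is proportional to the total exposure within the group). The first diagonal corresponds to the Poisson model, where the variance should equal the mean. Other covariates can also be considered, such as the car brand.

The age of the driver can also be treated as a categorical variable.

Let us look more carefully at the different age groups.

On the right, we can see the young (inexperienced) drivers. That was expected. But some of these classes lie below the first diagonal: the expected frequency is large, but the variance is not. In other words, we can be sure that young drivers have more car accidents. This is not a heterogeneous class. On the contrary, young drivers can be seen as a relatively homogeneous class with a high accident frequency.
Using the original dataset (here, I only use a subset of 50,000 customers), we obtain the following graph:

The circles move down from age 18 to age 25, which shows a clear effect of experience.
It is also possible to treat exposure as a standard variable, that is, to include log-exposure as a regressor instead of an offset, and to check whether its coefficient is actually equal to 1. Without any covariate,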
Call:
glm(formula = Y ~ log(E), family = poisson("log"))
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.3988  -0.3388  -0.2786  -0.1981  12.9036
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.83045    0.02822 -100.31   <2e-16 ***
log(E)       0.53950    0.02905   18.57   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 12931  on 49999  degrees of freedom
Residual deviance: 12475  on 49998  degrees of freedom
AIC: 16150
Number of Fisher Scoring iterations: 6
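That is, the parameter is clearly strictly smaller than 1. This is not a borderline question of significance either: testing the hypothesis that the coefficient equals 1 rejects it decisively,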
Linear hypothesis test
Hypothesis:
log(E) = 1
Model 1: restricted model
Model 2: Y ~ log(E)
  Res.Df Df  Chisq Pr(>Chisq)
1  49999
2  49998  1 251.19  < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
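The comparison between the usual offset specification and the free-coefficient specification can be reproduced with base R only. This is a sketch on simulated data in which exposure really does act proportionally, so the coefficient should be close to 1, unlike in the article's data. The object names (`fit_offset`, `fit_free`, `p_wald`, `p_lr`) are illustrative, and the Wald and likelihood-ratio tests are computed by hand rather than with whatever function produced the output above.

```r
# Simulated data where claims are proportional to exposure (hypothetical)
set.seed(3)
n <- 20000
E <- runif(n, 0.05, 1)
Y <- rpois(n, 0.07 * E)

# Usual specification: log-exposure as an offset (coefficient fixed at 1)
fit_offset <- glm(Y ~ 1 + offset(log(E)), family = poisson("log"))

# Alternative: log-exposure as a regressor with a free coefficient
fit_free <- glm(Y ~ log(E), family = poisson("log"))

# Wald test of H0: the coefficient of log(E) equals 1
beta <- coef(fit_free)["log(E)"]
se <- sqrt(vcov(fit_free)["log(E)", "log(E)"])
z <- (beta - 1) / se
p_wald <- 2 * pnorm(-abs(z))

# Likelihood-ratio test: the offset model is the free model restricted to beta = 1
lr <- deviance(fit_offset) - deviance(fit_free)
p_lr <- pchisq(lr, df = 1, lower.tail = FALSE)

print(c(beta = unname(beta), p_wald = unname(p_wald), p_lr = p_lr))
```

With an intercept-only offset model, `exp(coef(fit_offset))` reduces to total claims over total exposure, the estimator derived at the start of the article.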
The same happens when covariates are included as well,
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.7114  -0.3200  -0.2637  -0.1896  12.7104
Coefficients:
                             Estimate Std. Error z value Pr(>|z|)
(Intercept)                 -14.07321  181.04892  -0.078 0.938042
log(exposition)               0.56781    0.03029  18.744  < 2e-16 ***
carburantE                   -0.17979    0.04630  -3.883 0.000103 ***
as.factor(ageconducteur)19   12.18354  181.04915   0.067 0.946348
as.factor(ageconducteur)20   12.48752  181.04902   0.069 0.945011
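So assuming that exposure is an exogenous variable here may be too strong an assumption.
Next, we turn to overdispersion when modelling claim frequency. Earlier, I discussed how to compute the empirical variance when exposures differ. But I only used one factor to define the classes. More factors can of course be used, for instance the Cartesian product of several factors,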
Class D A (17,24]  average = 0.06274415  variance = 0.06174966
Class D A (24,40]  average = 0.07271905  variance = 0.07675049
Class D A (40,65]  average = 0.05432262  variance = 0.06556844
Class D A (65,101] average = 0.03026999  variance = 0.02960885
Class D B (17,24]  average = 0.2383109   variance = 0.2442396
Class D B (24,40]  average = 0.06662015  variance = 0.07121064
Class D B (40,65]  average = 0.05551854  variance = 0.05543831
Class D B (65,101] average = 0.0556386   variance = 0.0540786
Class D C (17,24]  average = 0.1524552   variance = 0.1592623
Class D C (24,40]  average = 0.0795852   variance = 0.09091435
Class D C (40,65]  average = 0.07554481  variance = 0.08263404
Class D C (65,101] average = 0.06936605  variance = 0.06684982
Class D D (17,24]  average = 0.1584052   variance = 0.1552583
Class D D (24,40]  average = 0.1079038   variance = 0.121747
Class D D (40,65]  average = 0.06989518  variance = 0.07780811
Class D D (65,101] average = 0.0470501   variance = 0.04575461
Class D E (17,24]  average = 0.2007164   variance = 0.2647663
Class D E (24,40]  average = 0.1121569   variance = 0.1172205
Class D E (40,65]  average = 0.106563    variance = 0.1068348
Class D E (65,101] average = 0.1572701   variance = 0.2126338
Class D F (17,24]  average = 0.2314815   variance = 0.1616788
Class D F (24,40]  average = 0.1690485   variance = 0.1443094
Class D F (40,65]  average = 0.08496827  variance = 0.07914423
Class D F (65,101] average = 0.1547769   variance = 0.1442915
Class E A (17,24]  average = 0.1275345   variance = 0.1171678
Class E A (24,40]  average = 0.04523504  variance = 0.04741449
Class E A (40,65]  average = 0.05402834  variance = 0.05427582
Class E A (65,101] average = 0.04176129  variance = 0.04539265
Class E B (17,24]  average = 0.1114712   variance = 0.1059153
Class E B (24,40]  average = 0.04211314  variance = 0.04068724
Class E B (40,65]  average = 0.04987117  variance = 0.05096601
Class E B (65,101] average = 0.03123003  variance = 0.03041192
Class E C (17,24]  average = 0.1256302   variance = 0.1310862
Class E C (24,40]  average = 0.05118006  variance = 0.05122782
Class E C (40,65]  average = 0.05394576  variance = 0.05594004
Class E C (65,101] average = 0.04570239  variance = 0.04422991
Class E D (17,24]  average = 0.1777142   variance = 0.1917696
Class E D (24,40]  average = 0.06293331  variance = 0.06738658
Class E D (40,65]  average = 0.08532688  variance = 0.2378571
Class E D (65,101] average = 0.05442916  variance = 0.05724951
Class E E (17,24]  average = 0.1826558   variance = 0.2085505
Class E E (24,40]  average = 0.07804062  variance = 0.09637156
Class E E (40,65]  average = 0.08191469  variance = 0.08791804
Class E E (65,101] average = 0.1017367   variance = 0.1141004
Class E F (17,24]  average = 0           variance = 0
Class E F (24,40]  average = 0.07731177  variance = 0.07415932
Class E F (40,65]  average = 0.1081142   variance = 0.1074324
Class E F (65,101] average = 0.09071118  variance = 0.1170159
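Classes built from a Cartesian product of factors can be obtained with `interaction()`. The sketch below uses simulated data. The fuel factor is named `carburant` as in the regression output, while the name `zone` for the A to F factor, the helper `class_stats()` and the simulated frequencies are assumptions made for illustration. The age bands reuse the cut points visible in the class labels.

```r
# Simulated portfolio with three rating factors (hypothetical data)
set.seed(4)
n <- 30000
carburant <- factor(sample(c("D", "E"), n, replace = TRUE))
zone <- factor(sample(LETTERS[1:6], n, replace = TRUE))  # hypothetical factor name
age <- sample(18:100, n, replace = TRUE)
E <- runif(n, 0.05, 1)
Y <- rpois(n, 0.07 * E)

# Age bands as in the class labels, then the Cartesian product of all factors
age_class <- cut(age, breaks = c(17, 24, 40, 65, 101))
cls <- interaction(carburant, zone, age_class, sep = " ", drop = TRUE)

# Exposure, annualized mean and variance within each class
class_stats <- function(Y, E, g) {
  ve <- tapply(E, g, sum)
  vm <- tapply(Y, g, sum) / ve
  vv <- tapply((Y - vm[as.character(g)] * E)^2, g, sum) / ve
  data.frame(class = names(ve), exposure = as.numeric(ve),
             average = as.numeric(vm), variance = as.numeric(vv))
}

stats <- class_stats(Y, E, cls)
print(head(stats))
```

The columns `average`, `variance` and `exposure` play the roles of `vm`, `vv` and `ve` in the plotting code.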
Again, the variance can be plotted against the mean,
> plot(vm,vv,cex=sqrt(ve),col="grey",pch=19,
+ xlab="Empirical average",ylab="Empirical variance")
> points(vm,vv,cex=sqrt(ve))
> abline(a=0,b=1,lty=2)

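An alternative is to use a tree. The tree can be obtained from some other variable, but it should be fairly close to our ideal model. Here, I used the whole database (more than 600,000 rows).
The tree is the following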
> plot(T)
> text(T)
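The article does not show how the tree `T` was fitted. One way to grow a tree for claim counts with exposure is the Poisson method of the `rpart` package, sketched below on simulated data. The choice of `rpart`, the covariates, the complexity parameter `cp = 0.001` and the simulated frequencies are all assumptions made for illustration, not the author's settings.

```r
library(rpart)  # assumption: rpart is installed (it ships with standard R distributions)

# Simulated data: young drivers claim far more often (hypothetical frequencies)
set.seed(5)
n <- 20000
d <- data.frame(
  age = sample(18:100, n, replace = TRUE),
  carburant = factor(sample(c("D", "E"), n, replace = TRUE)),
  E = runif(n, 0.05, 1)
)
d$Y <- rpois(n, ifelse(d$age < 25, 0.25, 0.07) * d$E)

# method = "poisson" takes a two-column response: observation time, then event count
fit <- rpart(cbind(E, Y) ~ age + carburant, data = d,
             method = "poisson", control = rpart.control(cp = 0.001))

# Each leaf defines a class, labelled by its node number in the tree
leaf <- factor(rownames(fit$frame)[fit$where])
ve <- tapply(d$E, leaf, sum)
vm <- tapply(d$Y, leaf, sum) / ve
vv <- tapply((d$Y - vm[as.character(leaf)] * d$E)^2, leaf, sum) / ve

print(data.frame(class = names(ve), average = as.numeric(vm),
                 variance = as.numeric(vv)))
```

Calling `plot(fit)` followed by `text(fit)` draws the fitted tree in the same way as the two commands above.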

Each branch of the tree now defines a class, and each of these classes is supposed to be homogeneous.
Class  6 average =   0.04010406  variance = 0.04424163
Class  8 average =   0.05191127  variance = 0.05948133
Class  9 average =   0.07442635  variance = 0.08694552
Class  10 average =  0.4143646   variance = 0.4494002
Class  11 average =  0.1917445   variance = 0.1744355
Class  15 average =  0.04754595  variance = 0.05389675
Class  20 average =  0.08129577  variance = 0.0906322
Class  22 average =  0.05813419  variance = 0.07089811
Class  23 average =  0.06123807  variance = 0.07010473
Class  24 average =  0.06707301  variance = 0.07270995
Class  25 average =  0.3164557   variance = 0.2026906
Class  26 average =  0.08705041  variance = 0.108456
Class  27 average =  0.06705214  variance = 0.07174673
Class  30 average =  0.05292652  variance = 0.06127301
Class  31 average =  0.07195285  variance = 0.08620593
Class  32 average =  0.08133722  variance = 0.08960552
Class  34 average =  0.1831559   variance = 0.2010849
Class  39 average =  0.06173885  variance = 0.06573939
Class  41 average =  0.07089419  variance = 0.07102932
Class  44 average =  0.09426152  variance = 0.1032255
Class  47 average =  0.03641669  variance = 0.03869702
Class  49 average =  0.0506601   variance = 0.05089276
Class  50 average =  0.06373107  variance = 0.06536792
Class  51 average =  0.06762947  variance = 0.06926191
Class  56 average =  0.06771764  variance = 0.07122379
Class  57 average =  0.04949142  variance = 0.05086885
Class  58 average =  0.2459016   variance = 0.2451116
Class  59 average =  0.05996851  variance = 0.0615773
Class  61 average =  0.07458053  variance = 0.0818608
Class  63 average =  0.06203737  variance = 0.06249892
Class  64 average =  0.07321618  variance = 0.07603106
Class  66 average =  0.07332127  variance = 0.07262425
Class  68 average =  0.07478147  variance = 0.07884597
Class  70 average =  0.06566728  variance = 0.06749411
Class  71 average =  0.09159605  variance = 0.09434413
Class  75 average =  0.03228927  variance = 0.03403198
Class  76 average =  0.04630848  variance = 0.04861813
Class  78 average =  0.05342351  variance = 0.05626653
Class  79 average =  0.05778622  variance = 0.05987139
Class  80 average =  0.0374993   variance = 0.0385351
Class  83 average =  0.06721729  variance = 0.07295168
Class  86 average =  0.09888492  variance = 0.1131409
Class  87 average =  0.1019186   variance = 0.2051122
Class  88 average =  0.05281703  variance = 0.0635244
Class  91 average =  0.08332136  variance = 0.09067632
Class  96 average =  0.07682093  variance = 0.08144446
Class  97 average =  0.0792268   variance = 0.08092019
Class  99 average =  0.1019089   variance = 0.1072126
Class  100 average = 0.1018262   variance = 0.1081117
Class  101 average = 0.1106647   variance = 0.1151819
Class  103 average = 0.08147644  variance = 0.08411685
Class  104 average = 0.06456508  variance = 0.06801061
Class  107 average = 0.1197225   variance = 0.1250056
Class  108 average = 0.0924619   variance = 0.09845582
Class  109 average = 0.1198932   variance = 0.1209162
Plotting the empirical variance against the empirical mean number of claims, we get


Here, we can identify the classes with remaining heterogeneity.
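One simple way to spot such classes numerically is to look at the ratio of variance to mean, which should be close to 1 under the Poisson model. The sketch below uses five of the tree classes listed above, with the figures copied from the output. The threshold of 1.5 is an arbitrary cut-off chosen for this illustration, not a formal test.

```r
# A few tree classes, with averages and variances copied from the output above
classes <- data.frame(
  class    = c("6", "10", "25", "41", "87"),
  average  = c(0.04010406, 0.4143646, 0.3164557, 0.07089419, 0.1019186),
  variance = c(0.04424163, 0.4494002, 0.2026906, 0.07102932, 0.2051122),
  stringsAsFactors = FALSE
)

# Ratio of variance to mean: about 1 under the Poisson model
classes$ratio <- classes$variance / classes$average

threshold <- 1.5  # arbitrary cut-off for illustration only
flagged <- classes$class[classes$ratio > threshold]

print(classes)
print(flagged)
```

Among these five, only class 87 has a variance roughly twice its mean. Class 25, with a ratio well below 1, lies under the diagonal like the young-driver classes discussed earlier.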