R語言實現(xiàn)偏最小二乘回歸法 partial least squares (PLS)回歸

2021-02-25 22:37 作者:拓端tecdat 0人讀過 | 我要投稿

原文鏈接：http://tecdat.cn/?p=8652

偏最小二乘回歸是一種回歸形式。? 當(dāng)使用pls時，新的線性組合有助于解釋模型中的自變量和因變量。

在本文中，我們將使用pls在“ Mroz”數(shù)據(jù)集中使用預(yù)測“收入”。??

library(pls);library(Ecdat)

data("Mroz") str(Mroz)

## 'data.frame': ? ?753 obs. of ?18 variables: ## ?$ work ? ? ?: Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ... ## ?$ hoursw ? ?: int ?1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ... ## ?$ child6 ? ?: int ?1 0 1 0 1 0 0 0 0 0 ... ## ?$ child618 ?: int ?0 2 3 3 2 0 2 0 2 2 ... ## ?$ agew ? ? ?: int ?32 30 35 34 31 54 37 54 48 39 ... ## ?$ educw ? ? : int ?12 12 12 12 14 12 16 12 12 12 ... ## ?$ hearnw ? ?: num ?3.35 1.39 4.55 1.1 4.59 ... ## ?$ wagew ? ? : num ?2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ... ## ?$ hoursh ? ?: int ?2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ... ## ?$ ageh ? ? ?: int ?34 30 40 53 32 57 37 53 52 43 ... ## ?$ educh ? ? : int ?12 9 12 10 12 11 12 8 4 12 ... ## ?$ wageh ? ? : num ?4.03 8.44 3.58 3.54 10 ... ## ?$ income ? ?: int ?16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ... ## ?$ educwm ? ?: int ?12 7 12 7 12 14 14 3 7 7 ... ## ?$ educwf ? ?: int ?7 7 7 7 14 7 7 3 7 7 ... ## ?$ unemprate : num ?5 11 5 5 9.5 7.5 5 5 3 5 ... ## ?$ city ? ? ?: Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 1 1 1 1 ... ## ?$ experience: int ?14 5 15 6 7 33 11 35 24 21 ...

首先，我們必須通過將數(shù)據(jù)分為訓(xùn)練和測試集來準(zhǔn)備數(shù)據(jù)。?

set.seed(777) train<-sample(c(T,F),nrow(Mroz),rep=T) #50/50 train/test split test<-(!train)

在上面的代碼中，我們設(shè)置“ set.seed函數(shù)”以確保重復(fù)。然后，我們創(chuàng)建了“ train”對象。?

現(xiàn)在，我們使用 “ plsr”函數(shù)創(chuàng)建模型，然后使用“ summary”函數(shù)檢查結(jié)果。我們? 使用交叉驗證。下面是代碼。

## Data: ? ?X dimension: 392 17 ## ?Y dimension: 392 1 ## Fit method: kernelpls ## Number of components considered: 17 ## ## VALIDATION: RMSEP ## Cross-validated using 10 random segments. ## ? ? ? ?(Intercept) ?1 comps ?2 comps ?3 comps ?4 comps ?5 comps ?6 comps ## CV ? ? ? ? ? 11218 ? ? 8121 ? ? 6701 ? ? 6127 ? ? 5952 ? ? 5886 ? ? 5857 ## adjCV ? ? ? ?11218 ? ? 8114 ? ? 6683 ? ? 6108 ? ? 5941 ? ? 5872 ? ? 5842 ## ? ? ? ?7 comps ?8 comps ?9 comps ?10 comps ?11 comps ?12 comps ?13 comps ## CV ? ? ? ?5853 ? ? 5849 ? ? 5854 ? ? ?5853 ? ? ?5853 ? ? ?5852 ? ? ?5852 ## adjCV ? ? 5837 ? ? 5833 ? ? 5837 ? ? ?5836 ? ? ?5836 ? ? ?5835 ? ? ?5835 ## ? ? ? ?14 comps ?15 comps ?16 comps ?17 comps ## CV ? ? ? ? 5852 ? ? ?5852 ? ? ?5852 ? ? ?5852 ## adjCV ? ? ?5835 ? ? ?5835 ? ? ?5835 ? ? ?5835 ## ## TRAINING: % variance explained ## ? ? ? ? 1 comps ?2 comps ?3 comps ?4 comps ?5 comps ?6 comps ?7 comps ## X ? ? ? ? 17.04 ? ?26.64 ? ?37.18 ? ?49.16 ? ?59.63 ? ?64.63 ? ?69.13 ## income ? ?49.26 ? ?66.63 ? ?72.75 ? ?74.16 ? ?74.87 ? ?75.25 ? ?75.44 ## ? ? ? ? 8 comps ?9 comps ?10 comps ?11 comps ?12 comps ?13 comps ?14 comps ## X ? ? ? ? 72.82 ? ?76.06 ? ? 78.59 ? ? 81.79 ? ? 85.52 ? ? 89.55 ? ? 92.14 ## income ? ?75.49 ? ?75.51 ? ? 75.51 ? ? 75.52 ? ? 75.52 ? ? 75.52 ? ? 75.52 ## ? ? ? ? 15 comps ?16 comps ?17 comps ## X ? ? ? ? ?94.88 ? ? 97.62 ? ?100.00 ## income ? ? 75.52 ? ? 75.52 ? ? 75.52

打印輸出包括“驗證”部分中均方根誤差。因為有17個獨立變量，所以有17個組件。可以看到，在分量3或4之后，因變量中解釋的方差幾乎沒有改善。下面是這些結(jié)果圖的代碼。?

我們將使用我們的模型進(jìn)行預(yù)測。?

此后，我們將計算均方誤差。這是通過從測試集的因變量中減去我們的預(yù)測模型的結(jié)果來完成的。然后，我們對這些信息求平方并計算平均值。?

mean((pls.pred-Mroz$income[test])^2)

## [1] 63386682

?我們將使用傳統(tǒng)的最小二乘回歸模型運行數(shù)據(jù)并比較結(jié)果。

## [1] 59432814

最小二乘模型比部分最小二乘模型好一點，但是如果看一下模型，我們會看到幾個不重要的變量。我們將刪除這些，看看結(jié)果如何

summary(lm.fit)

## ## Call: ## lm(formula = income ~ ., data = Mroz, subset = train) ## ## Residuals: ## ? ?Min ? ? 1Q Median ? ? 3Q ? ?Max ## -20131 ?-2923 ?-1065 ? 1670 ?36246 ## ## Coefficients: ## ? ? ? ? ? ? ? Estimate Std. Error t value Pr(>|t|) ? ? ## (Intercept) -1.946e+04 ?3.224e+03 ?-6.036 3.81e-09 *** ## workno ? ? ?-4.823e+03 ?1.037e+03 ?-4.651 4.59e-06 *** ## hoursw ? ? ? 4.255e+00 ?5.517e-01 ? 7.712 1.14e-13 *** ## child6 ? ? ?-6.313e+02 ?6.694e+02 ?-0.943 0.346258 ? ? ## child618 ? ? 4.847e+02 ?2.362e+02 ? 2.052 0.040841 * ? ## agew ? ? ? ? 2.782e+02 ?8.124e+01 ? 3.424 0.000686 *** ## educw ? ? ? ?1.268e+02 ?1.889e+02 ? 0.671 0.502513 ? ? ## hearnw ? ? ? 6.401e+02 ?1.420e+02 ? 4.507 8.79e-06 *** ## wagew ? ? ? ?1.945e+02 ?1.818e+02 ? 1.070 0.285187 ? ? ## hoursh ? ? ? 6.030e+00 ?5.342e-01 ?11.288 ?< 2e-16 *** ## ageh ? ? ? ?-9.433e+01 ?7.720e+01 ?-1.222 0.222488 ? ? ## educh ? ? ? ?1.784e+02 ?1.369e+02 ? 1.303 0.193437 ? ? ## wageh ? ? ? ?2.202e+03 ?8.714e+01 ?25.264 ?< 2e-16 *** ## educwm ? ? ?-4.394e+01 ?1.128e+02 ?-0.390 0.697024 ? ? ## educwf ? ? ? 1.392e+02 ?1.053e+02 ? 1.322 0.186873 ? ? ## unemprate ? -1.657e+02 ?9.780e+01 ?-1.694 0.091055 . ? ## cityyes ? ? -3.475e+02 ?6.686e+02 ?-0.520 0.603496 ? ? ## experience ?-1.229e+02 ?4.490e+01 ?-2.737 0.006488 ** ## --- ## Signif. codes: ?0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 5668 on 374 degrees of freedom ## Multiple R-squared: ?0.7552, Adjusted R-squared: ?0.744 ## F-statistic: 67.85 on 17 and 374 DF, ?p-value: < 2.2e-16lm.pred<-predict(lm.fit,Mroz[test,]) mean((lm.pred-Mroz$income[test])^2)

## [1] 57839715

?誤差降低得更多，這表明最小二乘回歸模型優(yōu)于偏最小二乘模型。此外，偏最小二乘模型很難解釋。因此，最小二乘模型是最受歡迎的模型。

標(biāo)簽：