拓端tecdat | Cluster Validity in R: Determining the Optimal Number of Clusters for the IRIS Dataset, with Visualization

2021-08-02 11:49 Author: 拓端tecdat

Original link: http://tecdat.cn/?p=22879

Original source: 拓端数据部落 WeChat official account

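Dataset overview

This dataset is commonly used for data summaries, visualization, and clustering models. It contains three iris species with 50 samples each, along with several attributes. One species is linearly separable from the other two, but those two are not linearly separable from each other.

The columns of this dataset are:

i> Id
ii> Sepal length (cm)
iii> Sepal width (cm)
iv> Petal length (cm)
v> Petal width (cm)
vi> Species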
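
The structure described above can be checked directly in R. The following is a minimal sketch that assumes R's built-in `iris` data frame, which carries the four measurements and the species but no Id column. The 2.5 cm petal-length threshold is an illustrative cut-off chosen here, not something the original text specifies.

```r
# Built-in iris data: 150 rows, four numeric measurements plus Species
# (the built-in copy has no Id column, unlike the column listing above)
data(iris)
str(iris)

# 50 samples per species
species_counts <- table(iris$Species)
print(species_counts)

# One species (setosa) can be split from the other two with a single
# threshold on petal length; 2.5 cm is an assumed, illustrative cut-off
is_setosa <- iris$Species == "setosa"
setosa_below <- all(iris$Petal.Length[is_setosa] < 2.5)
others_above <- all(iris$Petal.Length[!is_setosa] > 2.5)
print(c(setosa_below = setosa_below, others_above = others_above))
```

A single threshold on one column is the simplest kind of linear separator, which is why setosa tends to form its own cluster in every method that follows.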

Let's visualize this dataset and cluster it with kmeans.

Basic visualization

IRIS data: basic visualization before clustering

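library(ggplot2)  # the source elides the plotted columns; the petal columns below are an assumed choice
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) + geom_point()

ggplot(iris, aes(x = Petal.Length, fill = Species)) + geom_density(alpha = 0.25)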

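Volcano plot

library(ggplot2)  # mirrored density per species; the x columns are elided in the source, so the sepal columns are assumed
ggplot(iris, aes(x = Sepal.Length)) + stat_density(aes(ymax = after_stat(density), ymin = -after_stat(density), fill = Species), geom = "ribbon", position = "identity") + facet_grid(. ~ Species) + coord_flip()

ggplot(iris, aes(x = Sepal.Width)) + stat_density(aes(ymax = after_stat(density), ymin = -after_stat(density), fill = Species), geom = "ribbon", position = "identity") + facet_grid(. ~ Species) + coord_flip()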

Clustering the data :: Method 1

# Run kmeans in a loop for 1 to 15 clusters
Data <- iris[, 1:4]              # numeric columns only (assumed; the source does not show how Data is built)
totalwSS <- numeric(15)
for (i in 1:15) {
  totalwSS[i] <- kmeans(Data, centers = i)$tot.withinss
}
# Scree plot: use plot() to draw total_wss against the number of clusters
plot(x = 1:15,                   # x = number of clusters, 1 to 15
     totalwSS,                   # total_wss for each number of clusters
     type = "b")                 # draw both points and the lines joining them
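
To make the quantity on the y-axis of the scree plot concrete, the sketch below recomputes the total within-cluster sum of squares by hand and compares it with the value `kmeans` reports. It assumes clustering on the four numeric columns of the built-in `iris` data; the seed, `nstart = 25`, and the choice of 3 clusters are illustrative settings, not taken from the original.

```r
Data <- iris[, 1:4]          # assumption: cluster on the four numeric columns
set.seed(42)                 # kmeans starts from random centres
fit <- kmeans(Data, centers = 3, nstart = 25)

# Look up the centre of the cluster each point belongs to, then sum the
# squared distances from every point to its own centre
centre_of_point <- fit$centers[fit$cluster, ]
wss_by_hand <- sum((as.matrix(Data) - centre_of_point)^2)

print(c(by_hand = wss_by_hand, from_kmeans = fit$tot.withinss))
```

Because this sum can only shrink as more centres are added, the scree plot is read for the point where the decrease flattens out rather than for its minimum.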

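Clustering the data :: Method 2

Using cluster validity indices

library(NbClust)
# Set the margins as: c(bottom, left, top, right)
par(mar = c(2,2,2,2))
# Assess how suitable each number of clusters is according to a set of indices.
# By default it checks from 2 to 15 clusters; this takes some time.
nb <- NbClust(Data, method = "kmeans")  # call not shown in the source; reconstructed here, with nb an assumed name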

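Hubert index

The Hubert index is a graphical method for determining the number of clusters.
In the Hubert index plot we look for a significant knee, corresponding to a significant increase in the value of the measure, that is, a significant peak in the plot of the Hubert index second differences.

D index

In the D index plot we look for a significant knee (a significant peak in the D index second-differences plot), corresponding to a significant increase in the value of the measure.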
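
The idea of reading a knee from second differences can be sketched in base R. The example below does not compute the Hubert or D index; as a labelled stand-in it uses the total within-cluster sum of squares from `kmeans`, and NbClust's own second-difference computation may differ in detail. The range 1 to 10, the seed, and `nstart = 25` are assumptions made for the illustration.

```r
Data <- iris[, 1:4]          # assumption: the four numeric iris columns
set.seed(42)
ks <- 1:10
# Stand-in measure (not the Hubert or D index): total within-cluster SS
wss <- sapply(ks, function(k) kmeans(Data, centers = k, nstart = 25)$tot.withinss)

# Second difference at k: wss[k-1] - 2 * wss[k] + wss[k+1], defined for k = 2..9
second_diff <- diff(wss, differences = 2)
names(second_diff) <- ks[2:(length(ks) - 1)]

# The knee is where the second difference peaks
knee_k <- as.integer(names(which.max(second_diff)))
print(round(second_diff, 1))
print(knee_k)
```

A large second difference marks the place where the curve bends most sharply, which is what the eye picks out as the knee in the index plot.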

##
## *******************************************************************
## * Among all indices:
## * 10 proposed 2 as the best number of clusters
## * 8 proposed 3 as the best number of clusters
## * 2 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
##                    ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************

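Plot a histogram showing how the indices voted on the number of clusters.
Of the 26 indices, 24 cast a vote in the summary above (the Hubert index and D index are graphical methods and do not vote). The largest group (10) votes for 2 clusters, 8 vote for 3 clusters, and the remaining 6 (24-10-8) vote for other numbers of clusters.
The histogram uses breaks = 15 because the algorithm checks 2 to 15 clusters.

hist(nb$Best.nc[1, ], breaks = 15)  # nb is the NbClust result (assumed name); row 1 of Best.nc holds each index's proposed number of clusters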
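
The majority rule in the summary can be reproduced without rerunning NbClust. This sketch copies the vote counts from the printed output above into two vectors (the variable names are chosen here for illustration) and tallies them.

```r
# Vote counts copied from the NbClust summary printed above
proposed_k <- c(2, 3, 4, 5, 8, 14, 15)
n_votes    <- c(10, 8, 2, 1, 1, 1, 1)

votes <- rep(proposed_k, times = n_votes)   # one entry per voting index
tally <- table(votes)
winner <- as.integer(names(tally)[which.max(tally)])

print(tally)
print(winner)
barplot(tally, xlab = "Number of clusters", ylab = "Number of indices")
```

Note how close the vote is between 2 and 3 clusters, which fits the earlier remark that two of the three species are not linearly separable from each other.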

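Clustering the data :: Method 3

The Calinski-Harabasz criterion is akin to the ratio of between-cluster variance to within-cluster variance.

library(vegan)  # provides cascadeKM()
modelData <- cascadeKM(Data, 1, 10)   # test 1 to 10 clusters
# sortg = TRUE: reorder the iris objects (rows) as a function of their group membership
# Group membership is shown by colour in the heat map
# Sorting produces a chart that is easier to interpret
# Two plots: a heat map, and the number of clusters against the criterion value (= BC/WC)
plot(modelData, sortg = TRUE)

modelData$results[2,]   # BC/WC value for each number of clusters

# Which of these values is the largest? BC/WC should be as large as possible
which.max(modelData$results[2,])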
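
The BC/WC ratio used above can also be computed by hand from a `kmeans` fit. This is a minimal sketch assuming the usual Calinski-Harabasz form, (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k)), and the four numeric `iris` columns; it is not vegan's implementation, and the seed, `nstart = 25`, and the range 2 to 10 are illustrative choices.

```r
Data <- iris[, 1:4]          # assumption: the four numeric iris columns
set.seed(42)
n <- nrow(Data)

# Calinski-Harabasz ratio for a k-cluster kmeans solution
calinski <- function(k) {
  fit <- kmeans(Data, centers = k, nstart = 25)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

ks <- 2:10                   # the ratio is undefined for k = 1
ch <- sapply(ks, calinski)
names(ch) <- ks
print(round(ch, 1))

best_k <- ks[which.max(ch)]
print(best_k)
```

Dividing by the degrees of freedom is what keeps the criterion from simply rewarding more clusters, unlike the raw within-cluster sum of squares in Method 1.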

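Clustering the data with silhouette plots :: Method 4

First try 2 clusters

library(cluster)             # provides silhouette()
cl <- kmeans(Data, 2)        # step not shown in the source; cl is an assumed name
# Compute the distance matrix between the rows of the data matrix, using Euclidean distance
dis <- dist(Data)
# Get the silhouette coefficients and plot them
plot(silhouette(cl$cluster, dis))

Try 8 clusters

cl <- kmeans(Data, 8)        # step not shown in the source; cl is an assumed name
# Compute the distance matrix between the rows of the data matrix, using Euclidean distance
dis <- dist(Data)
# Get the silhouette coefficients and plot them
plot(silhouette(cl$cluster, dis))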
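
The two silhouette plots can be summarized by a single number each, the average silhouette width. The sketch below compares 2 and 8 clusters on the four numeric `iris` columns using `silhouette()` from the cluster package; the helper name `avg_sil_width`, the seed, and `nstart = 25` are assumptions made for the illustration.

```r
library(cluster)             # provides silhouette()

Data <- iris[, 1:4]          # assumption: the four numeric iris columns
dis <- dist(Data)            # Euclidean distances between rows
set.seed(42)

# Average silhouette width of a k-cluster kmeans solution
avg_sil_width <- function(k) {
  fit <- kmeans(Data, centers = k, nstart = 25)
  sil <- silhouette(fit$cluster, dis)
  mean(sil[, "sil_width"])
}

avg2 <- avg_sil_width(2)
avg8 <- avg_sil_width(8)
print(c(k2 = avg2, k8 = avg8))
```

Silhouette widths lie between -1 and 1, and a higher average indicates that points sit closer to their own cluster than to the nearest neighbouring one.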

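Assessing clustering tendency

Compute the Hopkins statistic for iris and for a random dataset

library(factoextra)   # get_clust_tendency(); package and call are assumed, the source shows only hopkins_stat
library(clustertend)  # hopkins(); package assumed, the source does not name it
# 1. Given a numeric vector or a data frame column, generate uniform random numbers between its minimum and maximum
genx <- function(x) runif(length(x), min(x), max(x))
# 2. Generate a random dataset by applying the function to each column
random_df <- apply(iris[, -5], 2, genx)
# 3. Standardize both datasets
iris_scaled <- scale(iris[, -5])   # defaults: center = TRUE, scale = TRUE; Species is dropped because scale() needs numeric input
random_scaled <- scale(random_df)
# 4. Compute the Hopkins statistic for the iris dataset (n is an assumed sample size)
get_clust_tendency(iris_scaled, n = nrow(iris_scaled) - 1, graph = FALSE)$hopkins_stat

# It can also be computed with the hopkins() function
hopkins(iris_scaled, n = nrow(iris_scaled) - 1)

# 5. Compute the Hopkins statistic for the random dataset
get_clust_tendency(random_scaled, n = nrow(random_scaled) - 1, graph = FALSE)$hopkins_stat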
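
For readers who want to see what the statistic measures, here is a base-R sketch of one common form of the Hopkins statistic, H = sum(u) / (sum(u) + sum(w)). The function name `hopkins_sketch`, the sample size `m = 50`, and the seed are assumptions for illustration. Conventions differ between implementations (some report the complement 1 - H), so the values need not match what a given package returns.

```r
# u: distance from each of m uniform random points (drawn inside the
#    bounding box of the data) to its nearest data point
# w: distance from each of m sampled data points to its nearest other data point
hopkins_sketch <- function(X, m) {
  X <- as.matrix(X)
  nearest <- function(p, pts) min(sqrt(rowSums(sweep(pts, 2, p)^2)))
  U <- apply(X, 2, function(col) runif(m, min(col), max(col)))
  idx <- sample(nrow(X), m)
  u <- sapply(seq_len(m), function(i) nearest(U[i, ], X))
  w <- sapply(idx, function(i) nearest(X[i, ], X[-i, , drop = FALSE]))
  sum(u) / (sum(u) + sum(w))
}

set.seed(123)
genx <- function(x) runif(length(x), min(x), max(x))
iris_scaled <- scale(iris[, -5])
random_scaled <- scale(apply(iris[, -5], 2, genx))

h_iris <- hopkins_sketch(iris_scaled, m = 50)
h_random <- hopkins_sketch(random_scaled, m = 50)
print(c(iris = h_iris, random = h_random))
```

Under this form, values near 0.5 suggest no more structure than uniform noise, while values approaching 1 suggest the data are clusterable.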
