Making an Anna Karenina Word Cloud in Python
Original article link: http://tecdat.cn/?p=6852
Key Concepts
Term frequency: the number of times a word appears in a document.
Stop words: words filtered out during processing, such as "website", "的" (the Chinese particle "de"), etc.
Corpus: the collection of all documents to be analyzed.
Chinese word segmentation: splitting a sequence of Chinese characters into individual words.
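The three ideas above (segmentation, stop-word filtering, term frequency) can be sketched without any third-party library. This is a minimal illustration with made-up English tokens standing in for the output of a Chinese segmenter:

```python
from collections import Counter

# A toy "document" already split into tokens (for Chinese text,
# jieba.cut() would produce such a list; these tokens are stand-ins)
tokens = ["anna", "the", "train", "anna", "station", "the", "anna"]

# Stop words: words removed before counting
stopwords = {"the"}

# Term frequency: how often each remaining word appears
freq = Counter(t for t in tokens if t not in stopwords)
print(freq.most_common(2))  # [('anna', 3), ('train', 1)]
```

The resulting {word: count} mapping is exactly the shape a word-cloud library consumes later on.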
Third-Party Libraries Used
jieba: jieba.cut(content) segments the sentence in content into words.
pandas: pandas.DataFrame() builds a DataFrame object; pandas.DataFrame.groupby() performs grouped counting, in the pattern pandas.DataFrame.groupby(by=column_list)[stat_columns].agg({name: func}).
wordcloud: the Python library for building word clouds; install it yourself (e.g. via pip).
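The groupby counting pattern named above can be tried on a tiny hypothetical segment list (the words here are invented for illustration). Note that modern pandas has removed the dict form of .agg() used later in this post; .size() with a named column is the current idiom:

```python
import pandas as pd

# Hypothetical segment list standing in for jieba output
segmentDF = pd.DataFrame({'segment': ['愛', '火車', '愛', '安娜', '愛']})

# Group by word and count occurrences of each
segStat = (segmentDF.groupby('segment')
                    .size()
                    .reset_index(name='count')
                    .sort_values('count', ascending=False))
print(segStat)
```

The most frequent word ('愛', 3 occurrences) ends up in the first row after sorting.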
Word Cloud Implementation

# coding=utf-8
import os
import jieba
import codecs
import pandas as pd
import numpy as np
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt

# Path to the data directory (placeholder from the original post)
basefile = '<data storage path>'

# Load the corpus
f_in = codecs.open(basefile + 'an.txt', 'r', 'utf-8')
content = f_in.read()
f_in.close()

# Segment the text with jieba, keeping only words longer than one character
segments = []
segs = jieba.cut(content)
for seg in segs:
    if len(seg) > 1:
        segments.append(seg)

# Build a DataFrame of segments
segmentDF = pd.DataFrame({'segment': segments})

# Group by word and count; the original post's dict form of .agg()
# is removed in modern pandas, so use .size() instead
segStat = (segmentDF.groupby('segment')
                    .size()
                    .reset_index(name='計數(shù)')
                    .sort_values(by='計數(shù)', ascending=False))

# Load the stop-word list
stopwords = pd.read_csv("./StopwordsCN.txt", encoding='utf8', index_col=False)

# Remove stop words: keep only segments NOT in the stop-word list
fSegStat = segStat[~segStat.segment.isin(stopwords.stopword)]

# Build the word cloud
wordcloud = WordCloud(
    font_path='./simhei.ttf',   # font that can render Chinese characters
    background_color="black",   # background color of the cloud
)
words = fSegStat.set_index('segment').to_dict()
wordcloud.fit_words(words['計數(shù)'])
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
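The set_index('segment').to_dict() step above produces a nested dict keyed by column name, which is why the code then indexes it with the count column's name. A tiny sketch with invented counts:

```python
import pandas as pd

# Made-up frequency table mimicking fSegStat's shape
fSegStat = pd.DataFrame({'segment': ['安娜', '火車'], '計數(shù)': [120, 45]})

# to_dict() keys the outer dict by column name, so words['計數(shù)']
# is the {word: frequency} mapping that WordCloud.fit_words expects
words = fSegStat.set_index('segment').to_dict()
print(words['計數(shù)'])  # {'安娜': 120, '火車': 45}
```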
Result

[Image: AnnaKarenina word cloud]
Word Cloud Styling

# scipy.misc.imread was removed in SciPy 1.2+; imageio is a drop-in replacement
from imageio import imread

# Read the background image used as a mask
bimg = imread(basefile + 'An.png')

wordcloud = WordCloud(background_color="white", mask=bimg, font_path='./simhei.ttf')
wordcloud = wordcloud.fit_words(words['計數(shù)'])

# Set the figure size
plt.figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')

# Sample the colors of the background image
bimgColors = ImageColorGenerator(bimg)
plt.axis("off")

# Recolor the word cloud to match the image, then display it
plt.imshow(wordcloud.recolor(color_func=bimgColors))
plt.show()