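Generating Word Clouds with WordCloud in Python for Natural Language Processing
Original link: http://tecdat.cn/?p=8585
Learn how to use WordCloud in Python to perform exploratory data analysis for natural language processing.
What is a WordCloud?
You have probably seen a cloud filled with words of different sizes, where the size of each word represents its frequency or importance. This is called a tag cloud or word cloud. In this tutorial, you will learn how to create your own WordCloud in Python and customize it as needed.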
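Prerequisites
The numpy library is one of the most popular and useful libraries for working with multidimensional arrays and matrices. It is also used together with the Pandas library to perform data analysis.
Installing wordcloud can be a little tricky. If you only need it to draw a basic word cloud, pip install wordcloud or conda install -c conda-forge wordcloud is enough.
Alternatively, you can install it from source: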
git clone https://github.com/amueller/word_cloud.git
cd word_cloud
pip install .
Dataset:
First, load all the required libraries:
import numpy as np
import pandas as pd
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
Load the DataFrame. Note that index_col=0 tells pandas to use the first column as the row names (index) instead of reading it in as a separate column.
# Load in the dataframe
df = pd.read_csv("data/winemag-data-130k-v2.csv", index_col=0)
# Looking at first 5 rows of the dataset
df.head()
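To see what index_col=0 changes, here is a minimal sketch that reads the same small CSV twice. The CSV text and the names with_index and without_index are made up for illustration and are not part of the wine dataset.

```python
import io

import pandas as pd

# A tiny stand-in CSV whose first column holds the row numbers (illustrative data).
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n"

# With index_col=0 the first column becomes the index.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without it, pandas keeps that column as data and adds a fresh default index.
without_index = pd.read_csv(io.StringIO(csv_text))

print(list(with_index.columns))
print(list(without_index.columns))
```

In the second case the unnamed first column shows up as an extra data column, which is usually not what you want for a file that was saved together with its index.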
Print some basic information about the dataset:
print("There are {} observations and {} features in this dataset. \n".format(df.shape[0],df.shape[1]))
print("There are {} types of wine in this dataset such as {}... \n".format(len(df.variety.unique()),
", ".join(df.variety.unique()[0:5])))
print("There are {} countries producing wine in this dataset such as {}... \n".format(len(df.country.unique()),
", ".join(df.country.unique()[0:5])))
There are 129971 observations and 13 features in this dataset.
There are 708 types of wine in this dataset such as White Blend, Portuguese Red, Pinot Gris, Riesling, Pinot Noir...
There are 44 countries producing wine in this dataset such as Italy, Portugal, US, Spain, France...
df[["country", "description","points"]].head()
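    country                                        description  points
0     Italy  Aromas include tropical fruit, broom, brimston...      87
1  Portugal  This is ripe and fruity, a wine that is smooth...      87
2        US  Tart and snappy, the flavors of lime flesh and...      87
3        US  Pineapple rind, lemon pith and orange blossom ...      87
4        US  Much like the regular bottling from 2012, this...      87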
Use groupby() to compute summary statistics.
With the wine dataset, you can group by country and look at the points and prices for every country.
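The grouping step might look roughly like the sketch below. It uses a handful of made-up rows instead of the real file, and the variable names country, summary and top_by_points are assumptions for illustration. The numeric columns are selected explicitly because recent pandas versions refuse to average text columns.

```python
import pandas as pd

# Illustrative rows only; the real dataset has about 130k reviews.
df = pd.DataFrame({
    "country": ["England", "England", "India", "US", "US"],
    "points": [92, 91, 90, 87, 88],
    "price": [50.0, 55.0, 13.0, 20.0, 30.0],
})

# Group the rows by country.
country = df.groupby("country")

# Summary statistics (count, mean, std, quartiles, ...) per country.
summary = country[["points", "price"]].describe()

# Average points and price per country, highest average points first.
top_by_points = (
    country[["points", "price"]]
    .mean()
    .sort_values(by="points", ascending=False)
    .head()
)
print(top_by_points)
```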
Selecting the five countries with the highest average points out of all 44 gives:
            points      price
country
England  91.581081  51.681159
India    90.222222  13.333333
Austria  90.101345  30.762772
Germany  89.851732  42.257547
Canada   89.369650  35.712598
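You can plot the number of wines per country using the plot method of a Pandas DataFrame together with Matplotlib.
df.groupby("country").size().sort_values(ascending=False).plot.bar(figsize=(15, 10))
plt.ylabel("Number of Wines")
plt.show()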
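Among the 44 wine-producing countries, the US has more than 50,000 wines in the wine review dataset, more than twice as many as the next country in the ranking: France, a country famous for its wine. Italy also produces a lot of quality wine, with nearly 20,000 wines open to review.
Does quantity outweigh quality?
Now, look at a plot of all 44 countries by their highest-rated wine:
df.groupby("country")["points"].max().sort_values(ascending=False).plot.bar(figsize=(15, 10))
plt.ylabel("Highest point of Wines")
plt.show()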
Australia, the US, Portugal, Italy, and France all have wines rated 100 points. Note that Portugal ranks 5th and Australia 9th by number of wines in the dataset, and both countries have fewer than 8,000 wines.
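Setting up a basic WordCloud
Before using any function, the first thing you may want to do is check its docstring and see all the required and optional arguments. To do so, type ?function and run it to get all the information.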
?WordCloud
Init signature: WordCloud(font_path=None, width=400, height=200, margin=2, ranks_only=None, prefer_horizontal=0.9, mask=None, scale=1, color_func=None, max_words=200, min_font_size=4, stopwords=None, random_state=None, background_color='black', max_font_size=None, font_step=1, mode='RGB', relative_scaling=0.5, regexp=None, collocations=True, colormap=None, normalize_plurals=True, contour_width=0, contour_color='black')
Docstring:
Word cloud object for generating and drawing.
Parameters
----------
font_path : string
Font path to the font that will be used (OTF or TTF).
Defaults to DroidSansMono path on a Linux machine. If you are on
another OS or don't have this font; you need to adjust this path.
width : int (default=400)
Width of the canvas.
height : int (default=200)
Height of the canvas.
prefer_horizontal : float (default=0.90)
The ratio of times to try horizontal fitting as opposed to vertical.
If prefer_horizontal < 1, the algorithm will try rotating the word
if it doesn't fit. (There is currently no built-in way to get only
vertical words.)
mask : nd-array or None (default=None)
If not None, gives a binary mask on where to draw words. If mask is not
None, width and height will be ignored, and the shape of mask will be
used instead. All white (#FF or #FFFFFF) entries will be considered
"masked out" while other entries will be free to draw on. [This
changed in the most recent version!]
contour_width: float (default=0)
If mask is not None and contour_width > 0, draw the mask contour.
contour_color: color value (default="black")
Mask contour color.
scale : float (default=1)
Scaling between computation and drawing. For large word-cloud images,
using scale instead of larger canvas size is significantly faster, but
might lead to a coarser fit for the words.
min_font_size : int (default=4)
Smallest font size to use. Will stop when there is no more room in this
size.
font_step : int (default=1)
Step size for the font. font_step > 1 might speed up computation but
give a worse fit.
max_words : number (default=200)
The maximum number of words.
stopwords : set of strings or None
The words that will be eliminated. If None, the build-in STOPWORDS
list will be used.
background_color : color value (default="black")
Background color for the word cloud image.
max_font_size : int or None (default=None)
Maximum font size for the largest word. If None, the height of the image is
used.
mode : string (default="RGB")
Transparent background will be generated when mode is "RGBA" and
background_color is None.
relative_scaling : float (default=.5)
Importance of relative word frequencies for font-size. With
relative_scaling=0, only word-ranks are considered. With
relative_scaling=1, a word that is twice as frequent will have twice
the size. If you want to consider the word frequencies and not only
their rank, relative_scaling around .5 often looks good.
.. versionchanged: 2.0
Default is now 0.5.
color_func : callable, default=None
Callable with parameters word, font_size, position, orientation,
font_path, random_state that returns a PIL color for each word.
Overwrites "colormap".
See colormap for specifying a matplotlib colormap instead.
regexp : string or None (optional)
Regular expression to split the input text into tokens in process_text.
If None is specified, ``r"\w[\w']+"`` is used.
collocations : bool, default=True
Whether to include collocations (bigrams) of two words.
.. versionadded: 2.0
colormap : string or matplotlib colormap, default="viridis"
Matplotlib colormap to randomly draw colors from for each word.
Ignored if "color_func" is specified.
.. versionadded: 2.0
normalize_plurals : bool, default=True
Whether to remove trailing 's' from words. If True and a word
appears with and without a trailing 's', the one with trailing 's'
is removed and its counts are added to the version without
trailing 's' -- unless the word ends with 'ss'.
Attributes
----------
``words_`` : dict of string to float
Word tokens with associated frequency.
.. versionchanged: 2.0
``words_`` is now a dictionary
``layout_`` : list of tuples (string, int, (int, int), int, color))
Encodes the fitted word cloud. Encodes for each word the string, font
size, position, orientation, and color.
Notes
-----
Larger canvases will make the code significantly slower. If you need a
large word cloud, try a lower canvas size, and set the scale parameter.
The algorithm might give more weight to the ranking of the words
then their actual frequencies, depending on the ``max_font_size`` and the
scaling heuristic.
File: c:\intelpython3\lib\site-packages\wordcloud\wordcloud.py
Type: type
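As the signature shows, every argument of the WordCloud constructor is optional; the only thing you have to supply is the text, which is passed to generate().
So let's start with a simple example: using the description of the first observation as the input for the wordcloud. The three steps are:
1. Extract the review (text document)
2. Create and generate a wordcloud image
3. Display the cloud using matplotlib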
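# Start with the first review:
text = df.description[0]
# Create and generate a word cloud image:
wordcloud = WordCloud().generate(text)
# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()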
You can see that the first review says a lot about the wine's aromas.
Now, change some optional arguments of the WordCloud, such as max_font_size, max_words, and background_color.
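# Example values; adjust them to taste:
wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(text)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()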
If you want to save the image, WordCloud provides the to_file function:
# Save the image in the img folder:
wordcloud.to_file("img/first_review.png")
<wordcloud.wordcloud.WordCloud at 0x16f1d704978>
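The saved image looks like this when you load it:
So now you'll combine all the wine reviews into one big text and create a big fat cloud to see which characteristics are most common in these wines.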
text = " ".join(review for review in df.description)
print("There are {} characters in the combination of all reviews.".format(len(text)))
There are 31661073 characters in the combination of all reviews.
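# Drop common English words plus a few wine-specific ones (example list):
stopwords = set(STOPWORDS)
stopwords.update(["drink", "now", "wine", "flavor", "flavors"])
# Generate a word cloud image:
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)
# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()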
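Oh, it seems that black cherry and full-bodied are the most mentioned characteristics, and Cabernet Sauvignon is the most popular variety. This is consistent with the fact that Cabernet Sauvignon "is one of the world's most widely recognized red wine grape varieties".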
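Now, let's pour these words into a cup of wine!
To create a shape for your wordcloud, you first need to find a PNG file to become the mask. Below is a nice one that is available on the internet:
To make sure the mask works, let's take a look at it in numpy array form: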
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=uint8)
First, use the transform_format() function to swap the value 0 for 255.
def transform_format(val):
    if val == 0:
        return 255
    else:
        return val
Then, create a new mask with the same shape as the existing mask and apply transform_format() to every value in every row of the previous mask.
Now you have a new mask in the correct form.
array([[255, 255, 255, ..., 255, 255, 255],
[255, 255, 255, ..., 255, 255, 255],
[255, 255, 255, ..., 255, 255, 255],
...,
[255, 255, 255, ..., 255, 255, 255],
[255, 255, 255, ..., 255, 255, 255],
[255, 255, 255, ..., 255, 255, 255]])
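The transformation described above can be sketched as follows. Because the mask image itself is not available here, the example builds a small synthetic array in which 0 is the background and 1 marks the shape; that layout, and the names wine_mask, transformed and vectorized, are assumptions for illustration. In practice the array would come from something like np.array(Image.open(...)) on your own PNG file. The swap matters because WordCloud treats pure white (255) as "masked out".

```python
import numpy as np

def transform_format(val):
    """Map 0 to 255 (white) and leave every other value unchanged."""
    return 255 if val == 0 else val

# Synthetic stand-in for a loaded mask: 0 is background, 1 marks the shape.
wine_mask = np.zeros((6, 6), dtype=np.uint8)
wine_mask[1:5, 2:4] = 1

# Row-by-row version: apply transform_format to each value of each row.
transformed = np.ndarray(wine_mask.shape, np.int32)
for i in range(len(wine_mask)):
    transformed[i] = list(map(transform_format, wine_mask[i]))

# Equivalent vectorized version, usually much faster on a full-size image.
vectorized = np.where(wine_mask == 0, 255, wine_mask)

print(np.array_equal(transformed, vectorized))
```

The loop mirrors the step-by-step description, while np.where does the same replacement in a single call.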
Great! With the right mask, you can start making a wordcloud in your chosen shape.
# show
plt.figure(figsize=[20,10])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
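You've created a word cloud in the shape of a wine bottle! It seems that wine descriptions most often mention black cherry, fruit flavors, and the full-bodied character of the wine. Now, let's take a closer look at the reviews for each country:
Creating a wordcloud following a color pattern
You can combine all the reviews of the five countries that have the most wines. To find those countries, you can either look at the plot of country versus number of wines above, or use the grouping above to find the number of observations for each country (each group) and sort_values() with the argument ascending=False to sort in descending order.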
country
US          54504
France      22093
Italy       19540
Spain        6645
Portugal     5691
dtype: int64
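So now you have the 5 top countries: US, France, Italy, Spain, and Portugal. You can change the number of countries by passing a different number to head(), for example 10: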
country
US           54504
France       22093
Italy        19540
Spain         6645
Portugal      5691
Chile         4472
Argentina     3800
Austria       3345
Australia     2329
Germany       2165
dtype: int64
For now, 5 countries should be enough.
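The counting and sorting that produces listings like the ones above can be sketched on a few made-up rows. The data and the names counts and top_two are illustrative assumptions.

```python
import pandas as pd

# Illustrative rows only.
df = pd.DataFrame({
    "country": ["US", "France", "US", "Italy", "US", "France"],
    "points": [87, 90, 88, 89, 91, 86],
})

# Number of observations per country, largest group first.
counts = df.groupby("country").size().sort_values(ascending=False)

# head(n) keeps the n largest groups.
top_two = counts.head(2)
print(top_two)
```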
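To get all the reviews for each country, you can concatenate them using the " ".join(list) syntax, which joins all the elements of a list into a single string separated by spaces.
Then, create the wordcloud as shown above. The color mapping is applied just before plotting, using ImageColorGenerator from the wordcloud library.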
# store to file
plt.savefig("img/us_wine.png", format="png")
plt.show()
Looks good! Now, let's repeat this with the reviews from France.
# store to file
plt.savefig("img/fra_wine.png", format="png")
#plt.show()
Note that you should save the image after plotting so that the word cloud has the desired color pattern.
# store to file
plt.savefig("img/ita_wine.png", format="png")
#plt.show()
After Italy comes Spain:
# store to file
plt.savefig("img/spa_wine.png", format="png")
#plt.show()
Finally, Portugal:
# store to file
plt.savefig("img/por_wine.png", format="png")
#plt.show()
The final results are shown in the table below.