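拓端 tecdat | Text Mining, Sentiment Analysis, and Visualization of Harry Potter Novel Text Data in R
Original link: http://tecdat.cn/?p=22984
Original source: 拓端数据部落 (Tuoduan Data Tribe) WeChat official account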
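Once we have cleaned our text and run some basic word-frequency analysis, the next step is to understand the opinions or emotions expressed in it. This is known as sentiment analysis, and this tutorial walks through a simple approach to it.
tl;dr
This tutorial is an introduction to sentiment analysis. It builds on the tidy text tutorial, so if you have not read that one, I suggest starting there. This tutorial covers the following:
Replication requirements: what do you need to reproduce the analysis in this tutorial?
Sentiment datasets: the main datasets used to score sentiment
Basic sentiment analysis: performing a basic sentiment analysis
Comparing sentiments: comparing how the sentiment lexicons differ
Common sentiment words: finding the most common positive and negative words
Sentiment analysis with larger units: analyzing sentiment over larger units of text rather than individual words
Replication requirements
This tutorial uses the harrypotter text data to illustrate text mining and analysis capabilities.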
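library(tidyverse)      # data manipulation and plotting
library(stringr)        # text cleaning and regular expressions
library(tidytext)       # additional text mining functions
library(harrypotter)    # text of the seven novels (assumed source of the data named above)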
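The seven novels we are working with are:
philosophers_stone: Harry Potter and the Philosopher's Stone (1997)
chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)
prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)
goblet_of_fire: Harry Potter and the Goblet of Fire (2000)
order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)
half_blood_prince: Harry Potter and the Half-Blood Prince (2005)
deathly_hallows: Harry Potter and the Deathly Hallows (2007)
Each text is a character vector in which each element represents one chapter. For example, the following shows the raw text of the first two chapters of philosophers_stone.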
philosophers_stone[1:2]
## [1] "THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank
## you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold
## with such nonsense. Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly
## any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck,
## which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a
## small son called Dudley and in their opinion there was no finer boy anywhere. The Dursleys had everything they wanted, but they also
## had a secret, and their greatest fear was that somebody would discover it. They didn't think they could bear it if anyone found out
## about the Potters. Mrs. Potter was Mrs. Dursley's sister, but they hadn'... <truncated>
## [2] "THE VANISHING GLASS Nearly ten years had passed since the Dursleys had woken up to find their nephew on the front step, but
## Privet Drive had hardly changed at all. The sun rose on the same tidy front gardens and lit up the brass number four on the Dursleys'
## front door; it crept into their living room, which was almost exactly the same as it had been on the night when Mr. Dursley had seen
## that fateful news report about the owls. Only the photographs on the mantelpiece really showed how much time had passed. Ten years ago,
## there had been lots of pictures of what looked like a large pink beach ball wearing different-colored bonnets -- but Dudley Dursley was
## no longer a baby, and now the photographs showed a large blond boy riding his first bicycle, on a carousel at the fair, playing a
## computer game with his father, being hugged and kissed by his mother. The room held no sign at all that another boy lived in the house,
## too. Yet Harry Potter was still there, asleep at the moment, but no... <truncated>
Sentiment datasets
A variety of dictionaries exist for evaluating the opinion or emotion in text. The tidytext package bundles three sentiment lexicons in its sentiments dataset.
sentiments
## # A tibble: 23,165 × 4
##           word sentiment lexicon score
##          <chr>     <chr>   <chr> <int>
## 1       abacus     trust     nrc    NA
## 2      abandon      fear     nrc    NA
## 3      abandon  negative     nrc    NA
## 4      abandon   sadness     nrc    NA
## 5    abandoned     anger     nrc    NA
## 6    abandoned      fear     nrc    NA
## 7    abandoned  negative     nrc    NA
## 8    abandoned   sadness     nrc    NA
## 9  abandonment     anger     nrc    NA
## 10 abandonment      fear     nrc    NA
## # ... with 23,155 more rows
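The three lexicons are
AFINN
bing
nrc
All three lexicons are based on unigrams (single words). They contain many English words, each assigned a score for positive or negative sentiment and, in some cases, for emotions such as joy, anger, and sadness. The nrc lexicon categorizes words in a binary ("yes"/"no") fashion into the categories positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
# view the individual lexicons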
get_sentiments("afinn")
get_sentiments("bing")
get_sentiments("nrc")
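To make the difference between the three formats concrete, here is a minimal sketch that scores the same handful of words with each style of lexicon. The three `toy_*` tibbles are hand-made stand-ins; their words and scores are illustrative assumptions, not entries copied from the real lexicons. As far as I know, recent tidytext releases fetch AFINN and nrc through the textdata package and name the AFINN column `value` rather than `score`, so check the column names in your installed version.

```r
library(dplyr)
library(tibble)

# Hand-made stand-ins that mimic the shape of each lexicon.
# Words and scores are illustrative assumptions, not real lexicon entries.
toy_afinn <- tibble(word = c("good", "bad", "terrible"),
                    score = c(3L, -3L, -5L))
toy_bing  <- tibble(word = c("good", "bad"),
                    sentiment = c("positive", "negative"))
toy_nrc   <- tibble(word = c("good", "good", "bad"),
                    sentiment = c("positive", "joy", "negative"))

words <- tibble(word = c("a", "good", "day", "after", "a", "bad", "night"))

# AFINN style: numeric scores can be summed directly.
afinn_total <- words %>%
  inner_join(toy_afinn, by = "word") %>%
  summarise(total = sum(score)) %>%
  pull(total)

# bing style: binary labels are counted per category.
bing_counts <- words %>%
  inner_join(toy_bing, by = "word") %>%
  count(sentiment)

# nrc style: one word may carry several labels, so the join can add rows.
nrc_counts <- words %>%
  inner_join(toy_nrc, by = "word") %>%
  count(sentiment)
```

The practical consequence is that AFINN results are sums of scores, while bing and nrc results are counts of labels, which is why the later comparison handles them differently.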
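Basic sentiment analysis
To perform sentiment analysis we need our data in a tidy format. The following converts all seven Harry Potter novels into a tibble in which each word is arranged by book and by chapter. See the tidy text tutorial for more details.
# set factor levels to keep the books in order of publication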
series$book <- factor(series$book, levels = rev(titles))
series
## # A tibble: 1,089,386 × 3
##                   book chapter    word
## *               <fctr>   <int>   <chr>
## 1  Philosopher's Stone       1     the
## 2  Philosopher's Stone       1     boy
## 3  Philosopher's Stone       1     who
## 4  Philosopher's Stone       1   lived
## 5  Philosopher's Stone       1      mr
## 6  Philosopher's Stone       1     and
## 7  Philosopher's Stone       1     mrs
## 8  Philosopher's Stone       1 dursley
## 9  Philosopher's Stone       1      of
## 10 Philosopher's Stone       1  number
## # ... with 1,089,376 more rows
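The step that builds a tibble shaped like series is not shown above, so here is a rough sketch of one way it could be done. It runs on two tiny made-up "books" so that it is self-contained; `toy_titles`, `toy_books`, and `series_toy` are hypothetical names, and with the real data you would presumably substitute the seven character vectors listed earlier.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Toy stand-ins for the novels: each character vector is one book,
# each element one chapter (made-up text, not the real novels).
toy_titles <- c("Book One", "Book Two")
toy_books  <- list(
  c("The boy lived.", "The glass vanished!"),
  c("A dark night.")
)

series_toy <- tibble()
for (i in seq_along(toy_titles)) {
  clean <- tibble(chapter = seq_along(toy_books[[i]]),
                  text    = toy_books[[i]]) %>%
    unnest_tokens(word, text) %>%      # one lowercase word per row
    mutate(book = toy_titles[i]) %>%
    select(book, everything())
  series_toy <- bind_rows(series_toy, clean)
}

# Factor levels fix the order in which the books are displayed.
series_toy$book <- factor(series_toy$book, levels = toy_titles)
```

The result has the same three columns as the output above (book, chapter, word), with punctuation stripped and words lowercased by unnest_tokens.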
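Now let's use the nrc sentiment dataset to assess the different sentiments represented across the whole Harry Potter series. We can see that negative sentiment has a stronger presence than positive sentiment.
series %>%
  right_join(get_sentiments("nrc")) %>%   # attach nrc labels to each word
  filter(!is.na(sentiment)) %>%
  count(sentiment, sort = TRUE)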
## # A tibble: 10 × 2
##       sentiment     n
##           <chr> <int>
## 1      negative 56579
## 2      positive 38324
## 3       sadness 35866
## 4         anger 32750
## 5         trust 23485
## 6          fear 21544
## 7  anticipation 21123
## 8           joy 14298
## 9       disgust 13381
## 10     surprise 12991
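This gives a good overall sense, but what if we want to understand how sentiment changes over the course of each novel? To do this we perform the following steps:
Create an index that breaks each book up into chunks of 500 words; this is roughly the number of words on two pages, so it lets us assess changes in sentiment even within chapters.
Join the bing lexicon with inner_join to assess the positive or negative sentiment of each word.
Count how many positive and negative words there are in every two pages.
Spread the data so that positive and negative counts sit in separate columns.
Calculate the net sentiment (positive - negative).
Plot the data.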
ggplot(aes(index, sentiment, fill = book)) +
  geom_bar(alpha = 0.5, stat = "identity")
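The six steps can be sketched end to end on toy data. This is only an illustration under stated assumptions: `toy_series` and `toy_bing` are made-up stand-ins, the chunk size is 4 instead of 500 so the result is easy to check by hand, and pivot_wider is used as the current tidyr equivalent of spread.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)

# Hypothetical tidy input: one word per row for a single toy book.
toy_series <- tibble(
  book = "Toy Book",
  word = c("good", "day", "bad", "dark", "night", "great", "happy", "end")
)

# Hand-made stand-in for the bing lexicon (illustrative entries only).
toy_bing <- tibble(
  word      = c("good", "bad", "dark", "great", "happy"),
  sentiment = c("positive", "negative", "negative", "positive", "positive")
)

chunk_size <- 4  # the tutorial uses 500 words; 4 keeps the toy example readable

net_sentiment <- toy_series %>%
  group_by(book) %>%
  mutate(word_count = row_number(),
         index = word_count %/% chunk_size + 1) %>%   # chunk index within each book
  ungroup() %>%
  inner_join(toy_bing, by = "word") %>%                # keep only lexicon words
  count(book, index, sentiment) %>%                    # words per chunk and label
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)              # net sentiment per chunk

# Build the bar chart object; print(p) would draw it.
p <- ggplot(net_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(alpha = 0.5, show.legend = FALSE) +
  facet_wrap(~ book, ncol = 2, scales = "free_x")
```

Note that integer division puts only chunk_size - 1 words in the first chunk; for a 500-word window that off-by-one is negligible.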

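Now we can see how the plot of each novel moves toward more positive or more negative sentiment over the trajectory of the story.
Comparing sentiments
With several sentiment lexicons to choose from, you may want more information on which one suits your purpose. Let's use all three lexicons and examine how they differ for each novel.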
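# AFINN: net sentiment is the sum of word scores within each book and index
# (excerpt; the preceding index and join steps are not shown)
summarise(sentiment = sum(score)) %>%
mutate(method = "AFINN")
# bing and nrc: count positive and negative words per index, then take the difference
# (excerpt; only the nrc join is shown)
bing_and_nrc <-
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive", "negative"))) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)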
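Since the comparison code is only partly visible, the following is a self-contained sketch of the overall idea on toy data: sum the scores for an AFINN-style lexicon, and take positive minus negative counts for bing- and nrc-style lexicons. All `toy_*` objects are hypothetical, the lexicon entries and scores are made up for illustration, and pivot_wider stands in for spread.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical input: one word per row, already assigned to a chunk index.
toy_indexed <- tibble(
  book  = "Toy Book",
  index = c(1, 1, 1, 2, 2, 2),
  word  = c("good", "bad", "great", "dark", "death", "happy")
)

# Hand-made stand-ins for the lexicons (illustrative entries and scores only).
toy_afinn <- tibble(word  = c("good", "bad", "great", "death", "happy"),
                    score = c(3, -3, 3, -2, 3))
toy_bing  <- tibble(word = c("good", "bad", "great", "dark", "death", "happy"),
                    sentiment = c("positive", "negative", "positive",
                                  "negative", "negative", "positive"))
toy_nrc   <- tibble(word = c("good", "good", "bad", "death", "happy", "happy"),
                    sentiment = c("positive", "joy", "negative",
                                  "negative", "positive", "joy"))

# AFINN style: sum the numeric scores per chunk.
afinn <- toy_indexed %>%
  inner_join(toy_afinn, by = "word") %>%
  group_by(book, index) %>%
  summarise(sentiment = sum(score), .groups = "drop") %>%
  mutate(method = "AFINN")

# bing and nrc style: count positive and negative labels, then subtract.
bing_and_nrc <- bind_rows(
  toy_indexed %>%
    inner_join(toy_bing, by = "word") %>%
    mutate(method = "Bing"),
  toy_indexed %>%
    inner_join(filter(toy_nrc, sentiment %in% c("positive", "negative")),
               by = "word") %>%
    mutate(method = "NRC")
) %>%
  count(book, method, index, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  select(book, index, method, sentiment)

# One row per book, chunk, and method, ready for a faceted plot.
comparison <- bind_rows(afinn, bing_and_nrc)
```

Filtering nrc down to its positive and negative labels is what makes it comparable with bing; the emotion categories such as joy are dropped before counting.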
We now have an estimate of the net sentiment (positive - negative) in the novel text for each sentiment lexicon. Let's plot them.
ggplot(aes(index, sentiment, fill = method)) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_grid(book ~ method)

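The three lexicons give results that differ in absolute terms but follow fairly similar relative trajectories through the novels. We see similar dips and peaks in sentiment at about the same places in each novel, but the absolute values differ noticeably. In some cases, the AFINN lexicon appears to find more positive sentiment than the NRC lexicon. This output also lets us compare across novels. First, you get a good sense of the differences in book length: Order of the Phoenix is much longer than Philosopher's Stone. Second, you can compare how the books in the series differ in their sentiment.
Common sentiment words
One advantage of having a data frame with both sentiment and word is that we can analyze the word counts that contribute to each sentiment.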
word_counts
## # A tibble: 3,313 × 3
##      word sentiment     n
##     <chr>     <chr> <int>
## 1    like  positive  2416
## 2    well  positive  1969
## 3   right  positive  1643
## 4    good  positive  1065
## 5    dark  negative  1034
## 6   great  positive   877
## 7   death  negative   757
## 8   magic  positive   606
## 9  better  positive   533
## 10 enough  positive   509
## # ... with 3,303 more rows
We can view this visually to assess the top n words for each sentiment.
ggplot(aes(reorder(word, n), n, fill = sentiment)) +
  geom_bar(alpha = 0.8, stat = "identity")
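As a rough sketch of how a table like word_counts and its top-n plot might be produced, the example below counts word and sentiment pairs on toy data and keeps the most frequent word per sentiment. The names `toy_words`, `toy_bing`, `word_counts_toy`, and `top_words` are hypothetical, the lexicon entries are made up, and the cutoff of one word per sentiment is chosen only to keep the toy output small.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Hypothetical tidy input: one word per row.
toy_words <- tibble(
  word = c("good", "good", "good", "dark", "dark", "well", "death", "day")
)

# Hand-made stand-in for the bing lexicon (illustrative entries only).
toy_bing <- tibble(
  word      = c("good", "well", "dark", "death"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# How often each word contributes to each sentiment.
word_counts_toy <- toy_words %>%
  inner_join(toy_bing, by = "word") %>%
  count(word, sentiment, sort = TRUE)

# Keep the top word per sentiment (use a larger n on real data).
top_words <- word_counts_toy %>%
  group_by(sentiment) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup()

# Horizontal bars, one panel per sentiment; print(p) would draw it.
p <- ggplot(top_words, aes(reorder(word, n), n, fill = sentiment)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free_y") +
  labs(x = NULL, y = "Contribution to sentiment") +
  coord_flip()
```

Looking at the top words is also a useful sanity check: in the real output above, "like" and "well" are counted as positive although in the novels they are probably often used as filler rather than as sentiment.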

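Sentiment analysis with larger units
A lot of useful work can be done by tokenizing at the word level, but sometimes it is useful or necessary to look at different units of text. For example, some sentiment analysis algorithms look beyond unigrams (single words) and try to understand the sentiment of a sentence as a whole. These algorithms try to understand that
I am not having a good day.
is a sad sentence, not a happy one, because of the negation. The Stanford CoreNLP tools are an example of this kind of sentiment analysis algorithm. For these, we may want to tokenize text into sentences. I use the philosophers_stone dataset to illustrate.
tibble(text = philosophers_stone) %>%
  unnest_tokens(sentence, text, token = "sentences")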
##                                                                        sentence
##                                                                           <chr>
## 1                                               the boy who lived  mr. and mrs.
## 2  dursley, of number four, privet drive, were proud to say that they were per
## 3  they were the last people you'd expect to be involved in anything strange o
## 4                                                                           mr.
## 5      dursley was the director of a firm called grunnings, which made drills.
## 6  he was a big, beefy man with hardly any neck, although he did have a very l
## 7                                                                          mrs.
## 8  dursley was thin and blonde and had nearly twice the usual amount of neck,
## 9  the dursleys had a small son called dudley and in their opinion there was n
## 10 the dursleys had everything they wanted, but they also had a secret, and th
## # ... with 6,588 more rows
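The argument token = "sentences" attempts to split the text on punctuation. As the output above shows, it can misfire on abbreviations such as "mr." and "mrs.".
Let's go on to break up the philosophers_stone text by chapter and sentence.
ps_sentences <- tibble(chapter = seq_along(philosophers_stone),
                       text = philosophers_stone) %>%
  unnest_tokens(sentence, text, token = "sentences")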
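This lets us assess net sentiment by chapter and by sentence. First, we need to track the sentence numbers, and then I create an index that tracks progress through each chapter. Next, I unnest the sentences into words. This gives us a tibble with the individual words by sentence within each chapter. Now, as before, I join the AFINN lexicon and compute the net sentiment score for each sentence position within each chapter. We can see that the most positive sentences are about halfway through chapter 9, toward the end of chapter 17, early in chapter 4, and so on.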
group_by(chapter, index) %>%
summarise(sentiment = sum(score, na.rm = TRUE)) %>%
arrange(desc(sentiment))
## Source: local data frame [1,401 x 3]
## Groups: chapter [17]
##
##    chapter index sentiment
##      <int> <dbl>     <int>
## 1        9  0.47        14
## 2       17  0.91        13
## 3        4  0.11        12
## 4       12  0.45        12
## 5       17  0.54        12
## 6        1  0.25        11
## 7       10  0.04        11
## 8       10  0.16        11
## 9       11  0.48        11
## 10      12  0.70        11
## # ... with 1,391 more rows
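The steps described above (number the sentences, index progress through each chapter, unnest to words, join AFINN, sum) can be sketched on toy text as follows. This is a minimal sketch under assumptions: `toy_chapters`, `toy_afinn`, `toy_sentences`, and `toy_book_sent` are hypothetical names, the two "chapters" are made-up text, and the scores are illustrative rather than real AFINN values.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Made-up text: one element per chapter.
toy_chapters <- c(
  "It was a good day. Then a bad storm came. Everyone was happy again.",
  "The night was dark. A great feast followed."
)

# Hand-made stand-in for the AFINN lexicon (illustrative scores only).
toy_afinn <- tibble(word  = c("good", "bad", "happy", "dark", "great"),
                    score = c(3, -3, 3, -2, 3))

# One row per sentence, keeping the chapter number.
toy_sentences <- tibble(chapter = seq_along(toy_chapters),
                        text    = toy_chapters) %>%
  unnest_tokens(sentence, text, token = "sentences")

toy_book_sent <- toy_sentences %>%
  group_by(chapter) %>%
  mutate(sentence_num = row_number(),
         index = round(sentence_num / n(), 2)) %>%   # progress through the chapter, 0 to 1
  ungroup() %>%
  unnest_tokens(word, sentence) %>%                  # back to one word per row
  inner_join(toy_afinn, by = "word") %>%
  group_by(chapter, index) %>%
  summarise(sentiment = sum(score, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(sentiment))
```

Because the index is rounded to two decimals, chapters with more than about 100 sentences will have several sentences sharing one index value, so each row is best read as a position in the chapter rather than a single sentence.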
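We can visualize this with a heatmap that shows the most positive and most negative sentiment as we progress through each chapter.
ggplot(book_sent, aes(index, factor(chapter), fill = sentiment)) +
  geom_tile(color = "white")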

