
Python Topic Modeling Visualization: Interactive Visualization of LDA and t-SNE | Code and Data Included

2023-07-25 12:43 Author: 拓端tecdat

Full-text download link: http://tecdat.cn/?p=6917

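I used Latent Dirichlet Allocation (LDA) to extract topics from a collection of papers. This tutorial walks through a natural language processing pipeline, starting from the raw data and moving through preparation, modeling, and visualization of the papers.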

We will cover the following points:

Topic modeling with LDA
Visualizing the topic model with pyLDAvis
Visualizing the LDA results with t-SNE

In [1]:

from scipy import sparse as sp

Populating the interactive namespace from numpy and matplotlib

In [2]:

docs = array(p_df['PaperText'])

Preprocessing and vectorizing the documents

In [3]:

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

def docs_preprocessor(docs):
    tokenizer = RegexpTokenizer(r'\w+')
    for idx in range(len(docs)):
        docs[idx] = docs[idx].lower()  # Convert to lowercase.
        docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

    # Remove numbers, but not words that contain numbers.
    docs = [[token for token in doc if not token.isdigit()] for doc in docs]

    # Remove words of three characters or fewer.
    docs = [[token for token in doc if len(token) > 3] for doc in docs]

    # Lemmatize all words in the documents.
    lemmatizer = WordNetLemmatizer()
    docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

    return docs

In [4]:

docs = docs_preprocessor(docs)
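To see what these filtering rules do on a concrete string, here is a minimal standard-library sketch. It is not the author's function: `simple_preprocess` and the sample sentence are hypothetical, it uses `re` in place of NLTK's `RegexpTokenizer`, and it skips lemmatization.

```python
import re

def simple_preprocess(text, min_len=4):
    """Lowercase, split on word characters, drop pure numbers and short tokens."""
    tokens = re.findall(r'\w+', text.lower())
    # Pure digit strings such as '2015' are dropped, but 'word2vec' is kept.
    return [t for t in tokens if not t.isdigit() and len(t) >= min_len]

# Hypothetical sample sentence, used only for illustration.
tokens = simple_preprocess("The 2 LSTMs were trained in 2015 on word2vec features.")
print(tokens)
```

The `min_len=4` default mirrors the `len(token) > 3` condition used in the preprocessing function.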

Computing bigrams/trigrams:

Because the topics are very similar, what distinguishes them is phrases rather than single words.

In [5]:

from gensim.models import Phrases

# Add bigrams and trigrams to the documents (only those that appear 10 times or more).
bigram = Phrases(docs, min_count=10)
trigram = Phrases(bigram[docs])

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
    for token in trigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram or trigram, add to document.
            docs[idx].append(token)

Using TensorFlow backend.
/opt/conda/lib/python3.6/site-packages/gensim/models/phrases.py:316: UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class
  warnings.warn("For a faster implementation, use the gensim.models.phrases.Phraser class")
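The idea of appending phrase tokens can be illustrated without gensim. The sketch below is a deliberate simplification: it joins adjacent word pairs based on a raw count threshold only, whereas gensim's `Phrases` also applies a scoring function. The function name `add_frequent_bigrams` and the toy documents are hypothetical.

```python
from collections import Counter

def add_frequent_bigrams(docs, min_count=2):
    """Append 'w1_w2' tokens for adjacent pairs seen at least min_count times."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    augmented = []
    for doc in docs:
        phrases = ['_'.join(pair) for pair in zip(doc, doc[1:]) if pair in frequent]
        # Keep the original tokens and add the detected phrases at the end.
        augmented.append(doc + phrases)
    return augmented

toy_docs = [
    ['gradient', 'descent', 'converges'],
    ['stochastic', 'gradient', 'descent'],
    ['topic', 'model'],
]
augmented = add_frequent_bigrams(toy_docs, min_count=2)
print(augmented[0])
```

Only the pair that occurs in two documents is joined, so the third document is left unchanged.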

Removing rare and common words

In [6]:

from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)
print('Number of unique words in initital documents:', len(dictionary))

# Filter out words that occur in fewer than 10 documents, or in more than 20% of the documents.
dictionary.filter_extremes(no_below=10, no_above=0.2)
print('Number of unique words after removing rare and common words:', len(dictionary))

Number of unique words in initital documents: 39534
Number of unique words after removing rare and common words: 6001

After removing common and rare words, we are left with only about 15% of the original vocabulary (6,001 of 39,534 words).
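The filtering rule itself is a document-frequency test, which the following sketch reproduces in plain Python under simplifying assumptions. `keep_by_document_frequency` and the toy documents are hypothetical, and the sketch ignores the other options that gensim's `filter_extremes` supports.

```python
from collections import Counter

def keep_by_document_frequency(docs, no_below, no_above):
    """Return the words found in at least no_below documents
    and in at most a fraction no_above of all documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document
    return {w for w, df in doc_freq.items()
            if df >= no_below and df / n_docs <= no_above}

toy_docs = [
    ['the', 'bandit', 'regret'],
    ['the', 'convex', 'regret'],
    ['the', 'tensor'],
    ['the', 'convex'],
]
kept = keep_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

Here 'the' is dropped for being too common, while 'bandit' and 'tensor' are dropped for being too rare.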

Vectorizing the data:
The first step is to obtain a bag-of-words representation of each document.

In [7]:

corpus = [dictionary.doc2bow(doc) for doc in docs]
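A bag-of-words vector is a sparse list of (word id, count) pairs. The sketch below shows the shape of that representation using only the standard library. `build_vocab` and `doc_to_bow` are hypothetical helpers, and the ids they assign will not match the ones gensim's `Dictionary` would produce.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def doc_to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens outside the vocabulary."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

toy_docs = [['topic', 'model', 'topic'], ['model', 'inference']]
vocab = build_vocab(toy_docs)
bows = [doc_to_bow(doc, vocab) for doc in toy_docs]
print(bows)
```

Word order is discarded: only which words occur, and how often, is kept.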

In [8]:

print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 6001
Number of documents: 403

With the bag-of-words corpus, we can go on to learn our topic model from the documents.

Training the LDA model

In [9]:

from gensim.models import LdaModel

In [10]:

# id2word, chunksize, iterations, num_topics, passes and eval_every must be
# defined beforehand; their values are not shown here.
%time model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize, \
                       alpha='auto', eta='auto', \
                       iterations=iterations, num_topics=num_topics, \
                       passes=passes, eval_every=eval_every)

CPU times: user 3min 58s, sys: 348 ms, total: 3min 58s
Wall time: 3min 59s

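How do we choose the number of topics?

LDA is an unsupervised technique, which means that before running the model we do not know how many topics exist in our corpus. Topic coherence is one of the main techniques used to determine the number of topics.

However, I used the LDA visualization tool pyLDAvis instead, tried several numbers of topics, and compared the results. Four appeared to be the number of topics that separates the topics best.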
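For comparison with the visual approach, topic coherence can be sketched in a few lines. The example uses the UMass measure, which is one common coherence score; the article does not say which measure it has in mind, so this choice is an assumption. `umass_coherence` and the toy documents are hypothetical, and the function assumes every top word occurs in at least one document.

```python
import math

def umass_coherence(top_words, docs):
    """UMass coherence: sum of log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs,
    where D counts the documents containing the given word(s)."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_count(top_words[m], top_words[l])
            score += math.log((joint + 1) / doc_count(top_words[l]))
    return score

toy_docs = [
    ['bandit', 'regret', 'reward'],
    ['bandit', 'regret'],
    ['tensor', 'rank'],
    ['tensor', 'rank', 'matrix'],
]
coherent = umass_coherence(['bandit', 'regret'], toy_docs)
mixed = umass_coherence(['bandit', 'tensor'], toy_docs)
print(coherent > mixed)
```

Words that co-occur in the same documents score higher, so one would typically train models with different topic counts and prefer the count with the higher average coherence.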

In [11]:

# In pyLDAvis 3.3 and later this module is named pyLDAvis.gensim_models.
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [12]:

pyLDAvis.gensim.prepare(model, corpus, dictionary)

Out[12]:

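What do we see here?

The left panel, labeled Intertopic Distance Map, shows circles representing the different topics and the distances between them. Similar topics appear closer together, and dissimilar topics farther apart. The relative size of a topic's circle corresponds to the relative frequency of that topic in the corpus.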

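How do we evaluate our model?

Split each document into two parts and check whether the topics assigned to them are similar. => the more similar, the better

Compare randomly chosen documents with each other. => the less similar, the better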

In [13]:

from sklearn.metrics.pairwise import cosine_similarity

p_df['tokenz'] = docs

# Split each tokenized document into its first and second half.
docs1 = p_df['tokenz'].apply(lambda l: l[:len(l) // 2])
docs2 = p_df['tokenz'].apply(lambda l: l[len(l) // 2:])


Transforming the data

In [14]:

corpus1 = [dictionary.doc2bow(doc) for doc in docs1]
corpus2 = [dictionary.doc2bow(doc) for doc in docs2]

# Transform both corpora with the trained LDA model.
lda_corpus1 = model[corpus1]
lda_corpus2 = model[corpus2]

In [15]:

from collections import OrderedDict

def get_doc_topic_dist(model, corpus, kwords=False):
    '''
    LDA transformation: for each document, the model returns only the topics
    with non-zero weight. This function fills in the missing topics with zero
    so that the documents form a matrix in topic space.
    '''
    top_dist = []
    keys = []

    for d in corpus:
        tmp = {i: 0 for i in range(num_topics)}
        tmp.update(dict(model[d]))
        vals = list(OrderedDict(tmp).values())
        top_dist += [array(vals)]
        if kwords:
            keys += [array(vals).argmax()]

    return array(top_dist), keys
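The core of this function is turning a sparse list of (topic id, weight) pairs into a fixed-length vector. A minimal sketch of that step, without NumPy or gensim, might look like this; `to_dense` and the sample values are hypothetical.

```python
def to_dense(sparse_topics, num_topics):
    """Expand (topic_id, weight) pairs into a list with one weight per topic."""
    dense = [0.0] * num_topics
    for topic_id, weight in sparse_topics:
        dense[topic_id] = weight
    return dense

# Hypothetical model output for one document: only two topics have weight.
sparse = [(1, 0.25), (3, 0.75)]
dense = to_dense(sparse, num_topics=4)

# The dominant topic is the index of the largest weight.
dominant = max(range(len(dense)), key=dense.__getitem__)
print(dense, dominant)
```

Having every document at the same length is what allows the topic distributions to be stacked into a matrix and compared with cosine similarity.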

Intra similarity: cosine similarity for corresponding parts of a doc(higher is better):
0.906086532099
Inter similarity: cosine similarity between random parts (lower is better):
0.846485334252
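The cell that produced these two numbers is not shown. The following is a minimal pure-Python sketch of the idea, not the author's code: `cosine`, `intra_inter_similarity`, the fixed random seed, and the toy topic vectors are all assumptions made for illustration.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def intra_inter_similarity(first_halves, second_halves, seed=0):
    """Mean similarity between halves of the same document (intra) and
    between a first half and the second half of a random other document (inter)."""
    n = len(first_halves)
    rng = random.Random(seed)
    intra = sum(cosine(a, b) for a, b in zip(first_halves, second_halves)) / n
    inter_scores = []
    for i in range(n):
        j = rng.choice([k for k in range(n) if k != i])
        inter_scores.append(cosine(first_halves[i], second_halves[j]))
    return intra, sum(inter_scores) / n

# Toy topic distributions: each document is dominated by a different topic.
first = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]]
second = [[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.2, 0.0, 0.8]]
intra, inter = intra_inter_similarity(first, second)
print(intra > inter)
```

A well-separated model gives a large gap between the two values; in the output above the gap is fairly small (about 0.91 versus 0.85).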

Let's look at the words that appear in each topic.

In [17]:

def explore_topic(lda_model, topic_number, topn, output=True):
    """
    Print and return a list of the topn terms for a topic.
    """
    terms = []
    for term, frequency in lda_model.show_topic(topic_number, topn=topn):
        terms += [term]
        if output:
            print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))

    return terms

In [18]:

term                 frequency

Topic 0 |---------------------
data_set             0.006
embedding            0.004
query                0.004
document             0.003
tensor               0.003
multi_label          0.003
graphical_model      0.003
singular_value       0.003
topic_model          0.003
margin               0.003

Topic 1 |---------------------
policy               0.007
regret               0.007
bandit               0.006
reward               0.006
active_learning      0.005
agent                0.005
vertex               0.005
item                 0.005
reward_function      0.005
submodular           0.004

Topic 2 |---------------------
convolutional        0.005
generative_model     0.005
variational_inference 0.005
recurrent            0.004
gaussian_process     0.004
fully_connected      0.004
recurrent_neural     0.004
hidden_unit          0.004
deep_learning        0.004
hidden_layer         0.004

Topic 3 |---------------------
convergence_rate     0.007
step_size            0.006
matrix_completion    0.006
rank_matrix          0.005
gradient_descent     0.005
regret               0.004
sample_complexity    0.004
strongly_convex      0.004
line_search          0.003
sample_size          0.003

From the output above, we can inspect each topic and assign it an interpretable label. Here I labeled them as follows:

In [19]:

top_labels = {0: 'Statistics', 1: 'Online Learning', 2: 'Deep Learning', 3: 'Numerical Analysis'}

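In [20]:

import re
import nltk
from nltk.corpus import stopwords

# Assumed definition: the source uses `stops` without showing where it is set.
stops = set(stopwords.words('english'))

# The function header is reconstructed from the tokenizer=paper_to_wordlist
# argument passed to TfidfVectorizer.
def paper_to_wordlist(paper):
    '''
    Convert the text of a paper into a list of cleaned words.
    '''
    # 1. Remove non-letters.
    paper_text = re.sub("[^a-zA-Z]", " ", paper)
    # 2. Convert to lowercase and split into words.
    words = paper_text.lower().split()
    # 3. Remove stop words.
    words = [w for w in words if w not in stops]
    # 4. Remove short words.
    words = [t for t in words if len(t) > 2]
    # 5. Lemmatize.
    words = [nltk.stem.WordNetLemmatizer().lemmatize(t) for t in words]
    return words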

In [21]:

from sklearn.feature_extraction.text import TfidfVectorizer

tvectorizer = TfidfVectorizer(input='content', analyzer='word', lowercase=True, stop_words='english',
                              tokenizer=paper_to_wordlist, ngram_range=(1, 3), min_df=40, max_df=0.20,
                              norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=True)

dtm = tvectorizer.fit_transform(p_df['PaperText']).toarray()

In [22]:

top_dist = []
for d in corpus:
    tmp = {i: 0 for i in range(num_topics)}
    tmp.update(dict(model[d]))
    vals = list(OrderedDict(tmp).values())
    top_dist += [array(vals)]

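In [23]:

top_dist, lda_keys = get_doc_topic_dist(model, corpus, True)
# scikit-learn 1.0 and later provide get_feature_names_out();
# get_feature_names() was removed in version 1.2.
features = tvectorizer.get_feature_names()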

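In [24]:

top_ws = []
for n in range(len(dtm)):
    # Indices of the four highest-weighted TF-IDF features of document n.
    inds = argsort(dtm[n])[::-1][:4]
    tmp = [features[i] for i in inds]
    top_ws += [' '.join(tmp)]

# Assumed: the source uses the 'clusters' and 'Text_Rep' columns without showing
# where they are set; the dominant topic and the top TF-IDF words are used here.
p_df['clusters'] = lda_keys
p_df['Text_Rep'] = top_ws

cluster_colors = {0: 'blue', 1: 'green', 2: 'yellow', 3: 'red', 4: 'skyblue', 5: 'salmon',
                  6: 'orange', 7: 'maroon', 8: 'crimson', 9: 'black', 10: 'gray'}

p_df['colors'] = p_df['clusters'].apply(lambda l: cluster_colors[l])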

In [25]:

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(top_dist)

In [26]:

p_df['X_tsne'] = X_tsne[:, 0]
p_df['Y_tsne'] = X_tsne[:, 1]

In [27]:

from bokeh.plotting import figure, show, output_notebook, save  # output to file
from bokeh.models import HoverTool, value, LabelSet, Legend, ColumnDataSource

output_notebook()

BokehJS 0.12.5 successfully loaded.

In [28]:

source = ColumnDataSource(dict(
    x=p_df['X_tsne'],
    y=p_df['Y_tsne'],
    color=p_df['colors'],
    label=p_df['clusters'].apply(lambda l: top_labels[l]),
#     msize= p_df['marker_size'],
    topic_key=p_df['clusters'],
    title=p_df[u'Title'],
    content=p_df['Text_Rep']
))

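In [29]:

title = 'T-SNE visualization of topics'

# Assumed: the source does not show where plot_lda is created; a default figure is used here.
plot_lda = figure(title=title)

plot_lda.scatter(x='x', y='y', legend='label', source=source,
                 color='color', alpha=0.8, size=10)  # 'msize', )

show(plot_lda)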


This article is excerpted from "Python Topic Modeling Visualization: Interactive Visualization of LDA and t-SNE".
