《二階序列模型》Second order sequence model
來源:https://e2eml.school/transformers.html#softmax
中英雙語版,由各類翻譯程序和少量自己理解的意思做中文注釋
----------------------------------------------------------------------------------------------------------------------
Predicting the next word based on only the current word is hard.
僅根據(jù)當前單詞預測下一個單詞是很困難的。
That's like predicting the rest of a tune after being given just the first note.
這就像在只給出第一個音符后預測曲調(diào)的其余部分。
Our chances are a lot better if we can at least get two notes to go on.
如果我們至少能得到兩個音符輸入,那我們的預測就會好很多。
We can see how this works in another toy language model for our computer commands.
我們可以在計算機命令的另一個玩具語言模型中看到它是如何工作的。
We expect that this one will only ever see two sentences, in a 40/60 proportion.
我們預計這個模型只會看到以下兩個句子,出現(xiàn)比例為40/60。
Check whether the battery ran down please.(請檢查電池是否耗盡。)
Check whether the program ran please.(請檢查程序是否運行。)
A Markov chain illustrates a first order model for this.
馬爾可夫鏈可以展示這種情況下的一階模型。
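To make this concrete, here is a minimal sketch (not from the original article) that builds these first order transition probabilities in Python; the two example commands and the 40/60 weighting come from above, while the names and structure are just illustrative.
為了更具體一些,下面是一段極簡的 Python 示例(非原文內(nèi)容),按上面 40/60 的權重從兩個示例命令構建一階轉移概率;除句子和權重外,其餘命名和寫法僅作示意。

from collections import defaultdict

# The two toy commands, weighted 40/60 as described above.
sentences = [
    ("check whether the battery ran down please".split(), 0.4),
    ("check whether the program ran please".split(), 0.6),
]

# Accumulate weighted counts of word -> next word.
counts = defaultdict(lambda: defaultdict(float))
for words, weight in sentences:
    for current_word, next_word in zip(words, words[1:]):
        counts[current_word][next_word] += weight

# Normalize each row so its outgoing probabilities sum to 1.
transitions = {
    word: {nxt: w / sum(nexts.values()) for nxt, w in nexts.items()}
    for word, nexts in counts.items()
}

print(transitions["check"])  # {'whether': 1.0}
print(transitions["the"])    # {'battery': 0.4, 'program': 0.6}
print(transitions["ran"])    # {'down': 0.4, 'please': 0.6}

The rows for "the" and "ran" are the two places where this first order model still has to guess.
“the”和“ran”這兩行就是一階模型仍然需要猜測的地方。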

Here we can see that if our model looked at the two most recent words, instead of just one, it could do a better job.
在這里,我們可以看到,如果我們的模型查看最近的兩個單詞,而不僅僅是一個,它可以做得更好。
When it encounters battery ran, it knows that the next word will be down, and when it sees program ran the next word will be please.
當它遇到“battery ran”時,它知道下一個單詞將是“down”;當它看到“program ran”時,下一個單詞將是“please”。
This eliminates one of the branches in the model, reducing uncertainty and increasing confidence.
這消除了模型中的一個分支,減少了不確定性并增加了置信度。
Looking back two words turns this into a second order Markov model.
回顧兩個詞,這變成了二階馬爾可夫模型。
It gives more context on which to base next word predictions.
它提供了更多上下文,作為下一個單詞預測的基礎。
Second order Markov chains are more challenging to draw, but here are the connections that demonstrate their value.
二階馬爾可夫鏈的繪制更具挑戰(zhàn)性,但以下的連接可證明其價值。

To highlight the difference between the two, here is the first order transition matrix,
為了突出兩者之間的區(qū)別,這里是一階轉移矩陣,

and here is the second order transition matrix.
這是二階轉移矩陣。
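The same sketch extends to the second order case (same assumptions as before: the two toy commands, weighted 40/60); rows are now keyed by the pair of most recent words, and the printed rows match the behaviour described above: battery ran and program ran are no longer ambiguous.
同樣的寫法可以擴展到二階的情況(假設與前面相同:兩個玩具命令,按 40/60 加權);現(xiàn)在每一行以最近的兩個單詞為索引,打印出的幾行與上文描述一致:“battery ran”和“program ran”不再有歧義。

from collections import defaultdict

sentences = [
    ("check whether the battery ran down please".split(), 0.4),
    ("check whether the program ran please".split(), 0.6),
]

# Accumulate weighted counts of (word, word) -> next word.
counts = defaultdict(lambda: defaultdict(float))
for words, weight in sentences:
    for w1, w2, nxt in zip(words, words[1:], words[2:]):
        counts[(w1, w2)][nxt] += weight

# Normalize each row so its outgoing probabilities sum to 1.
transitions = {
    pair: {nxt: w / sum(nexts.values()) for nxt, w in nexts.items()}
    for pair, nexts in counts.items()
}

print(transitions[("battery", "ran")])   # {'down': 1.0}
print(transitions[("program", "ran")])   # {'please': 1.0}
print(transitions[("whether", "the")])   # {'battery': 0.4, 'program': 0.6}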

Notice how the second order matrix has a separate row for every combination of words (most of which are not shown here).
請注意,二階矩陣對于每個單詞組合都有單獨的行(此處未顯示其中大部分)。
That means that if we start with a vocabulary size of N then the transition matrix has N^2 rows.
這意味著如果詞匯表的大小為 N,那么轉移矩陣就有 N^2 行。
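As a quick back-of-the-envelope check of that growth (the vocabulary size below is made up, not a number from the article):
對這種增長做一個粗略的估算(下面的詞匯表大小是假設的,不是原文中的數(shù)字):

N = 50_000                    # hypothetical vocabulary size
first_order_rows = N          # one row per word
second_order_rows = N ** 2    # one row per pair of words
print(f"{first_order_rows:,}")    # 50,000
print(f"{second_order_rows:,}")   # 2,500,000,000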
What this buys us is more confidence.
這給我們帶來的是更多的置信度。
There are more ones and fewer fractions in the second order model.
在二階模型中有更多的1和更少的分數(shù)。
There's only one row with fractions in it, one branch in our model.
只有一行包含分數(shù),也就是我們模型中僅剩的一個分支。
Intuitively, looking at two words instead of just one gives more context, more information on which to base a next word guess.
直觀地說,看兩個單詞而不僅僅是一個單詞會給出更多的上下文和更多的信息,從而為下一個單詞的猜測提供依據(jù)。
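Continuing the two sketches above, a small helper can turn the "more confidence" observation into something countable; the numbers in the comments assume the toy models built earlier and are not from the article.
接著上面的兩段示例,可以用一個小函數(shù)把“更多置信度”變成可以計數(shù)的東西;注釋中的數(shù)字來自前面構建的玩具模型,不是原文給出的。

def count_certain_rows(transitions):
    # A row is "certain" when its best next word already has probability 1.0.
    certain = sum(1 for nexts in transitions.values() if max(nexts.values()) == 1.0)
    return certain, len(transitions) - certain

# With the toy models from the sketches above:
#   first order:  5 certain rows, 2 branching rows ("the" and "ran")
#   second order: 6 certain rows, 1 branching row  (("whether", "the"))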