Second order sequence model with skips
Source: https://e2eml.school/transformers.html#softmax
Chinese-English bilingual version; the Chinese annotations came from various translation tools plus a little of my own interpretation
Related articles are collected in the series: Transformers from Scratch (with Chinese annotations)
--------------------------------------------------------------------------------------------------------------------
A second order model works well when we only have to look back two words to decide what word comes next.
What about when we have to look back further?
Imagine we are building yet another language model.
This one only has to represent two sentences, each equally likely to occur.
Check the program log and find out whether it ran please.

Check the battery log and find out whether it ran down.
In this example, in order to determine which word should come after ran, we would have to look back 8 words into the past.
If we want to improve on our second order language model, we can of course consider third- and higher order models.
However, with a significant vocabulary size this takes a combination of creativity and brute force to execute.
A naive implementation of an eighth order model would have N^8 rows, a ridiculous number for any reasonable vocabulary.
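To get a feel for just how hopeless the naive table is, here is a tiny sketch of the arithmetic (the vocabulary size of 50,000 is an assumed, illustrative number, not one taken from the original article).

```python
# Illustrative arithmetic only. An order-k transition table needs one row
# for every possible k-word context, i.e. N**k rows.
N = 50_000  # assumed, illustrative vocabulary size

second_order_rows = N ** 2  # 2.5e+09 rows: large, but manageable as a sparse matrix
eighth_order_rows = N ** 8  # roughly 3.9e+37 rows: out of the question

print(f"second order: {second_order_rows:.1e} rows")
print(f"eighth order: {eighth_order_rows:.1e} rows")
```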
Instead, we can do something sly and make a second order model, but consider the combinations of the most recent word with each of the words that came before.
It's still second order, because we're only considering two words at a time, but it allows us to reach back further and capture long range dependencies.
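As a rough sketch of that idea (the function name and the whitespace tokenization are my own hypothetical choices, not the article's), each feature is simply a pair of the most recent word with one of the words that came before it:

```python
def skip_pair_features(tokens):
    """Pair the most recent word with every word that came before it.

    Each (earlier_word, latest_word) pair is one feature of the
    second-order-with-skips model; the order of the earlier words, and the
    combinations among them, are deliberately thrown away.
    """
    latest = tokens[-1]
    return sorted({(earlier, latest) for earlier in tokens[:-1]})

context = "check the program log and find out whether it ran".split()
print(skip_pair_features(context))
# [('and', 'ran'), ('check', 'ran'), ('find', 'ran'), ('it', 'ran'),
#  ('log', 'ran'), ('out', 'ran'), ('program', 'ran'), ('the', 'ran'),
#  ('whether', 'ran')]
```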
The difference between this second-order-with-skips and a full umpteenth-order model is that we discard most of the word order information and combinations of preceding words.
What remains is still pretty powerful.
Markov chains fail us entirely now, but we can still represent the link between each pair of preceding words and the words that follow.
Here we've dispensed with numerical weights, and instead are showing only the arrows associated with non-zero weights.
Larger weights are shown with heavier lines.

Here's what it might look like in a transition matrix.

This view only shows the rows relevant to predicting the word that comes after ran.
It shows instances where the most recent word (ran) is preceded by each of the other words in the vocabulary.
Only the relevant values are shown. All the empty cells are zeros.
The first thing that becomes apparent is that, when trying to predict the word that comes after ran, we no longer look at just one line, but rather a whole set of them.
We've moved out of the Markov realm now.
Each row no longer represents the state of the sequence at a particular point.
Instead, each row represents one of many features that may describe the sequence at a particular point.
The combination of the most recent word with each of the words that came before makes for a collection of applicable rows, maybe a large collection.
Because of this change in meaning, each value in the matrix no longer represents a probability, but rather a vote. Votes will be summed and compared to determine next word predictions.
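Written as a formula (my notation, not the article's): if F is the set of feature rows activated by the current sequence, each of the form (earlier word, ran), and V[f, w] is the vote that row f casts for candidate word w, then the predicted next word is

$$\hat{w} = \underset{w}{\arg\max} \sum_{f \in F} V[f, w]$$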
The next thing that becomes apparent is that most of the features don't matter.
Most of the words appear in both sentences, and so the fact that they have been seen is of no help in predicting what comes next.
They all have a value of .5.
The only two exceptions are battery and program.
They have some 1 and 0 weights associated with them.
The feature battery, ran indicates that ran was the most recent word and that battery occurred somewhere earlier in the sentence.
This feature has a weight of 1 associated with down and a weight of 0 associated with please.
Similarly, the feature program, ran has the opposite set of weights.
This structure shows that it is the presence of these two words earlier in the sentence that is decisive in predicting which word comes next.
To convert this set of word-pair features into a next word estimate, the values of all the relevant rows need to be summed.
Adding down the column, the sequence Check the program log and find out whether it ran generates sums of 0 for all the words, except a 4 for down and a 5 for please.
The sequence Check the battery log and find out whether it ran does the same, except with a 5 for down and a 4 for please.
By choosing the word with the highest vote total as the next word prediction, this model gets us the right answer, despite having an eight word deep dependency.
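Here is a minimal end-to-end sketch of that worked example (the function names and the train/normalize details are my own assumptions; the article only describes the resulting weights and vote totals). It learns the 0.5 / 1 / 0 weights from the two sentences and reproduces the 5-versus-4 tallies.

```python
from collections import defaultdict

def skip_pair_features(tokens):
    """One feature per (earlier word, most recent word) pair, order ignored."""
    latest = tokens[-1]
    return {(earlier, latest) for earlier in tokens[:-1]}

def train(sentences):
    """Estimate each feature row's votes as P(next word | feature)."""
    counts = defaultdict(lambda: defaultdict(float))
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(1, len(tokens)):
            for feature in skip_pair_features(tokens[:i]):
                counts[feature][tokens[i]] += 1.0
    weights = {}
    for feature, row in counts.items():
        total = sum(row.values())
        weights[feature] = {word: c / total for word, c in row.items()}
    return weights

def predict(weights, context):
    """Sum the votes of every active feature row; the biggest total wins."""
    votes = defaultdict(float)
    for feature in skip_pair_features(context):
        for word, weight in weights.get(feature, {}).items():
            votes[word] += weight
    return max(votes, key=votes.get), dict(votes)

sentences = [
    "check the program log and find out whether it ran please",
    "check the battery log and find out whether it ran down",
]
weights = train(sentences)

context = "check the program log and find out whether it ran".split()
print(predict(weights, context))  # 'please' wins: 5.0 votes vs 4.0 for 'down'

context = "check the battery log and find out whether it ran".split()
print(predict(weights, context))  # 'down' wins: 5.0 votes vs 4.0 for 'please'
```

Normalizing each row's counts is just one way to arrive at the 0.5 / 1 / 0 values shown above; the article presents the weights directly rather than prescribing a training procedure.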