Attention as matrix multiplication
Source: https://e2eml.school/transformers.html#softmax
Bilingual Chinese-English edition, with Chinese annotations produced by various translation tools and a little of my own interpretation
Related articles are collected in the series: Transformers from Scratch (Chinese annotations)
--------------------------------------------------------------------------------------------------------------------
Feature weights could be straightforward to build by counting how often each word pair/next-word transition occurs in training, but attention masks are not.
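To make the counting idea concrete, here is a minimal NumPy sketch (not from the original post; the toy vocabulary, corpus, and variable names are made up for illustration) of turning transition counts into next-word feature weights:

import numpy as np

# Toy vocabulary and corpus, made up for illustration.
vocab = ["check", "the", "battery", "program", "log"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
corpus = ["check", "the", "battery", "check", "the", "program", "check", "the", "log"]

# Count how often each (word, next word) transition occurs in training.
counts = np.zeros((len(vocab), len(vocab)))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[word_to_idx[prev], word_to_idx[nxt]] += 1

# Normalize each row into transition probabilities: the feature weights.
weights = counts / counts.sum(axis=1, keepdims=True).clip(min=1)
print(weights[word_to_idx["the"]])  # ~[0. 0. 0.333 0.333 0.333]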
Up to this point, we've pulled the mask vector out of thin air. How transformers find the relevant mask matters. It would be natural to use some sort of lookup table, but now we are focusing hard on expressing everything as matrix multiplications. We can use the same lookup method we introduced above by stacking the mask vectors for every word into a matrix and using the one-hot representation of the most recent word to pull out the relevant mask.
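As a minimal sketch of that one-hot lookup, assuming made-up mask values and NumPy (none of this is from the original post):

import numpy as np

# One mask vector per word, stacked into a matrix (one row per word).
# The values are made up for illustration.
masks = np.array([
    [1.0, 0.0, 1.0, 0.0],  # mask for word 0
    [0.0, 1.0, 0.0, 1.0],  # mask for word 1
    [1.0, 1.0, 0.0, 0.0],  # mask for word 2
])

# One-hot representation of the most recent word (word 1 here).
query = np.array([0.0, 1.0, 0.0])

# Matrix multiplication pulls out the matching row: the relevant mask.
mask = query @ masks
print(mask)  # [0. 1. 0. 1.]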

In the matrix showing the collection of mask vectors, we've only shown the one we're trying to pull out, for clarity.
We're finally getting to the point where we can start tying into the paper. This mask lookup is represented by the QK^T term in the attention equation.

The query Q represents the feature of interest, and the matrix K represents the collection of masks. Because K stores the masks in columns rather than rows, it needs to be transposed (with the T operator) before multiplying.
By the time we're all done, we'll make some important modifications to this, but at this level it captures the concept of a differentiable lookup table that transformers make use of.