CNN卷積神經(jīng)網(wǎng)絡 Convolutional Neural Networks
MIT公開課程《Introduction to Machine Learning》第八章譯文
So far, we have studied what are called fully connected neural networks, in which all of the units at one layer are connected to all of the units in the next layer. This is a good arrangement when we don’t know anything about what kind of mapping from inputs to outputs we will be asking the network to learn to approximate. But if we do know something about our problem, it is better to build it into the structure of our neural network. Doing so can save computation time and significantly diminish the amount of training data required to arrive at a solution that generalizes robustly.
到目前為止,我們研究的是所謂的全連接神經(jīng)網(wǎng)絡,其中一層的所有單元都連接到下一層的全部單元。當我們對將要讓網(wǎng)絡學習逼近的輸入到輸出映射一無所知時,這是一種很好的安排。但如果我們確實了解問題的某些性質(zhì),那么最好將其構(gòu)建到神經(jīng)網(wǎng)絡的結(jié)構(gòu)中。這樣做可以節(jié)省計算時間,并顯著減少得到一個能穩(wěn)健泛化的解所需的訓練數(shù)據(jù)量。
One very important application domain of neural networks, where the methods have achieved an enormous amount of success in recent years, is signal processing. Signals might be spatial (in two-dimensional camera images or three-dimensional depth or CAT scans) or temporal (speech or music). If we know that we are addressing a signal-processing problem, we can take advantage of invariant properties of that problem. In this chapter, we will focus on two-dimensional spatial problems (images) but use one-dimensional ones as a simple example. Later, we will address temporal problems.
神經(jīng)網(wǎng)絡的一個非常重要的應用領域是信號處理,近年來這些方法在其中取得了巨大的成功。信號可能是空間的(二維相機圖像、三維深度圖或CAT掃描)或時間的(語音或音樂)。如果我們知道自己面對的是一個信號處理問題,就可以利用該問題的不變性質(zhì)。本章將關注二維空間問題(圖像),但會用一維問題作為簡單的例子。稍后,我們再討論時間問題。
Imagine that you are given the problem of designing and training a neural network that takes an image as input, and outputs a classification, which is positive if the image contains a cat and negative if it does not. An image is described as a two-dimensional array of pixels( A pixel is a “picture element.”), each of which may be represented by three integer values, encoding intensity levels in red, green, and blue color channels.
想象一下,你面臨的問題是設計和訓練一個神經(jīng)網(wǎng)絡,該網(wǎng)絡將圖像作為輸入,并輸出一個分類,如果圖像包含貓,則分類為正,如果不包含貓,分類為負。圖像被描述為二維像素陣列(像素是“圖片元素”),每個像素可以由三個整數(shù)值表示,以紅色、綠色和藍色通道編碼強度級別。
There are two important pieces of prior structural knowledge we can bring to bear on this problem:
對于這個問題,我們可以利用兩個重要的先驗結(jié)構(gòu)知識:
Spatial locality: The set of pixels we will have to take into consideration to find a cat will be near one another in the image. So, for example, we won’t have to consider some combination of pixels in the four corners of the image, in order to see if they encode cat-ness.
空間局部性:為了找到一只貓,我們需要考慮的像素集合在圖像中是彼此靠近的。因此,舉例來說,我們不必去考慮圖像四個角上的某種像素組合,來判斷它們是否編碼了貓的特征。(個人理解:貓的圖像像素集是一整塊“粘”在一起的,所以在尋找存在貓的像素集時不必考慮零散/東一塊西一塊的組合拼湊后是否會存在貓)
Translation invariance: The pattern of pixels that characterizes a cat is the same no matter where in the image the cat occurs. Cats don’t look different if they’re on the left or the right side of the image.
平移不變性:無論貓出現(xiàn)在圖像中的哪個位置,捕獲貓?zhí)卣鞯南袼啬J绞窍嗤?。貓在圖像的左側(cè)或右側(cè)看起來并沒有什么不同。
We will design neural network structures that take advantage of these properties.
我們將設計利用這些特性的神經(jīng)網(wǎng)絡結(jié)構(gòu)。
1 Filters
We begin by discussing image filters (Unfortunately in AI/ML/CS/Math, the word “filter” gets used in many ways: in addition to the one we describe here, it can describe a temporal process (in fact, our moving averages are a kind of filter) and even a somewhat esoteric algebraic structure). An image filter is a function that takes in a local spatial neighborhood of pixel values and detects the presence of some pattern in that data.
我們首先討論圖像過濾器(不幸的是,在AI/ML/CS/Math中,“過濾器”一詞有很多種用法:除了我們在這里描述的過濾器,它還可以描述一個時間過程(事實上,我們的移動平均值是一種過濾器),甚至是一種有點深奧的代數(shù)結(jié)構(gòu))。圖像過濾器是一種函數(shù),它獲取像素值的局部空間鄰域,并檢測數(shù)據(jù)中是否存在某種圖案。
Let’s consider a very simple case to start, in which we have a 1-dimensional binary “image” and a filter F of size two. The filter is a vector of two numbers, which we will move along the image, taking the dot product between the filter values and the image values at each step, and aggregating the outputs to produce a new image.
讓我們從一個非常簡單的例子開始:我們有一幅一維二值“圖像”和一個大小為2的濾波器F。濾波器是一個由兩個數(shù)組成的向量,我們將使其沿著圖像滑動,在每一步取濾波器值和圖像值之間的點積,并把輸出匯集起來生成一幅新圖像。
Let X be the original image, of size d; then pixel i of the output image is specified by
設X為原始圖像,大小為d;則輸出圖像的像素i由下式給出

$$Y_i = F \cdot (X_{i-1}, X_i)$$
To ensure that the output image is also of dimension d, we will generally “pad” the input image with 0 values if we need to access pixels that are beyond the bounds of the input image. This process of applying the filter to the image to create a new image is called “convolution.” (And filters are also sometimes called convolutional kernels.)
為了確保輸出圖像也是維度d的,如果我們需要訪問超出輸入圖像邊界的像素,通常會用0值“填充”輸入圖像。將濾波器應用于圖像以創(chuàng)建新圖像的過程稱為“卷積” (濾波器有時也稱為卷積核。)
If you are already familiar with what a convolution is, you might notice that this definition corresponds to what is often called a correlation and not to a convolution. Indeed, correlation and convolution refer to different operations in signal processing. However, in the neural networks literature, most libraries implement the correlation (as described in this chapter) but call it convolution. The distinction is not significant; in principle, if convolution is required to solve the problem, the network could learn the necessary weights. For a discussion of the difference between convolution and correlation and the conventions used in the literature you can read section 9.1 in this excellent book: https://www.deeplearningbook.org.
如果您已經(jīng)熟悉卷積是什么,您可能會注意到這里的定義對應的其實是通常所說的“相關”(correlation),而不是卷積。實際上,相關和卷積在信號處理中指的是不同的運算。然而,在神經(jīng)網(wǎng)絡文獻中,大多數(shù)庫實現(xiàn)的是相關(如本章所述),卻稱之為卷積。這個區(qū)別并不重要;原則上,如果解決問題需要的是卷積,網(wǎng)絡可以自己學到所需的權重。關于卷積與相關的差異以及文獻中的慣例,您可以閱讀這本優(yōu)秀教材的9.1節(jié):https://www.deeplearningbook.org。
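下面用NumPy粗略演示一下二者的差別(圖像和濾波器都是隨手取的,只是一個示意):

```python
import numpy as np

x = np.array([0., 0., 1., 1., 1., 0.])   # 隨手取的一段一維“圖像”
f = np.array([-1., 1.])                  # 非對稱濾波器,才能看出差別

corr = np.correlate(x, f, mode="full")   # 相關:濾波器按原樣滑動
conv = np.convolve(x, f, mode="full")    # 卷積:濾波器先翻轉(zhuǎn)成 (1, -1) 再滑動
print(corr)
print(conv)
```

對這個特殊的濾波器,翻轉(zhuǎn)恰好等于取反,所以兩個結(jié)果互為相反數(shù);一般情況下,二者的差別只是把模式“照著找”還是“翻轉(zhuǎn)后找”。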
Here is a concrete example. Let the filter F1 = (?1, +1). Then given the first image below, we can convolve it with filter F1 to obtain the second image. You can think of this filter as a detector for “l(fā)eft edges” in the original image—to see this, look at the places where there is a 1 in the output image, and see what pattern exists at that position in the input image. Another interesting filter is F2 = (?1, +1, ?1). The third image below shows the result of convolving the first image with F2.
這是一個具體的例子。設濾波器F1=(?1,+1)。給定下面的第一幅圖像,我們可以將其與濾波器F1卷積得到第二幅圖像。您可以把這個濾波器看作原始圖像中“左邊緣”的檢測器——要驗證這一點,可以查看輸出圖像中值為1的位置,再看看輸入圖像相同位置處出現(xiàn)的模式。另一個有趣的濾波器是F2=(?1,+1,?1)。下面的第三幅圖像顯示了第一幅圖像與F2卷積的結(jié)果。
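由于示例圖片沒有隨譯文貼出,這里用一個假設的一維二值圖像(不是原圖)粗略演示F1和F2的效果,按前文定義用零填充、保持輸出長度等于輸入長度:

```python
import numpy as np

def conv1d(image, filt):
    """按前文定義做“卷積”(實為相關):零填充后在每個位置取點積。"""
    k = len(filt)
    pad = k // 2
    padded = np.concatenate([np.zeros(pad), np.asarray(image, float), np.zeros(pad)])
    return np.array([np.dot(filt, padded[i:i + k]) for i in range(len(image))])

X = [0, 0, 1, 1, 1, 0, 1, 0, 0, 0]   # 假設的二值圖像
print(conv1d(X, [-1, 1]))            # F1:輸出為 1 的位置正是 0→1 跳變(“左邊緣”)處
print(conv1d(X, [-1, 1, -1]))        # F2:輸出為 1 的位置是兩側(cè)都為 0 的孤立亮點
```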

Two-dimensional versions of filters like these are thought to be found in the visual cortex of all mammalian brains. Similar patterns arise from statistical analysis of natural images. Computer vision people used to spend a lot of time hand-designing filter banks. A filter bank is a set of sets of filters, arranged as shown in the diagram below.
人們認為,在所有哺乳動物大腦的視覺皮層中都能找到類似這些濾波器的二維版本。對自然圖像做統(tǒng)計分析也會得到類似的模式。計算機視覺領域的人過去花了大量時間手工設計濾波器組。濾波器組是若干組濾波器的集合,排列如下圖所示。

All of the filters in the first group are applied to the original image; if there are k such filters, then the result is k new images, which are called channels. Now imagine stacking all these new images up so that we have a cube of data, indexed by the original row and column indices of the image, as well as by the channel. The next set of filters in the filter bank will generally be three-dimensional: each one will be applied to a sub-range of the row and column indices of the image and to all of the channels.
第一組中的所有濾波器都應用于原始圖像;如果有k個這樣的濾波器,那么結(jié)果就是k幅新圖像,它們被稱為通道。現(xiàn)在想象把所有這些新圖像堆疊起來,這樣我們就有了一個數(shù)據(jù)立方體,由圖像原始的行索引、列索引以及通道索引來索引。濾波器組中的下一組濾波器通常是三維的:每個濾波器將應用于圖像行列索引的一個子范圍以及全部通道。
These 3D chunks of data are called tensors( We will use a popular piece of neural-network software called Tensorflow because it makes operations on tensors easy). The algebra of tensors is fun, and a lot like matrix algebra, but we won’t go into it in any detail.
這些3D數(shù)據(jù)塊被稱為張量(我們將使用一種流行的神經(jīng)網(wǎng)絡軟件TensorFlow,因為它使張量運算變得容易)。張量代數(shù)很有趣,和矩陣代數(shù)很相似,但我們不會詳細討論它。
Here is a more complex example of two-dimensional filtering. We have two 3 × 3 filters in the first layer, f1 and f2. You can think of each one as “l(fā)ooking” for three pixels in a row, f1 vertically and f2 horizontally . Assuming our input image is n × n, then the result of filtering with these two filters is an n × n × 2 tensor. Now we apply a tensor filter (hard to draw!) that “l(fā)ooks for” a combination of two horizontal and two vertical bars (now represented by individual pixels in the two channels), resulting in a single final n × n image. When we have a color image as input, we treat it as having 3 channels, and hence as an n×n×3 tensor.
這里是一個更復雜的二維濾波示例。第一層中有兩個3×3濾波器f1和f2??梢园衙恳粋€都看作在“尋找”連成一線的三個像素:f1找豎直方向的,f2找水平方向的。假設輸入圖像是n×n的,那么用這兩個濾波器濾波的結(jié)果就是一個n×n×2的張量?,F(xiàn)在我們再應用一個張量濾波器(很難畫出來?。?,它“尋找”兩條水平條和兩條豎直條的組合(此時它們由兩個通道中的單個像素表示),從而生成最終的一幅n×n圖像。當輸入是彩色圖像時,我們把它視為具有3個通道(RGB顏色通道,一幅RGB色彩模式的圖像含有3個原色通道,分別是紅原色通道、綠原色通道和藍原色通道),因此視為n×n×3的張量。

We are going to design neural networks that have this structure. Each “bank” of the filter bank will correspond to a neural-network layer. The numbers in the individual filters will be the “weights” (plus a single additive bias or offset value for each filter) of the network, which we will train using gradient descent. What makes this interesting and powerful (and somewhat confusing at first) is that the same weights are used many many times in the computation of each layer. This weight sharing means that we can express a transformation on a large image with relatively few parameters; it also means we’ll have to take care in figuring out exactly how to train it!
我們將設計具有這種結(jié)構(gòu)的神經(jīng)網(wǎng)絡。濾波器組的每一“組”對應一個神經(jīng)網(wǎng)絡層。單個濾波器中的那些數(shù)字就是網(wǎng)絡的“權重”(再加上每個濾波器的一個加性偏置或偏移值),我們將用梯度下降來訓練它們。它之所以有趣而強大(起初也有些令人困惑),是因為在每一層的計算中同樣的權重被使用了許多許多次。這種權重共享意味著我們可以用相對較少的參數(shù)表達作用在大圖像上的變換;這也意味著我們必須仔細考慮究竟該如何訓練它!
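權重共享帶來的參數(shù)節(jié)省可以粗算一下(圖像和濾波器的尺寸都是假設的):

```python
n = 64                                # 假設的圖像邊長
k = 5                                 # 假設的濾波器大小
fully_connected = (n * n) * (n * n)   # 全連接:每個輸出像素連到每個輸入像素
convolutional = k * k + 1             # 卷積:一個 k×k 濾波器加一個偏置,處處復用
print(fully_connected)                # → 16777216
print(convolutional)                  # → 26
```

同樣是把一幅64×64圖像映射到同樣大小的輸出,一個全連接層需要一千六百多萬個權重,而一個5×5的卷積濾波器連偏置只有26個。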
We will define a filter layer l formally with:(For simplicity , we are assuming that all images and filters are square (having the same number of rows and columns). That is in no way necessary, but is usually fine and definitely simplifies our notation.)
我們將用以下形式定義過濾層l:(為了簡單起見,我們假設所有圖像和過濾器都是正方形的(具有相同的行數(shù)和列數(shù))。這絕對不是必要的,但通常很好,而且絕對簡化了我們的表示法。)

? 濾波器個數(shù) $m^l$(此處每個濾波器都是三維的);
? 濾波器大小 $k^l \times k^l \times m^{l-1}$,外加1個偏置值(對于每個濾波器);
? 步長 $s^l$,是我們把濾波器應用到圖像上的間隔;到目前為止的所有示例中我們都使用了1的步長,但如果我們“跳著”僅在圖像的奇數(shù)索引處應用濾波器,那么它的步長就是2(并生成一半大小的結(jié)果圖像);
? 輸入張量大小 $n^{l-1} \times n^{l-1} \times m^{l-1}$;
? 填充:$p^l$ 是我們在輸入邊緣周圍添加的額外像素數(shù)(通常取值0)。對于大小為 $n^{l-1} \times n^{l-1} \times m^{l-1}$ 的輸入,帶填充的有效輸入大小變?yōu)?$(n^{l-1}+2p^l) \times (n^{l-1}+2p^l) \times m^{l-1}$。

該層將產(chǎn)生大小為 $n^l \times n^l \times m^l$ 的輸出張量,其中 $n^l = \lceil (n^{l-1} + 2p^l - (k^l - 1)) / s^l \rceil$(這里 $\lceil \cdot \rceil$ 是向上取整函數(shù))。權重就是定義濾波器的那些值:一共有 $m^l$ 個不同的 $k^l \times k^l \times m^{l-1}$ 權重張量;另外每個濾波器通??梢詭б粋€偏置項,即每個濾波器還多一個權重值。帶偏置的濾波器的運算方式與上面的濾波器示例相同,只是在輸出上再加上偏置。例如,如果給上面的濾波器F2加上0.5的偏置項,輸出將是 (?0.5, 0.5, ?0.5, 1.5, ?1.5, 1.5, ?0.5, 0.5),而不是 (?1, 0, ?1, 1, ?2, 1, ?1, 0)。
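輸出大小公式可以寫成一個小函數(shù)來驗算(這只是一個草稿,函數(shù)名是隨手起的):

```python
import math

def conv_output_size(n_in, k, stride, pad):
    """n_out = ceil((n_in + 2*pad - (k - 1)) / stride),即上文的輸出大小公式。"""
    return math.ceil((n_in + 2 * pad - (k - 1)) / stride)

print(conv_output_size(10, k=3, stride=1, pad=0))  # → 8:無填充時 3×3 濾波器把每邊縮小 2
print(conv_output_size(10, k=3, stride=1, pad=1))  # → 10:k 為奇數(shù)時取 pad=(k-1)//2 可保持大小不變
print(conv_output_size(10, k=3, stride=2, pad=1))  # → 5:步長 2 大致把尺寸減半
```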
This may seem complicated, but we get a rich class of mappings that exploit image structure and have many fewer weights than a fully connected layer would.
這看起來可能很復雜,但我們由此得到了一類豐富的映射:它們利用了圖像結(jié)構(gòu),而且權重比全連接層少得多。
2 Max Pooling
It is typical to structure filter banks into a pyramid (Both in engineering and in nature), in which the image sizes get smaller in successive layers of processing. The idea is that we find local patterns, like bits of edges in the early layers, and then look for patterns in those patterns, etc. This means that, effectively, we are looking for patterns in larger pieces of the image as we apply successive filters. Having a stride greater than one makes the images smaller, but does not necessarily aggregate information over that spatial range.
典型的做法是將濾波器組構(gòu)造成金字塔形(在工程上和自然界中都是如此),使圖像大小在連續(xù)的處理層中逐漸變小。想法是:在前面的層中找到局部模式(比如邊緣片段),然后在這些模式中再尋找模式,依此類推。這意味著,隨著我們應用一個個后續(xù)濾波器,我們實際上是在圖像中越來越大的區(qū)域里尋找模式。步長大于1會使圖像變小,但不一定會聚合該空間范圍內(nèi)的信息。
Another common layer type, which accomplishes this aggregation, is max pooling. A max pooling layer operates like a filter, but has no weights. You can think of it as a pure functional layer, like a ReLU layer in a fully connected network. It has a filter size, as in a filter layer, but simply returns the maximum value in its field (We sometimes use the term receptive field or just field to mean the area of an input image that a filter is being applied to). Usually, we apply max pooling with the following traits:
? stride > 1, so that the resulting image is smaller than the input image; and
? k >= stride, so that the whole image is covered.
實現(xiàn)這種聚合的另一種常見層類型是最大池化。最大池化層的操作類似于濾波器,但沒有權重??梢詫⑵湟暈榧兒瘮?shù)層,就像全連接網(wǎng)絡中的ReLU層。它像濾波器層一樣有一個濾波器大小,但只是返回其“場”(我們有時用術語“感受野”或簡稱“場”來表示濾波器所作用的輸入圖像區(qū)域)內(nèi)的最大值。通常,我們應用的最大池化具有以下特征:
? 步長>1,使得到的圖像小于輸入圖像;
? k>=步長,以便覆蓋整幅圖像。
As a result of applying a max pooling layer, we don’t keep track of the precise location of a pattern. This helps our filters to learn to recognize patterns independent of their location.
Consider a max pooling layer of stride = k = 2. This would map a 64 × 64 × 3 image to a 32 × 32 × 3 image. Note that max pooling layers do not have additional bias or offset values.
由于應用了最大池化層,我們不再記錄模式的精確位置。這有助于濾波器學會識別與位置無關的模式。
考慮一個步長=k=2的最大池化層。它會把64×64×3的圖像映射為32×32×3的圖像。請注意,最大池化層沒有額外的偏置或偏移值。
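下面是最大池化的一個簡單草稿實現(xiàn)(單通道;多通道圖像對每個通道獨立做同樣的操作,所以64×64×3會映射為32×32×3):

```python
import numpy as np

def max_pool(image, k=2, stride=2):
    """最大池化:像濾波器一樣滑動,但沒有權重,只取每個窗口內(nèi)的最大值。"""
    n = image.shape[0]
    out = (n - k) // stride + 1
    return np.array([[image[i*stride:i*stride+k, j*stride:j*stride+k].max()
                      for j in range(out)]
                     for i in range(out)])

x = np.arange(16).reshape(4, 4)             # 一幅隨手構(gòu)造的 4×4 單通道“圖像”
print(max_pool(x))                          # 4×4 -> 2×2,每個輸出是一個 2×2 塊的最大值
print(max_pool(np.zeros((64, 64))).shape)   # → (32, 32)
```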
3 Typical architecture
Here is the form of a typical convolutional network:
以下是典型卷積網(wǎng)絡的形式:

After each filter layer there is generally a ReLU layer; there may be multiple filter/ReLU layers, then a max pooling layer, then some more filter/ReLU layers, then max pooling. Once the output is down to a relatively small size, there is typically a last fully connected layer, leading into an activation function such as softmax that produces the final output. The exact design of these structures is an art—there is not currently any clear theoretical (or even systematic empirical) understanding of how these various design choices affect overall performance of the network.
在每個濾波層之后通常有一個ReLU層;可能有多個濾波器/ReLU層,然后是一個最大池化層,然后又是一些濾波器/ReLU層,再接最大池化。一旦輸出縮小到相對較小的尺寸,通常會有最后一個全連接層,接到一個產(chǎn)生最終輸出的激活函數(shù)(例如softmax)上。這些結(jié)構(gòu)的具體設計是一門藝術——目前還沒有任何清晰的理論(甚至系統(tǒng)的經(jīng)驗)認識,能說明這些不同的設計選擇如何影響網(wǎng)絡的整體性能。
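可以用前面的輸出大小公式粗略走一遍一個典型結(jié)構(gòu)中各層的形狀變化(各層的尺寸都是假設的,只為示意):

```python
import math

def conv_out(n, k, s, p):
    # 前文的輸出大小公式:ceil((n + 2p - (k - 1)) / s)
    return math.ceil((n + 2 * p - (k - 1)) / s)

n, channels = 64, 3                            # 假設輸入:64×64×3 的彩色圖像
n = conv_out(n, k=5, s=1, p=2); channels = 8   # 濾波層 + ReLU:64×64×8(ReLU 不改變形狀)
n = conv_out(n, k=2, s=2, p=0)                 # 最大池化:32×32×8
n = conv_out(n, k=5, s=1, p=2); channels = 16  # 濾波層 + ReLU:32×32×16
n = conv_out(n, k=2, s=2, p=0)                 # 最大池化:16×16×16
flat = n * n * channels                        # 展平后接最后的全連接層
print(n, channels, flat)                       # → 16 16 4096
```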
The critical point for us is that this is all just a big neural network, which takes an input and computes an output. The mapping is a differentiable function of the weights, which means we can adjust the weights to decrease the loss by performing gradient descent, and we can compute the relevant gradients using back-propagation! (Well, the derivative is not continuous, both because of the ReLU and the max pooling operations, but we ignore that fact.)
對我們來說,關鍵點是這是一個大的神經(jīng)網(wǎng)絡,它接受輸入并計算輸出。映射是權重的可微函數(shù)(由于ReLU和最大池操作,導數(shù)不是連續(xù)的,但我們忽略了這一事實。),這意味著我們可以通過執(zhí)行梯度下降來調(diào)整權重以減少損失,并且我們可以使用反向傳播來計算相關的梯度!
Let’s work through a very simple example of how back-propagation can work on a convolutional network. The architecture is shown below. Assume we have a one-dimensional single-channel image, of size n × 1 × 1 and a single k × 1 × 1 filter (where we omit the filter bias) in the first convolutional layer. Then we pass it through a ReLU layer and a fully-connected layer with no additional activation function on the output.
讓我們通過一個非常簡單的例子來說明反向傳播如何在卷積網(wǎng)絡上工作。架構(gòu)如下所示。假設我們有一幅大小為n×1×1的一維單通道圖像,以及第一個卷積層中的單個k×1×1濾波器(這里省略了濾波器的偏置)。然后我們讓它通過一個ReLU層和一個全連接層,輸出端沒有額外的激活函數(shù)。

For simplicity assume k is odd, let the input image X = A0, and assume we are using squared loss. Then we can describe the forward pass as follows
為簡單起見,假設k是奇數(shù),令輸入圖像X=A0,并假設我們使用平方損失。然后我們可以如下描述前向傳遞
$$\begin{aligned} Z^1_i &= {W^1}^T \cdot A^0_{[i-\lfloor k/2 \rfloor : i+\lfloor k/2 \rfloor]} \\ A^1 &= \mathrm{ReLU}(Z^1) \\ A^2 &= {W^2}^T A^1 \\ L(A^2, y) &= (A^2-y)^2 \end{aligned}$$
How do we update the weights in filter W1?
我們?nèi)绾胃逻^濾器W1中的權重?
