Contrastive Learning Paper Survey [Paper Close Reading]

Stage 1 - A hundred flowers bloom
InstDisc (2018) - Proposes instance discrimination and a memory bank for contrastive learning [01:48]
Title: Unsupervised Feature Learning via Non-Parametric Instance Discrimination

Every instance is treated as its own class.

ImageNet has 1.28 million images, so the memory bank is a 1,280,000 × 128 matrix (one 128-dimensional feature per image).
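As a quick sanity check on the memory bank's footprint, the sketch below computes its size. It assumes each entry is stored as a 32-bit float; the storage type is an assumption, not something stated in these notes.

```python
# Back-of-the-envelope size of the InstDisc memory bank.
# Assumption: features are stored as 32-bit floats (4 bytes each).
num_images = 1_280_000   # ImageNet training images, as stated above
feature_dim = 128        # one 128-d feature per image
bytes_per_float = 4      # assumed float32 storage

num_entries = num_images * feature_dim
size_bytes = num_entries * bytes_per_float
size_mib = size_bytes / 2**20

print(f"{num_entries:,} floats, about {size_mib:.0f} MiB")
# prints: 163,840,000 floats, about 625 MiB
```

At this size the whole bank fits comfortably in the memory of a single GPU, which is what makes storing one feature per image practical.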

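Main contributions:
- Proposes the instance discrimination pretext task
- Performs contrastive learning with this pretext task and the NCE loss, achieving good unsupervised representation learning results
- Proposes the memory bank, a data structure for storing a large number of negative samples
- Proposes a momentum-based update (proximal regularization: a constraint added to training so that the stored features change smoothly; the later MoCo follows the same idea)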

temperature $\tau=0.07$, number of negative samples $m=4096$, 200 epochs, SGD with momentum, batch size 256, learning rate 0.03
A milestone work that played a crucial role in advancing later contrastive learning research.
InvaSpread (CVPR 2019) - End-to-end contrastive learning with a single encoder [07:01]
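To make the memory bank idea concrete, here is a minimal NumPy sketch, not the authors' implementation. It replaces the NCE loss with a plain softmax over one positive and a handful of sampled negatives. The toy sizes, the helper names (`l2_normalize`, `instance_discrimination_loss`, `update_memory_bank`) and the blending weight of 0.5 are all assumptions made for illustration; only $\tau=0.07$ comes from the settings above.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def instance_discrimination_loss(features, indices, memory_bank,
                                 num_negatives, tau, rng):
    """Softmax loss over 1 positive + sampled negatives drawn from the bank.

    features:    (B, D) L2-normalized encoder outputs for the current batch
    indices:     (B,)   dataset index of each image (its own "class")
    memory_bank: (N, D) L2-normalized stored features, one row per image
    """
    batch_size = features.shape[0]
    # Positive logit: similarity with the image's own slot in the bank.
    pos = np.sum(features * memory_bank[indices], axis=1, keepdims=True)  # (B, 1)
    # Negative logits: similarities with randomly sampled bank rows.
    # Simplification: a sampled row may occasionally be the positive itself.
    neg_idx = rng.integers(0, memory_bank.shape[0],
                           size=(batch_size, num_negatives))
    neg = np.einsum("bd,bkd->bk", features, memory_bank[neg_idx])         # (B, K)
    logits = np.concatenate([pos, neg], axis=1) / tau
    # Cross-entropy with the positive in column 0 (shifted for stability).
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return -log_prob_pos.mean()

def update_memory_bank(memory_bank, indices, features, momentum=0.5):
    """Blend fresh features into their slots, then re-normalize (hypothetical helper)."""
    blended = momentum * memory_bank[indices] + (1.0 - momentum) * features
    memory_bank[indices] = l2_normalize(blended)

rng = np.random.default_rng(0)
num_images, dim, batch_size = 1000, 128, 8   # toy sizes, not 1,280,000 images
memory_bank = l2_normalize(rng.standard_normal((num_images, dim)))
features = l2_normalize(rng.standard_normal((batch_size, dim)))  # stand-in encoder output
indices = rng.choice(num_images, size=batch_size, replace=False)

loss = instance_discrimination_loss(features, indices, memory_bank,
                                    num_negatives=64, tau=0.07, rng=rng)
update_memory_bank(memory_bank, indices, features)
print(f"loss = {loss:.3f}")
```

The point of the design is that negatives cost only a table look-up: no extra forward passes are needed, so the number of negatives can be far larger than the batch size.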
Title: Unsupervised Embedding Learning via Invariant and Spreading Instance Feature

Key feature: no extra data structure is needed to store a large number of negatives; positives and negatives come from the same minibatch (it can be seen as a precursor of SimCLR).

invariant and spreading
batchsize: 256, positive samples: 256, negative samples: (256-1)*2
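Pretext task: instance discrimination
Drawing positives and negatives from the same minibatch allows a single encoder to be trained end to end, but the batch size must then be large enough to keep performance up.
The results fall short because there are too few negatives: the dictionary formed by the negative samples is not large enough.
CPC v1 (2019) - Contrastive predictive coding; works for images, speech, text, and reinforcement learning [10:30]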
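The sample counts quoted above can be checked with a small mask computation. This sketch assumes each image contributes exactly two augmented views, which is how the (256-1)*2 figure arises; the variable names are illustrative.

```python
import numpy as np

batch_size = 256
# Two augmented views per image -> 2 * batch_size embeddings in total.
# image_id[i] is the source image of embedding i.
image_id = np.concatenate([np.arange(batch_size), np.arange(batch_size)])

same_image = image_id[:, None] == image_id[None, :]
not_self = ~np.eye(2 * batch_size, dtype=bool)

positive_mask = same_image & not_self   # the other view of the same image
negative_mask = ~same_image             # every view of every other image

num_positive_pairs = int(positive_mask.sum()) // 2    # unordered pairs
negatives_per_anchor = int(negative_mask[0].sum())
print(num_positive_pairs, negatives_per_anchor)  # prints: 256 510
```

With only 510 negatives per anchor, the dictionary is tiny compared with the 4096 negatives InstDisc samples from its memory bank.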
Title: Representation Learning with Contrastive Predictive Coding

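Pretext task: predictive
The inputs at the current and previous time steps are each fed to the encoder $g_{\text{enc}}$; the resulting features are fed to an autoregressive model $g_{\text{ar}}$ (e.g. an RNN or LSTM), which produces the output at the current time step, the context representation $c_t$.
If the learned context representation is good enough, i.e. it captures the information from the current and all previous time steps, it should be able to make reasonable predictions about future time steps.
positive samples: feature outputs at future time steps; negative samples: feature outputs at arbitrarily chosen time steps
Key feature: a general-purpose architecture that applies to speech, text, and images (treat the image patches as a sequence); very flexible.
CMC (2020) - Contrastive learning across multiple views [13:00]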
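The pipeline described above can be sketched in a few lines of NumPy. Everything here is a toy stand-in: random vectors replace the $g_{\text{enc}}$ outputs, a single-layer tanh RNN with untrained random weights replaces $g_{\text{ar}}$, and the linear prediction map `W_k` and all sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, C = 12, 16, 32     # sequence length, feature dim, context dim (toy sizes)
t, k = 7, 2              # current step, and how many steps ahead to predict

# Stand-in for the g_enc outputs: one feature vector per time step.
z = rng.standard_normal((T, D))

# Stand-in for g_ar: a single-layer tanh RNN with random, untrained weights.
W_in = rng.standard_normal((C, D)) * 0.1
W_rec = rng.standard_normal((C, C)) * 0.1
c = np.zeros(C)
for step in range(t + 1):          # consumes z[0..t]: current and past steps only
    c = np.tanh(W_in @ z[step] + W_rec @ c)
c_t = c                            # context representation at time t

# Prediction for step t+k: a linear map applied to the context.
W_k = rng.standard_normal((D, C)) * 0.1
prediction = W_k @ c_t             # (D,)

# Score every time step against the prediction. z[t+k] is the positive;
# all other time steps act as negatives.
scores = z @ prediction            # (T,)
positive = t + k
log_softmax = scores - scores.max()
log_softmax = log_softmax - np.log(np.exp(log_softmax).sum())
info_nce = -log_softmax[positive]
print(f"InfoNCE = {info_nce:.3f}")
```

Minimizing this quantity rewards a context $c_t$ from which the true future feature can be picked out among the distractors, which is the predictive pretext task in contrastive form.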
Title: Contrastive Multiview Coding

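All views of the same object can serve as positive samples for one another.
The drawback is that too many encoders are needed.
It demonstrates not only the flexibility of contrastive learning but also the feasibility of multi-view, multi-modal learning.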
Abstract:
Humans view the world through many sensory channels, e.g., the long-wavelength light channel, viewed by the left eye, or the high-frequency vibrations channel, heard by the right ear. Each view is noisy and incomplete, but important factors, such as physics, geometry, and semantics, tend to be shared between all views (e.g., a "dog" can be seen, heard, and felt).
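Paraphrase: people perceive the world through many sensors; the eyes and the ears, for example, each supply a different signal to the brain. Every view is noisy and possibly incomplete, but the most important information (basic physical laws, geometry, semantics) is shared across all views: a dog can be seen, heard, or felt.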
The classic hypothesis: a powerful representation is one that models view-invariant factors.
A powerful feature is view-invariant: whichever view is observed (seeing a dog or hearing it bark), the feature should identify it as a dog.
Learn a representation that aims to maximize mutual information between different views of the same scene but is otherwise compact.
Learning objective: maximize the mutual information between all views.
Findings: the contrastive loss outperforms a popular alternative based on cross-view prediction, and the more views we learn from, the better the resulting representation captures underlying scene semantics.
Summary
- Pretext tasks: instance discrimination, predictive, multi-view, multi-modal
- Objective functions: NCE, InfoNCE, and other variants
- Model architectures: one encoder + memory bank (InstDisc); one encoder (InvaSpread); one encoder + one autoregressive model (CPC); multiple encoders (CMC)
- Task types: images, audio, text, reinforcement learning, etc.
Stage 2 - The two giants of CV
MoCov1 (CVPR 2020) - Unsupervised training also works well [18:28]
Title: Momentum Contrast for Unsupervised Visual Representation Learning

Frames all earlier contrastive learning methods as a dictionary look-up problem.
Main contributions:
- A queue, which decouples the queue size from the batch size, so that a large and consistent dictionary can be maintained
- A momentum encoder
SimCLRv1 (ICML 2020) - Simple contrastive learning (data augmentation + MLP head + large batches and long training) [23:00]
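The two contributions can be illustrated with a small NumPy sketch. `FeatureQueue` and `momentum_update` are hypothetical names, the queue size of 4096 is a toy value, and parameters are represented as a plain list of arrays rather than a real network; the momentum value 0.999 is used only as an illustrative setting.

```python
import numpy as np

class FeatureQueue:
    """Fixed-size FIFO of key features; the oldest batch is overwritten first."""

    def __init__(self, size, dim, rng):
        self.keys = rng.standard_normal((size, dim))
        self.keys /= np.linalg.norm(self.keys, axis=1, keepdims=True)
        self.ptr = 0

    def enqueue(self, new_keys):
        n = new_keys.shape[0]
        # Keeping the queue size a multiple of the batch size avoids wrap-around.
        assert self.keys.shape[0] % n == 0
        self.keys[self.ptr:self.ptr + n] = new_keys
        self.ptr = (self.ptr + n) % self.keys.shape[0]

def momentum_update(query_params, key_params, m):
    """key <- m * key + (1 - m) * query, applied parameter by parameter."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

rng = np.random.default_rng(0)
queue = FeatureQueue(size=4096, dim=128, rng=rng)   # dictionary size (toy value)
batch_keys = rng.standard_normal((256, 128))        # batch of 256, independent of 4096
queue.enqueue(batch_keys)
queue.enqueue(batch_keys)

query_params = [np.ones(3)]     # stand-in for the query encoder's weights
key_params = [np.zeros(3)]      # stand-in for the momentum encoder's weights
key_params = momentum_update(query_params, key_params, m=0.999)
print(queue.ptr, key_params[0])
```

Because the key encoder moves only slightly per step, keys enqueued many iterations apart are still produced by nearly the same encoder, which is what keeps a large dictionary consistent.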
Title: A Simple Framework for Contrastive Learning of Visual Representations


positive samples: batchsize, negative samples: 2*(batchsize-1)
The encoder $f(\cdot)$ shares its weights across the two views.
A projection head $g(\cdot)$ is added; it is used only during training, not for downstream tasks.
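A minimal NumPy sketch of this setup follows, assuming the normalized temperature-scaled cross-entropy form of the loss. The linear-plus-ReLU "encoder", the additive noise used as a stand-in for data augmentation, the temperature of 0.1 and all sizes are illustrative assumptions.

```python
import numpy as np

def nt_xent(z1, z2, tau):
    """Contrastive loss for two batches of projections.

    Row i of z1 and row i of z2 form a positive pair; the remaining
    2 * (N - 1) rows are negatives for that anchor.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau                                  # (2N, 2N)
    np.fill_diagonal(sim, -np.inf)                       # never contrast with itself
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])  # i <-> i + N
    sim = sim - sim.max(axis=1, keepdims=True)
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 32))                     # toy batch of 8 "images"
view1 = x + 0.1 * rng.standard_normal(x.shape)       # noise as stand-in augmentation
view2 = x + 0.1 * rng.standard_normal(x.shape)

W_f = rng.standard_normal((32, 64)) * 0.1            # shared encoder weights
W_g1 = rng.standard_normal((64, 64)) * 0.1           # projection head, layer 1
W_g2 = rng.standard_normal((64, 16)) * 0.1           # projection head, layer 2

def f(v):
    """Encoder: the same weights are applied to both views."""
    return np.maximum(v @ W_f, 0.0)

def g(h):
    """Projection head: used for the loss only, discarded afterwards."""
    return np.maximum(h @ W_g1, 0.0) @ W_g2

h1, h2 = f(view1), f(view2)      # representations kept for downstream tasks
loss = nt_xent(g(h1), g(h2), tau=0.1)
print(f"loss = {loss:.3f}")
```

Downstream tasks would consume `h1` and `h2`; the outputs of `g` exist only to feed the loss.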

Main contributions:
- Composition of multiple data augmentation operations is crucial in defining the contrastive prediction tasks that yield effective representations. In addition, unsupervised contrastive learning benefits from stronger data augmentation than supervised learning. In short: more data augmentation.
- Introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations. In short: a nonlinear transformation, i.e. an MLP layer added after the encoder.
- Representation learning with contrastive cross entropy loss benefits from normalized embeddings and an appropriately adjusted temperature parameter. In short: normalization + a temperature parameter.
- Contrastive learning benefits from larger batch sizes and longer training compared to its supervised counterpart. Like supervised learning, contrastive learning benefits from deeper and wider networks. In short: larger batch sizes and longer training.

Linear: projection head without ReLU
Non-Linear: the whole projection head
None: without projection head
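- Using the Non-Linear head improves results by more than ten points over using no projection head at all, a very significant gain (the reason is unclear)
- The final output dimension makes little difference, whether it is 32, 64, or 2048 (which is why most methods choose a low feature dimension: 128 is enough)
MoCov2 (2020) [31:00]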
Title: Improved Baselines with Momentum Contrastive Learning
MoCov1 + improvements from SimCLRv1


SimCLRv2 (2020) - Big self-supervised pretrained models are well suited to semi-supervised learning [36:15]
Title: Big Self-Supervised Models are Strong Semi-Supervised Learners

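Main improvements:
- Switches the backbone network from ResNet-50 to ResNet-152, with 3× wider channels and selective kernels (SK)
- Deepens the projection head from a one-layer MLP to a two-layer MLP (counting hidden layers; the paper counts linear layers and describes the change as 2 layers to 3)
- Uses a momentum encoder
Inspired by Google's Noisy Student work.
SwAV (NeurIPS 2020) - Contrastive learning with clustering [40:24]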
Title: Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

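Benefits of clustering:
- Contrasting against instance-level negatives requires thousands upon thousands of negatives, and even then it is only an approximation; contrasting against cluster centers instead, a few hundred or at most 3,000 cluster centers are enough on ImageNet
- Cluster centers carry clear semantic meaning. Randomly sampling instance-level negatives runs into problems such as sampling what are really positives, and class imbalance among the samples, so it is less effective than using cluster centers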

Multi-Crop: global features + local features

Recommended reading: Deep Cluster, Deep Cluster 2
CPC v2 (ICML 2020) [49:58]
Title: Data-Efficient Image Recognition with Contrastive Predictive Coding
Improvements:
- Model Capacity
- Layer Normalization
- Predicting lengths and directions
- Patch-based Augmentation
InfoMin (NeurIPS 2020) [50:30]
Title: What Makes for Good Views for Contrastive Learning
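Mainly an analytical follow-up work on minimizing mutual information. Its central point is that having the right amount of mutual information matters.
Proposes the InfoMin principle: beyond capturing the information shared between views, the representation should discard as much redundant information irrelevant to the downstream task as possible, so that the learned representation generalizes well.
Introduces three definitions: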
- Sufficient Encoder
- Minimal Sufficient Encoder
- Optimal Representation of a Task
Summary
Objective function: InfoNCE-like
Model: one encoder + projection head
Stronger data augmentation
Momentum encoder
Longer training
Accuracy on ImageNet approaches the supervised baseline
Stage 3 - No negative samples
BYOL (2020) - Contrastive learning without negative samples [52:31]
Title: Bootstrap your own latent: A new approach to self-supervised Learning

Objective function:
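For reference, the BYOL paper defines its loss as the mean squared error between the L2-normalized online prediction and the L2-normalized target projection, which reduces to a cosine-similarity form. The symbols below follow the paper's notation and are not otherwise defined in these notes: $q_\theta(z_\theta)$ is the online network's prediction, $z'_\xi$ is the target network's projection of the other view, and a bar denotes L2 normalization.

$$
\mathcal{L}_{\theta,\xi}
= \left\| \bar{q}_\theta(z_\theta) - \bar{z}'_\xi \right\|_2^2
= 2 - 2 \cdot \frac{\langle q_\theta(z_\theta),\, z'_\xi \rangle}{\left\| q_\theta(z_\theta) \right\|_2 \cdot \left\| z'_\xi \right\|_2}
$$

The loss is symmetrized by swapping the two views. Only the online parameters $\theta$ receive gradients, while the target parameters follow as an exponential moving average, $\xi \leftarrow \tau \xi + (1 - \tau)\theta$, where $\tau$ here is the decay rate rather than a temperature.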

A blog post about BYOL (https://generallyintelligent.com/research/2020-08-24-understanding-self-supervised-contrastive-learning) and the authors' response (https://arxiv.org/pdf/2010.10241.pdf)
SimSiam (CVPR 2021) - Siamese representation learning made simple [01:09:31]
Title: Exploring Simple Siamese Representation Learning


Does not use (i) negative sample pairs, (ii) large batches, or (iii) momentum encoders.
The stop-gradient mechanism is conjectured to be crucial.
Hypothesis: SimSiam is an implementation of an Expectation-Maximization (EM) like algorithm.
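The forward computation of a symmetrized negative-cosine loss of this kind can be sketched as below. NumPy has no automatic differentiation, so the stop-gradient appears only as a comment: in an autograd framework the target `z` would be detached before entering the loss. The function names and toy inputs are assumptions made for illustration.

```python
import numpy as np

def negative_cosine(p, z):
    """Negative cosine similarity between predictions p and targets z.

    z is meant to be a constant target: with autograd it would pass
    through a stop-gradient (detach) first. Forward values are unaffected.
    """
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return -(p * z).sum(axis=1).mean()

def symmetric_loss(p1, z1, p2, z2):
    """Each view's prediction is matched to the other view's projection."""
    return 0.5 * negative_cosine(p1, z2) + 0.5 * negative_cosine(p2, z1)

rng = np.random.default_rng(0)
z1 = rng.standard_normal((4, 16))   # projections of view 1 (toy values)
z2 = rng.standard_normal((4, 16))   # projections of view 2
p1 = rng.standard_normal((4, 16))   # predictor outputs for view 1
p2 = rng.standard_normal((4, 16))   # predictor outputs for view 2

loss_random = symmetric_loss(p1, z1, p2, z2)
loss_perfect = symmetric_loss(z2, z1, z1, z2)   # predictions equal their targets
print(f"{loss_random:.3f} {loss_perfect:.3f}")
```

The loss is bounded below by -1, reached when every prediction points in the same direction as its target. A constant output for all inputs also reaches that value, which is the collapse that the stop-gradient is credited with preventing.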


Barlow Twins (ICML 2021) [01:16:26]
Title: Barlow Twins: Self-Supervised Learning via Redundancy Reduction
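It performs neither contrast nor prediction; in essence it simply swaps in a different objective function.
Specifically, it computes a cross-correlation matrix and pushes it to be as close as possible to the identity matrix.
Stage 4 - Transformer-based
MoCov3 (ICCV 2021) - How to train self-supervised ViTs more stably [01:17:23]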
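A minimal NumPy sketch of such an objective follows. The split into a diagonal term and a weighted off-diagonal term follows the paper; the function name, the weight `lam=5e-3` and the toy sizes are illustrative assumptions.

```python
import numpy as np

def cross_correlation_loss(z1, z2, lam):
    """Push the cross-correlation matrix of two embedding batches toward identity."""
    n = z1.shape[0]
    # Standardize every feature dimension over the batch.
    z1 = (z1 - z1.mean(axis=0)) / z1.std(axis=0)
    z2 = (z2 - z2.mean(axis=0)) / z2.std(axis=0)
    c = z1.T @ z2 / n                                      # (D, D) cross-correlation
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()              # diagonal should be 1
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()    # the rest should be 0
    return on_diag + lam * off_diag, c

rng = np.random.default_rng(0)
z = rng.standard_normal((64, 8))                 # toy embeddings: batch 64, dim 8
loss_same, c_same = cross_correlation_loss(z, z, lam=5e-3)
loss_diff, _ = cross_correlation_loss(z, rng.standard_normal((64, 8)), lam=5e-3)
print(f"{loss_same:.4f} {loss_diff:.4f}")
```

When both inputs are the same batch, the diagonal of the matrix is exactly 1 and only the off-diagonal term contributes; unrelated batches are penalized mostly through the diagonal term.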
Title: An Empirical Study of Training Self-Supervised Vision Transformers



DINO - Transformers plus self-supervision also work very well in vision [01:23:10]
Title: Emerging Properties in Self-Supervised Vision Transformers



Conclusion [01:25:53]
