45 Hand-Picked Transformer Papers! Models, Architectures, and Training Methods, All in One Place!
Today, let's talk about the Transformer.
Thanks to the runaway success of ChatGPT, large models have been the hottest research direction in AI this year. The Transformer, the foundational work behind large models, is back in the spotlight, with new results appearing one after another. My blunt take: it's a fiercely competitive field.
For newcomers to AI, the Transformer is required learning; for students in every other AI subfield, it is a foundation you must master.
So this time I have compiled a collection of Transformer paper resources for you: 23 papers on models, 10 on architectures, 8 on post-pretraining processing, and 4 on training methods, to help beginners get up to speed quickly and help everyone else organize their own knowledge.
The paper list is as follows:
Scan the QR code to add Xiaoxiang and reply "精選45" ??
to get all 45 papers plus the code collection for free

1. Models (23)
GPT
Improving Language Understanding by Generative Pre-Training
GPT-2
Language Models are Unsupervised Multitask Learners
GPT-3
Language Models are Few-Shot Learners
GPT-3.5
Models referred to as "GPT-3.5"
GPT-4
GPT-4 Technical Report
GPT-NeoX
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
GPT-J
Pretrained Models
Gopher
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
AlphaCode
Competition-Level Code Generation with AlphaCode
RETRO
Improving language models by retrieving from trillions of tokens
Chinchilla
Training Compute-Optimal Large Language Models
Flamingo
Flamingo: a Visual Language Model for Few-Shot Learning
Gato
A Generalist Agent
Anthropic LM
A General Language Assistant as a Laboratory for Alignment
PaLM
PaLM: Scaling Language Modeling with Pathways
GLaM
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
LaMDA
LaMDA: Language Models for Dialog Applications
LLaMA
LLaMA: Open and Efficient Foundation Language Models
Switch
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
BLOOM
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Galactica
Galactica: A Large Language Model for Science
OPT
OPT: Open Pre-trained Transformer Language Models
GLM-130B
GLM-130B: AN OPEN BILINGUAL PRE-TRAINED MODEL
2. Architectures (10)
Multi-Query Attention
Fast Transformer Decoding: One Write-Head is All You Need
Sparse Attention
Generating Long Sequences with Sparse Transformers
Mixture of Experts
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
UNIFIED SCALING LAWS FOR ROUTED LANGUAGE MODELS
Efficient Large Scale Language Modeling with Mixtures of Experts
FlashAttention
FLASHATTENTION: Fast and Memory-Efficient Exact Attention with IO-Awareness
Encoder + Decoder
Attention Is All You Need
Parallel Attention
PaLM: Scaling Language Modeling with Pathways
RoPE
ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING
ALiBi
TRAIN SHORT, TEST LONG: ATTENTION WITH LINEAR BIASES ENABLES INPUT LENGTH EXTRAPOLATION
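ALiBi, the last architecture entry above, drops position embeddings entirely and instead adds a static, head-specific linear penalty to attention scores based on query-key distance. A minimal NumPy sketch (the function name and masking details are our illustration; the geometric slope schedule 2^(-8h/n) follows the paper for power-of-two head counts):

```python
import numpy as np

def alibi_bias(n_heads: int, seq_len: int) -> np.ndarray:
    """Illustrative ALiBi bias tensor of shape (n_heads, seq_len, seq_len).

    Head h (1-indexed) gets slope m_h = 2**(-8*h/n_heads); the attention
    score at (query i, key j) receives -m_h * (i - j) for j <= i, and
    future positions (j > i) are masked with -inf for causal decoding.
    """
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]            # dist[i, j] = i - j
    bias = -slopes[:, None, None] * dist          # linear distance penalty
    return np.where(dist >= 0, bias, -np.inf)     # causal mask
```

The bias is added to the raw attention logits before the softmax; because the penalty grows smoothly with distance, the model extrapolates to sequences longer than those seen in training.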
3. Post-Pretraining Processing (8)
RLHF with the PPO algorithm
Deep Reinforcement Learning from Human Preferences
Learning to summarize from human feedback
Constitutional
Constitutional AI: Harmlessness from AI Feedback
Minerva
Solving Quantitative Reasoning Problems with Language Models
Codex
Evaluating Large Language Models Trained on Code
FeedME (SFT)
Training language models to follow instructions with human feedback
Fine-Tuning Language Models from Human Preferences
FLAN
FINETUNED LANGUAGE MODELS ARE ZERO-SHOT LEARNERS
4. Training Methods (4)
Hyperparameter Settings
Training Compute-Optimal Large Language Models
Scaling Laws for Neural Language Models
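The two scaling-law papers above address the same question: given a fixed compute budget, how big should the model be and how much data should it see? The Chinchilla paper ("Training Compute-Optimal Large Language Models") found that model size and training tokens should grow roughly in proportion, and a widely cited rule of thumb from its results is about 20 training tokens per parameter. A back-of-the-envelope helper (the 20:1 ratio is an approximation drawn from the paper's fits, not an exact law, and the function name is ours):

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal training-token budget (Hoffmann et al., 2022):
    train on roughly 20 tokens per model parameter."""
    return n_params * tokens_per_param

# Chinchilla itself: 70B parameters trained on ~1.4T tokens.
```

This matches Chinchilla's own configuration (70B parameters, ~1.4T tokens) and explains why it outperformed the much larger but under-trained Gopher at the same compute budget.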
Pretraining with Human Feedback
Pretraining Language Models with Human Preferences
MuP
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
