GATO Survey Report
GATO
Inspired by progress in large-scale language modelling, a similar approach is applied to build a single generalist model beyond the realm of text outputs. The agent, referred to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm, and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. The report describes the model and the data, and documents Gato's current capabilities.


In essence, it is next-token prediction: given the tokens so far, predict the next token.
Prediction Problem
GATO does not predict observations, only the tokens of the next action
Similar to Decision Transformers (but with no rewards in the sequence)
It purely imitates the behaviour of experts
Instead of one-hot task identifiers, prompt conditioning is used
Similar to the T5 setup: <prompt> + <sequence>
Prompt: tokens sampled from an episode of the same task, 50% of the time from the end of the episode and 50% uniformly sampled (see the sketch below)
goal-directed learning without rewards??
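A minimal sketch of how such prompt conditioning could be assembled during training; the helper name, the prompt length, and the episode/token layout are assumptions based on the description above, not the paper's code:

```python
import random

def build_training_sequence(episode_tokens, prompt_episode_tokens,
                            prompt_len=256, max_len=1024):
    # Prepend a prompt taken from another episode of the same task.
    # 50% of the time the prompt comes from the end of that episode
    # (goal-like conditioning), otherwise from a uniformly sampled position.
    if random.random() < 0.5:
        prompt = prompt_episode_tokens[-prompt_len:]
    else:
        start = random.randrange(max(1, len(prompt_episode_tokens) - prompt_len))
        prompt = prompt_episode_tokens[start:start + prompt_len]
    sequence = prompt + episode_tokens
    return sequence[:max_len]
```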


Details
A cross-entropy loss, as in supervised sequence modelling
Minimize L(θ, B) over training batches B
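The objective, reconstructed from the Gato paper's description, is a masked next-token cross-entropy over a batch B of sequences:

$$
\mathcal{L}(\theta, \mathcal{B}) = -\sum_{b=1}^{|\mathcal{B}|} \sum_{l=1}^{L} m(b, l)\,\log p_\theta\!\left(s_l^{(b)} \mid s_1^{(b)}, \ldots, s_{l-1}^{(b)}\right)
$$

where m(b, l) = 1 if token s_l^{(b)} is a text or action token and 0 otherwise, so observation tokens are never prediction targets.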

Tokenization
Data Type / Method / Ordering / Range:
• Text: SentencePiece with 32000 subwords; text order; token range [0, 32000)
• Images: split into non-overlapping 16×16 patches and processed as in ViT; raster order; each pixel scaled to [-1, 1] and divided by the square root of the patch size (i.e. 4)
• Discrete values (e.g. Atari actions): flattened into sequences of integers; row-major order; token range [0, 1024)
• Continuous values (e.g. proprioceptive inputs): flattened into sequences of floating-point values; row-major order; mu-law encoded to [-1, 1], discretized into 1024 uniform bins, then shifted to the token range [32000, 33024)
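A minimal sketch of the continuous-value path in the last row (mu-law companding, 1024 uniform bins, shift past the 32000-token text vocabulary). The helper names and the companding constants mu = 100, M = 256 are assumptions, not taken from the report above:

```python
import numpy as np

def mu_law_encode(x, mu=100.0, m=256.0):
    # Signed mu-law companding to roughly [-1, 1]; mu and m are assumed
    # constants (the classic telephony variant with mu = 255 is discussed
    # in the Mu-law Encoding section below).
    return np.sign(x) * np.log(np.abs(x) * mu + 1.0) / np.log(m * mu + 1.0)

def tokenize_continuous(values, num_bins=1024, offset=32000):
    # Map continuous values (e.g. proprioception) to integer tokens in
    # [offset, offset + num_bins): compand, clip, discretize, shift.
    x = mu_law_encode(np.asarray(values, dtype=np.float64))
    x = np.clip(x, -1.0, 1.0)
    bins = np.floor((x + 1.0) / 2.0 * num_bins).astype(np.int64)
    bins = np.clip(bins, 0, num_bins - 1)   # keep x = 1.0 in the last bin
    return bins + offset
```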
• Episodes are presented to the agent in order of time (timesteps).
• Timesteps in turn are presented in the following order:
  – Observations ([y_{1:k}, x_{1:m}, z_{1:n}]) are ordered lexicographically by key; each item is sequenced as follows:
    • Text tokens (y_{1:k}) are in the same order as the raw input text.
    • Image patch tokens (x_{1:m}) are in raster order.
    • Tensors (z_{1:n}) (such as discrete and continuous observations) are in row-major order.
  – Separator ('|'): a designated separator token is provided after the observations.
  – Actions (a_{1:A}) are tokenized as discrete or continuous values, in row-major order.
A full sequence of tokens is thus given as the concatenation of data from T timesteps:

s_{1:L} = [ y^1_{1:k}, x^1_{1:m}, z^1_{1:n}, '|', a^1_{1:A}, ..., y^T_{1:k}, x^T_{1:m}, z^T_{1:n}, '|', a^T_{1:A} ]

where L = T(k + m + n + 1 + A) is the total number of tokens.
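A sketch of how a single timestep could be flattened under this ordering; the helper names and the separator id are hypothetical, not from the report:

```python
SEPARATOR_TOKEN = 33024  # hypothetical id reserved for the '|' separator

def flatten_timestep(text_tokens, image_patch_tokens, tensor_tokens, action_tokens):
    # One timestep: observations (text, image patches, tensors),
    # then the separator, then the actions.
    return (list(text_tokens)            # y_{1:k}, raw text order
            + list(image_patch_tokens)   # x_{1:m}, raster order
            + list(tensor_tokens)        # z_{1:n}, row-major order
            + [SEPARATOR_TOKEN]          # '|'
            + list(action_tokens))       # a_{1:A}, row-major order

def flatten_episode(timesteps):
    # Concatenate T timesteps into a single sequence s_{1:L}.
    tokens = []
    for text, patches, tensors, actions in timesteps:
        tokens.extend(flatten_timestep(text, patches, tensors, actions))
    return tokens
```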
Embedding Inputs
A parameterized embedding function f(·; θ_e) is applied to each token
Text, discrete-valued, and continuous-valued observations or actions are embedded via a lookup table into a learned vector embedding space
Image patches are embedded using a ResNet block to obtain one vector per patch
A learnable position encoding vector is added to each embedding
Image Embedding
Similar to ViT
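A minimal sketch of the two embedding paths, assuming PyTorch; the class name, the sizes, and the single convolution standing in for the ResNet block are all assumptions:

```python
import torch
import torch.nn as nn

class TokenAndPatchEmbedding(nn.Module):
    """Sketch: lookup embedding for discrete tokens, a conv stand-in for
    the per-patch ResNet, and a learned position embedding added to both."""

    def __init__(self, vocab_size=33025, d_model=768, max_positions=1024):
        super().__init__()
        # 32000 text + 1024 discretized-value tokens + 1 separator (assumed layout)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # Stand-in for the ResNet block: maps each 3x16x16 patch to one vector.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.pos_embed = nn.Embedding(max_positions, d_model)

    def embed_tokens(self, token_ids):
        # token_ids: (B, L) integer tokens from the tokenizers above.
        positions = torch.arange(token_ids.shape[1], device=token_ids.device)
        return self.token_embed(token_ids) + self.pos_embed(positions)

    def embed_image(self, image):
        # image: (B, 3, H, W) with pixel values already scaled to [-1, 1].
        patches = self.patch_embed(image)              # (B, d_model, H/16, W/16)
        patches = patches.flatten(2).transpose(1, 2)   # raster order: (B, m, d_model)
        positions = torch.arange(patches.shape[1], device=image.device)
        return patches + self.pos_embed(positions)
```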


Tokenization + Embedding Pipeline (Image + Discrete actions)

Tokenization + Embedding Pipeline (Proprioception + Continuous actions)

Mu-law Encoding
Non-uniform quantization is used to address the problems of uniform quantization. The basic idea is to compress large signals while strongly amplifying small ones; because small-signal amplitudes receive more gain, the signal-to-noise ratio for small signals improves considerably. The commonly used companding schemes are the logarithmic A-law and μ-law. The μ-law compression formula is y = ln(1 + μx) / ln(1 + μ), where x is the normalized quantizer input and y is the normalized quantizer output. The larger the constant μ, the greater the companding gain for small signals; μ = 255 is most commonly used.
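A quick worked example with μ = 255 and a small input x = 0.01:

y = ln(1 + 255 · 0.01) / ln(1 + 255) = ln(3.55) / ln(256) ≈ 1.267 / 5.545 ≈ 0.23

so an input at 1% of full scale is mapped to roughly 23% of full scale, which is why small signals occupy many more quantizer bins than under uniform quantization.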
Local Position Embedding

Original Transformer:
neighbouring tokens have high similarity
more distant tokens have low similarity
but this changes once position encodings are added
This perhaps explains why the local position encodings for action tokens are all identical: different local position encodings would introduce a bias and change the meaning of the original action tokens.
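A minimal PyTorch sketch of that idea (not the paper's exact scheme; the class, sizes, and the single shared action slot are assumptions):

```python
import torch
import torch.nn as nn

class LocalPositionEmbedding(nn.Module):
    # Observation tokens get an embedding for their index within their own
    # observation; every action token shares one local position, so no
    # position-dependent bias is added to actions.

    def __init__(self, max_local_positions=512, d_model=768):
        super().__init__()
        self.local_pos = nn.Embedding(max_local_positions + 1, d_model)
        self.action_slot = max_local_positions   # single shared index for actions

    def forward(self, embeddings, local_indices, is_action):
        # embeddings: (B, L, D); local_indices: (B, L) long, position of each
        # token inside its observation; is_action: (B, L) bool mask.
        idx = local_indices.clone()
        idx[is_action] = self.action_slot
        return embeddings + self.local_pos(idx)
```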
Training Details
Hardware: 16×16 TPU v3 slice
Timesteps: 1M
Batch size: 512
Token sequence length: 1024
Training time: 4 days
Datasets

It can be seen that a large share of the sampling weight is on 3D gaming environments, with only 15% on text.
Training Procedure
Mimic expert trajectories from SOTA or near-SOTA agents
Train only on episodes whose return is at least 80% of the expert return (a filtering sketch follows below)
Decision Transformers, in contrast, can train on episodes of any quality, because the total reward obtained is given as the goal
Gato, however, cannot learn from bad samples, because it has no notion of reward in any form
So the fundamental problem with Gato is that it learns only from good samples
Learning only from good samples means it never sees how failures arise; if the agent is randomly initialized into a bad zone, it is likely to run into trouble, because it has never seen such situations before
RoboCat actually made significant improvements on this front
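A minimal sketch of the 80%-of-expert filtering rule mentioned above; the episode layout (a list of steps with a "reward" field) is an assumption:

```python
def filter_expert_episodes(episodes, expert_return, threshold=0.8):
    # Keep only episodes whose total return reaches at least
    # `threshold` (80%) of the expert return.
    kept = []
    for episode in episodes:
        total_return = sum(step["reward"] for step in episode)
        if total_return >= threshold * expert_return:
            kept.append(episode)
    return kept
```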
Future work: supposedly possible to learn via RL from scratch
extrinsic rewards: environment rewards
In sparse-reward settings this takes very long to learn (maybe add intrinsic reward modelling)
I am personally skeptical about learning from scratch
Is the general agent as good as the expert?

It varies by environment: in harder environments the performance is actually not that good, but overall it is acceptable.
Is GATO scalable?

The answer is yes
Normalized return:
For each task, compute the model's performance as a percentage of the expert score
Average the percentage scores across all tasks in a domain
Mean-aggregate the per-domain percentage scores across all domains
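A minimal sketch of that aggregation; the data layout (dicts keyed by task and domain) is an assumption:

```python
def normalized_return(scores, expert_scores, domains):
    # scores / expert_scores: task name -> return.
    # domains: domain name -> list of task names in that domain.
    domain_means = []
    for tasks in domains.values():
        pct = [100.0 * scores[t] / expert_scores[t] for t in tasks]
        domain_means.append(sum(pct) / len(pct))      # average within the domain
    return sum(domain_means) / len(domain_means)      # mean across domains
```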
Increasing tokens trained = increased performance
Increasing model size = increased performance
Can GATO generalize (zero-shot)?


Not too well on the held-out set; zero-shot transfer is a common problem for ML techniques in general
Generalizability experiments (few-shot)
Ideally, the agent should learn new tasks by conditioning on different prompts
but the sequence lengths of tokenized demonstrations are too long
and the maximum context length is insufficient to describe a task
Instead, the agent's parameters are fine-tuned on the new task, and the fine-tuned model's performance is evaluated in the environment
Three models are compared (a sketch of this setup follows the list):
same domain only data: pretrained only on data from the same domain as the task to be fine-tuned on
no control data: pretrained only on non-control data
scratch: no pretraining at all
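A hedged sketch of how the three pretraining regimes could be selected; the dataset attributes, TARGET_DOMAIN, and the variant handling are assumptions, not the paper's code:

```python
TARGET_DOMAIN = "meta_world"   # hypothetical domain of the held-out task

def pretraining_datasets(variant, all_datasets):
    # all_datasets is assumed to be a list of objects with a `.domain`
    # string and an `.is_control` flag.
    if variant == "scratch":
        return []                                              # no pretraining at all
    if variant == "no control data":
        return [d for d in all_datasets if not d.is_control]
    if variant == "same domain only data":
        return [d for d in all_datasets if d.domain == TARGET_DOMAIN]
    raise ValueError(f"unknown variant: {variant}")
```

Each variant is then fine-tuned on a small number of demonstrations of the held-out task and evaluated in the environment, as described above.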

Non-image data: if we do not have the right proprioception data, there is little use in training on other data
Image data: the situation is different; image data from other tasks can still be useful