

GATO調(diào)研報告

2023-06-27 20:50 Author: Chaton丫

GATO

Inspired by progress in large-scale language modeling, a similar approach is applied to building a single generalist model beyond the realm of text outputs. The agent, referred to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm, and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. The report describes the model and the data, and documents Gato's current capabilities.


本質(zhì)是一個token預(yù)測下一個token

Prediction Problem

  • GATO does not predict observations, only the tokens of the next action

Similar to Decision Transformers (except there are no rewards in the sequence)

It purely imitates the behavior of experts.

  • Instead of one-hot task identifiers, prompt conditioning is used

  • Similar to the T5 architecture: <prompt> + <sequence>

  • Prompt: tokens sampled from a demonstration episode, with 50% of prompts taken from the end of an episode and 50% sampled uniformly from within it (see the sampling sketch below)

In effect: goal-directed learning without rewards?
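To make the prompt conditioning concrete, here is a minimal sketch of the 50/50 sampling rule, assuming episodes are already tokenized; the function name `sample_prompt` and its arguments are illustrative, not the paper's code.

```python
import numpy as np

def sample_prompt(episode_tokens, prompt_len, rng=None):
    """Sample a prompt subsequence from a demonstration episode.

    Hypothetical sketch: 50% of the time the prompt is taken from the
    end of the episode (goal-like conditioning), 50% of the time it is
    a uniformly positioned window.
    """
    rng = rng or np.random.default_rng()
    n = len(episode_tokens)
    if rng.random() < 0.5:
        start = max(0, n - prompt_len)                        # tail of episode
    else:
        start = int(rng.integers(0, max(1, n - prompt_len)))  # uniform position
    return episode_tokens[start:start + prompt_len]
```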



Details

The training objective is a cross-entropy loss, as in supervised sequence modeling: minimize L(θ, B) over batches B of sequences,

$$\mathcal{L}(\theta,\mathcal{B}) = -\sum_{b=1}^{|\mathcal{B}|}\sum_{l=1}^{L} m(b,l)\,\log p_\theta\!\left(s_l^{(b)} \mid s_1^{(b)},\ldots,s_{l-1}^{(b)}\right)$$

where the mask m(b, l) is 1 only for text and action tokens, so the loss is applied only to tokens the model is expected to produce.

Tokenization


| Data Type | Method | Ordering | Range |
|---|---|---|---|
| Text | SentencePiece with 32000 subwords | Text order | [0, 32000) |
| Images | Split into non-overlapping 16×16 patches and processed as in ViT | Raster order | [-1, 1] for each pixel, divided by the square root of the patch size (i.e. 4) |
| Discrete values (e.g. Atari actions) | Flattened into sequences of integers | Row-major order | [0, 1024) |
| Continuous values (e.g. proprioceptive inputs) | Flattened into sequences of floating-point values | Row-major order | Mu-law encoded to [-1, 1], discretized into 1024 uniform bins, then shifted to [32000, 33024) |
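As a concrete illustration of the image row in the table, the following is a minimal sketch of patch extraction and normalization; the function name and the (H, W, C) uint8 input convention are assumptions.

```python
import numpy as np

def image_to_patch_tokens(image, patch=16):
    """Split an image into non-overlapping 16x16 patches in raster order.

    Pixels are scaled to [-1, 1] and divided by sqrt(patch size) = 4,
    following the tokenization table above. `image` is assumed to be an
    (H, W, C) uint8 array with H and W divisible by `patch`.
    """
    x = image.astype(np.float32) / 127.5 - 1.0  # scale pixels to [-1, 1]
    x = x / np.sqrt(patch)                      # divide by sqrt(16) = 4
    h, w, c = x.shape
    # raster order: left-to-right within a row of patches, then top-down
    return (x.reshape(h // patch, patch, w // patch, patch, c)
             .transpose(0, 2, 1, 3, 4)
             .reshape(-1, patch, patch, c))
```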

  • Episodes are presented to the agent in order of time (timesteps).

  • Timesteps in turn are presented in the following order:

    – Observations ([y1:k, x1:m, z1:n]) are ordered lexicographically by key; each item is sequenced as follows:

      • Text tokens (y1:k) are in the same order as the raw input text.

      • Image patch tokens (x1:m) are in raster order.

      • Tensors (z1:n) (such as discrete and continuous observations) are in row-major order.

    – Separator ('|'): a designated separator token is provided after observations.

    – Actions (a1:A) are tokenized as discrete or continuous values and are in row-major order.

A full sequence of tokens is thus given as the concatenation of data from T timesteps:

$$s_{1:L} = \big[\,y^1_{1:k},\; x^1_{1:m},\; z^1_{1:n},\; \text{'}|\text{'},\; a^1_{1:A},\;\ldots,\; y^T_{1:k},\; x^T_{1:m},\; z^T_{1:n},\; \text{'}|\text{'},\; a^T_{1:A}\,\big]$$

where L = T (k + m + n + 1 + A) is the total number of tokens.
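A minimal sketch of this interleaving, assuming each timestep arrives as a dict of pre-tokenized fields (the key names are illustrative):

```python
def build_token_sequence(timesteps, sep_token):
    """Concatenate T timesteps into one flat token sequence."""
    seq = []
    for t in timesteps:
        seq += t["text_tokens"]    # y_{1:k}, raw text order
        seq += t["image_tokens"]   # x_{1:m}, raster order
        seq += t["tensor_tokens"]  # z_{1:n}, row-major order
        seq.append(sep_token)      # '|' separator after observations
        seq += t["action_tokens"]  # a_{1:A}
    return seq                     # length L = T * (k + m + n + 1 + A)
```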

Embedding Inputs

  • A parameterized embedding function f(·; θe) is applied to each token

  • Text, discrete, and continuous-valued observations or actions are embedded via a lookup table into a learned vector embedding space

  • Image patches are embedded using a ResNet block to obtain one vector per patch

  • A learnable position encoding vector is added to each embedding (a sketch follows below)
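A minimal PyTorch sketch of this dispatch, with a single conv layer standing in for the paper's ResNet patch encoder (class and method names are assumptions):

```python
import torch
import torch.nn as nn

class TokenEmbedder(nn.Module):
    """Lookup table for discrete tokens, patch encoder for images,
    plus a learnable position encoding added to every embedding."""

    def __init__(self, vocab_size=33024, d_model=512, max_len=1024, patch=16):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, d_model)
        # stand-in for the ResNet block: one conv spanning a whole patch
        self.patch_enc = nn.Conv2d(3, d_model, kernel_size=patch)
        self.pos = nn.Parameter(torch.zeros(max_len, d_model))

    def embed_tokens(self, ids, offset=0):
        # ids: (seq_len,) integer token ids
        return self.lookup(ids) + self.pos[offset:offset + ids.shape[0]]

    def embed_patches(self, patches, offset=0):
        # patches: (num_patches, 3, patch, patch) normalized pixels
        e = self.patch_enc(patches).flatten(1)  # one vector per patch
        return e + self.pos[offset:offset + patches.shape[0]]
```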



  • Image Embedding

Similar to ViT

  • Tokenization + Embedding Pipeline (Image + Discrete actions)

  • Tokenization + Embedding Pipeline (Proprioception + Continuous actions)
  • Mu-law Encoding

Non-uniform quantization addresses the shortcomings of uniform quantization. The basic idea is to compress large signals while strongly amplifying small ones; because small-signal amplitudes are amplified more, their signal-to-noise ratio improves substantially. The common companding schemes are logarithmic: A-law and μ-law. The μ-law formula is y = ln(1 + μx) / ln(1 + μ), where x is the normalized quantizer input and y is the normalized quantizer output. The larger the constant μ, the greater the companding gain for small signals; μ = 255 is commonly used.
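A sketch of the full continuous-value pipeline (mu-law companding, binning, vocabulary shift), using μ = 255 from the companding discussion above; the Gato paper's exact constants may differ, and the function name is illustrative:

```python
import numpy as np

def mu_law_tokens(x, mu=255.0, n_bins=1024, shift=32000):
    """Mu-law encode to [-1, 1], bin into 1024 uniform bins,
    then shift past the 32000-subword text vocabulary."""
    x = np.clip(np.asarray(x, dtype=np.float64), -1.0, 1.0)
    # compress large magnitudes, expand small ones (symmetric mu-law)
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # uniform bins over [-1, 1]; astype(int) floors nonnegative values
    bins = np.minimum(((y + 1.0) / 2.0 * n_bins).astype(int), n_bins - 1)
    return bins + shift  # token ids in [32000, 33024)
```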

  • Local Position Embedding


Original Transformer:

  • neighbouring tokens have high similarity

  • more distant tokens have low similarity

but this changes once position encodings are added

This may explain why the local position encodings for action tokens are all the same: different local position encodings would introduce a bias and change the meaning of the original action tokens.

Training Details

  • Hardware: a 16×16 TPU v3 slice

  • Training steps: 1M

  • Batch size: 512

  • Token sequence length: 1024

  • Training time: 4 days

Datasets


From the dataset breakdown, a large share of the sampling weight falls on 3D gaming environments, while only about 15% goes to text data.

Training Procedure

  • Mimic expert trajectories from SOTA or near-SOTA agents

  • Train only on episodes whose return is at least 80% of the expert return (a filtering sketch follows below)

Decision Transformers, by contrast, train on episodes of any quality, because the target total reward is part of the sequence.
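A one-liner sketch of the episode filter described above; `total_return` and the function name are illustrative, not the paper's code:

```python
def filter_episodes(episodes, expert_return, threshold=0.8):
    """Keep only near-expert demonstrations (>= 80% of expert return)."""
    cutoff = threshold * expert_return
    return [ep for ep in episodes if ep.total_return >= cutoff]
```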

Gato, however, cannot learn from bad samples, because it has no notion of reward.

So a fundamental limitation of Gato is that it learns only from good samples.

Learning only from good samples means the model never sees how failures arise; if it is randomly initialized into a bad zone, it is likely to go wrong, because it has never encountered such states before.

RoboCat later made substantial improvements on this front.

  • Future work: supposedly possible to learn via RL from scratch

Extrinsic rewards: environment rewards

In sparse-reward settings this takes very long to learn (intrinsic reward modelling might help)

I remain skeptical about the learning-from-scratch direction.

Is the general agent as good as the expert?


因環(huán)境而異,在比較困難的環(huán)境中表現(xiàn)其實沒那么好,但總體來說還是可以的

Is GATO scalable?


The answer is yes

  • Normalized return (see the aggregation sketch below):

    – For each task, compute the model's performance as a percentage of the expert score

    – Average the percentage scores across all tasks of a domain

    – Mean-aggregate the percentage scores across all domains

  • Increasing the number of tokens trained on = increased performance

  • Increasing model size = increased performance
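A minimal sketch of this aggregation, assuming `scores` and `expert_scores` map task name to return, and `domains` maps domain name to its list of task names (all names illustrative):

```python
import numpy as np

def normalized_return(scores, expert_scores, domains):
    """Percentage-of-expert per task, averaged per domain,
    then mean-aggregated across domains."""
    domain_means = []
    for tasks in domains.values():
        pct = [100.0 * scores[t] / expert_scores[t] for t in tasks]
        domain_means.append(np.mean(pct))  # average within a domain
    return float(np.mean(domain_means))    # mean across domains
```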

Can GATO generalize (zero-shot)?

  • Not too well on the held-out tasks; zero-shot transfer is a common weakness of ML techniques in general

Generalizability experiments(few-shot)

  • Ideally the agent should learn new tasks by conditioning on different prompts

  • But the sequence lengths of tokenized demonstrations are too long

  • The maximum context length is insufficient to describe a task

  • Instead, fine-tune the agent's parameters on the new task, and evaluate the fine-tuned model's performance in the environment

  • Three models are compared:

    – same domain only data: pretrained only on data from the same domain as the fine-tuning task

    – no control data: pretrained only on non-control data

    – scratch: no pretraining at all

  • Non-image data: without the right proprioception data, training on other data is of no use

  • Image data: transfers differently across tasks


GATO調(diào)研報告的評論 (共 條)

分享到微博請遵守國家法律
镇安县| 镇安县| 时尚| 惠安县| 杂多县| 合水县| 定州市| 长兴县| 句容市| 肇源县| 彩票| 鄂托克前旗| 海口市| 紫云| 梁河县| 射洪县| 三门峡市| 新宁县| 上蔡县| 饶河县| 成安县| 长治市| 肃宁县| 西林县| 财经| 英吉沙县| 普宁市| 涟水县| 太白县| 微山县| 台湾省| 扶沟县| 望谟县| 武安市| 屯留县| 略阳县| 自贡市| 兴山县| 巨野县| 海门市| 庐江县|