

GATO調(diào)研報告

2023-06-27 20:50 Author: Chaton丫

GATO

Inspired by progress in large-scale language modeling, a similar approach is applied to building a single generalist model beyond the realm of text outputs. The agent, referred to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm, and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. The report describes the model and the data, and documents Gato's current capabilities.


本質(zhì)是一個token預(yù)測下一個token

Prediction Problem

  • GATO does not predict observations, only the tokens of the next action

Similar to Decision Transformers (except there are no rewards in the sequence)

It purely imitates the behavior of experts.

  • Instead of one-hot task identifiers, prompt conditioning is used

  • Similar to the T5 architecture: <prompt> + <sequence>

  • Prompt: tokens sampled from a demonstration episode, with 50% of prompts taken from the end of an episode and 50% sampled uniformly from within it (see the sampling sketch below)

In effect: goal-directed learning without rewards?
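To make the prompt conditioning concrete, here is a minimal sketch of the 50/50 sampling rule, assuming episodes are already tokenized; the function name `sample_prompt` and its arguments are illustrative, not the paper's code.

```python
import numpy as np

def sample_prompt(episode_tokens, prompt_len, rng=None):
    """Sample a prompt subsequence from a demonstration episode.

    Hypothetical sketch: 50% of the time the prompt is taken from the
    end of the episode (goal-like conditioning), 50% of the time it is
    a uniformly positioned window.
    """
    rng = rng or np.random.default_rng()
    n = len(episode_tokens)
    if rng.random() < 0.5:
        start = max(0, n - prompt_len)                        # tail of episode
    else:
        start = int(rng.integers(0, max(1, n - prompt_len)))  # uniform position
    return episode_tokens[start:start + prompt_len]
```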



Details

The training objective is a cross-entropy loss, as in supervised sequence modeling: minimize L(θ, B) over batches B of sequences,

$$\mathcal{L}(\theta,\mathcal{B}) = -\sum_{b=1}^{|\mathcal{B}|}\sum_{l=1}^{L} m(b,l)\,\log p_\theta\!\left(s_l^{(b)} \mid s_1^{(b)},\ldots,s_{l-1}^{(b)}\right)$$

where the mask m(b, l) is 1 only for text and action tokens, so the loss is applied only to tokens the model is expected to produce.

Tokenization


| Data Type | Method | Ordering | Range |
|---|---|---|---|
| Text | SentencePiece with 32000 subwords | Text order | [0, 32000) |
| Images | Split into non-overlapping 16×16 patches and processed as in ViT | Raster order | [-1, 1] for each pixel, divided by the square root of the patch size (i.e. 4) |
| Discrete values (e.g. Atari actions) | Flattened into sequences of integers | Row-major order | [0, 1024) |
| Continuous values (e.g. proprioceptive inputs) | Flattened into sequences of floating-point values | Row-major order | Mu-law encoded to [-1, 1], discretized into 1024 uniform bins, then shifted to [32000, 33024) |
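As a concrete illustration of the image row in the table, the following is a minimal sketch of patch extraction and normalization; the function name and the (H, W, C) uint8 input convention are assumptions.

```python
import numpy as np

def image_to_patch_tokens(image, patch=16):
    """Split an image into non-overlapping 16x16 patches in raster order.

    Pixels are scaled to [-1, 1] and divided by sqrt(patch size) = 4,
    following the tokenization table above. `image` is assumed to be an
    (H, W, C) uint8 array with H and W divisible by `patch`.
    """
    x = image.astype(np.float32) / 127.5 - 1.0  # scale pixels to [-1, 1]
    x = x / np.sqrt(patch)                      # divide by sqrt(16) = 4
    h, w, c = x.shape
    # raster order: left-to-right within a row of patches, then top-down
    return (x.reshape(h // patch, patch, w // patch, patch, c)
             .transpose(0, 2, 1, 3, 4)
             .reshape(-1, patch, patch, c))
```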

  • Episodes are presented to the agent in order of time (timesteps).

  • Timesteps in turn are presented in the following order:

    – Observations ([y1:k, x1:m, z1:n]) are ordered lexicographically by key; each item is sequenced as follows:

      • Text tokens (y1:k) are in the same order as the raw input text.

      • Image patch tokens (x1:m) are in raster order.

      • Tensors (z1:n) (such as discrete and continuous observations) are in row-major order.

    – Separator ('|'): a designated separator token is provided after observations.

    – Actions (a1:A) are tokenized as discrete or continuous values and are in row-major order.

A full sequence of tokens is thus given as the concatenation of data from T timesteps:

$$s_{1:L} = \big[\,y^1_{1:k},\; x^1_{1:m},\; z^1_{1:n},\; \text{'}|\text{'},\; a^1_{1:A},\;\ldots,\; y^T_{1:k},\; x^T_{1:m},\; z^T_{1:n},\; \text{'}|\text{'},\; a^T_{1:A}\,\big]$$

where L = T (k + m + n + 1 + A) is the total number of tokens.
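A minimal sketch of this interleaving, assuming each timestep arrives as a dict of pre-tokenized fields (the key names are illustrative):

```python
def build_token_sequence(timesteps, sep_token):
    """Concatenate T timesteps into one flat token sequence."""
    seq = []
    for t in timesteps:
        seq += t["text_tokens"]    # y_{1:k}, raw text order
        seq += t["image_tokens"]   # x_{1:m}, raster order
        seq += t["tensor_tokens"]  # z_{1:n}, row-major order
        seq.append(sep_token)      # '|' separator after observations
        seq += t["action_tokens"]  # a_{1:A}
    return seq                     # length L = T * (k + m + n + 1 + A)
```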

Embedding Inputs

  • A parameterized embedding function f(·; θe) is applied to each token

  • Text, discrete, and continuous-valued observations or actions are embedded via a lookup table into a learned vector embedding space

  • Image patches are embedded using a ResNet block to obtain one vector per patch

  • A learnable position encoding vector is added to each embedding (a sketch follows below)
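A minimal PyTorch sketch of this dispatch, with a single conv layer standing in for the paper's ResNet patch encoder (class and method names are assumptions):

```python
import torch
import torch.nn as nn

class TokenEmbedder(nn.Module):
    """Lookup table for discrete tokens, patch encoder for images,
    plus a learnable position encoding added to every embedding."""

    def __init__(self, vocab_size=33024, d_model=512, max_len=1024, patch=16):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, d_model)
        # stand-in for the ResNet block: one conv spanning a whole patch
        self.patch_enc = nn.Conv2d(3, d_model, kernel_size=patch)
        self.pos = nn.Parameter(torch.zeros(max_len, d_model))

    def embed_tokens(self, ids, offset=0):
        # ids: (seq_len,) integer token ids
        return self.lookup(ids) + self.pos[offset:offset + ids.shape[0]]

    def embed_patches(self, patches, offset=0):
        # patches: (num_patches, 3, patch, patch) normalized pixels
        e = self.patch_enc(patches).flatten(1)  # one vector per patch
        return e + self.pos[offset:offset + patches.shape[0]]
```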



  • Image Embedding

Similar to ViT

  • Tokenization + Embedding Pipeline (Image + Discrete actions)

  • Tokenization + Embedding Pipeline (Proprioception + Continuous actions)
  • Mu-law Encoding

Non-uniform quantization addresses the shortcomings of uniform quantization. The basic idea is to compress large signals while strongly amplifying small ones; because small-signal amplitudes are amplified more, their signal-to-noise ratio improves substantially. The common companding schemes are logarithmic: A-law and μ-law. The μ-law formula is y = ln(1 + μx) / ln(1 + μ), where x is the normalized quantizer input and y is the normalized quantizer output. The larger the constant μ, the greater the companding gain for small signals; μ = 255 is commonly used.
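A sketch of the full continuous-value pipeline (mu-law companding, binning, vocabulary shift), using μ = 255 from the companding discussion above; the Gato paper's exact constants may differ, and the function name is illustrative:

```python
import numpy as np

def mu_law_tokens(x, mu=255.0, n_bins=1024, shift=32000):
    """Mu-law encode to [-1, 1], bin into 1024 uniform bins,
    then shift past the 32000-subword text vocabulary."""
    x = np.clip(np.asarray(x, dtype=np.float64), -1.0, 1.0)
    # compress large magnitudes, expand small ones (symmetric mu-law)
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # uniform bins over [-1, 1]; astype(int) floors nonnegative values
    bins = np.minimum(((y + 1.0) / 2.0 * n_bins).astype(int), n_bins - 1)
    return bins + shift  # token ids in [32000, 33024)
```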

  • Local Position Embedding


Original Transformer:

  • neighbouring tokens have high similarity

  • more distant tokens have low similarity

but this changes once position encodings are added

This may explain why the local position encodings for action tokens are all the same: different local position encodings would introduce a bias and change the meaning of the original action tokens.

Training Details

  • Hardware: a 16×16 TPU v3 slice

  • Training steps: 1M

  • Batch size: 512

  • Token sequence length: 1024

  • Training time: 4 days

Datasets


From the dataset breakdown, a large share of the sampling weight falls on 3D gaming environments, while only about 15% goes to text data.

Training Procedure

  • Mimic expert trajectories from SOTA or near-SOTA agents

  • Train only on episodes whose return is at least 80% of the expert return (a filtering sketch follows below)

Decision Transformers, by contrast, train on episodes of any quality, because the target total reward is part of the sequence.
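A one-liner sketch of the episode filter described above; `total_return` and the function name are illustrative, not the paper's code:

```python
def filter_episodes(episodes, expert_return, threshold=0.8):
    """Keep only near-expert demonstrations (>= 80% of expert return)."""
    cutoff = threshold * expert_return
    return [ep for ep in episodes if ep.total_return >= cutoff]
```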

Gato, however, cannot learn from bad samples, because it has no notion of reward.

So a fundamental limitation of Gato is that it learns only from good samples.

Learning only from good samples means the model never sees how failures arise; if it is randomly initialized into a bad zone, it is likely to go wrong, because it has never encountered such states before.

RoboCat later made substantial improvements on this front.

  • Future work: supposedly possible to learn via RL from scratch

Extrinsic rewards: environment rewards

In sparse-reward settings this takes very long to learn (intrinsic reward modelling might help)

I remain skeptical about the learning-from-scratch direction.

Is the general agent as good as the expert?


因環(huán)境而異,在比較困難的環(huán)境中表現(xiàn)其實沒那么好,但總體來說還是可以的

Is GATO scalable?


The answer is yes

  • Normalized return (see the aggregation sketch below):

    – For each task, compute the model's performance as a percentage of the expert score

    – Average the percentage scores across all tasks of a domain

    – Mean-aggregate the percentage scores across all domains

  • Increasing the number of tokens trained on = increased performance

  • Increasing model size = increased performance
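A minimal sketch of this aggregation, assuming `scores` and `expert_scores` map task name to return, and `domains` maps domain name to its list of task names (all names illustrative):

```python
import numpy as np

def normalized_return(scores, expert_scores, domains):
    """Percentage-of-expert per task, averaged per domain,
    then mean-aggregated across domains."""
    domain_means = []
    for tasks in domains.values():
        pct = [100.0 * scores[t] / expert_scores[t] for t in tasks]
        domain_means.append(np.mean(pct))  # average within a domain
    return float(np.mean(domain_means))    # mean across domains
```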

Can GATO generalize (zero-shot)?

  • Not too well on the held-out tasks; zero-shot transfer is a common weakness of ML techniques in general

Generalizability experiments(few-shot)

  • Ideally the agent should learn new tasks by conditioning on different prompts

  • But the sequence lengths of tokenized demonstrations are too long

  • The maximum context length is insufficient to describe a task

  • Instead, fine-tune the agent's parameters on the new task, and evaluate the fine-tuned model's performance in the environment

  • Three models are compared:

    – same domain only data: pretrained only on data from the same domain as the fine-tuning task

    – no control data: pretrained only on non-control data

    – scratch: no pretraining at all

  • Non-image data: without the right proprioception data, training on other data is of no use

  • Image data: transfers differently across tasks


GATO調(diào)研報告的評論 (共 條)

分享到微博請遵守國家法律
镇安县| 镇安县| 时尚| 惠安县| 杂多县| 合水县| 定州市| 长兴县| 句容市| 肇源县| 彩票| 鄂托克前旗| 海口市| 紫云| 梁河县| 射洪县| 三门峡市| 新宁县| 上蔡县| 饶河县| 成安县| 长治市| 肃宁县| 西林县| 财经| 英吉沙县| 普宁市| 涟水县| 太白县| 微山县| 台湾省| 扶沟县| 望谟县| 武安市| 屯留县| 略阳县| 自贡市| 兴山县| 巨野县| 海门市| 庐江县|