
Train a Mario-playing RL Agent

2023-09-19 12:07 · Translated and posted by LSC2049

Authors: Yuansong Feng, Suraj Subramanian, Howard Wang, Steven Guo.

This tutorial walks you through the fundamentals of Deep Reinforcement Learning. At the end, you will implement an AI-powered Mario (using Double Deep Q-Networks) that can play the game by itself.

Although no prior knowledge of RL is necessary for this tutorial, you can familiarize yourself with these RL concepts, and have this handy cheatsheet as your companion. The full code is available here.

RL Definitions

Environment: The world that an agent interacts with and learns from.

Action a: How the Agent responds to the Environment. The set of all possible Actions is called the action-space.

State s: The current characteristic of the Environment. The set of all possible States the Environment can be in is called the state-space.

Reward r: Reward is the key feedback from Environment to Agent. It is what drives the Agent to learn and to change its future action. An aggregation of rewards over multiple time steps is called the Return.

Optimal Action-Value function Q*(s, a): Gives the expected return if you start in state s, take an arbitrary action a, and then for each future time step take the action that maximizes returns. Q can be said to stand for the "quality" of the action in a state. We try to approximate this function.
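For reference, the standard definition of this function (not spelled out in the tutorial itself) is:

$Q^{*}(s, a) = \max_{\pi} \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^{t} r_t \,\middle|\, s_0 = s,\ a_0 = a,\ \pi\right]$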

Environment

Initialize Environment

In Mario, the environment consists of tubes, mushrooms and other components.

When Mario makes an action, the environment responds with the changed (next) state, reward and other info.
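A minimal sketch of creating and stepping the environment (assuming gym-super-mario-bros and nes_py are installed; the exact return signature of step() depends on your gym version):

```python
import gym_super_mario_bros
from nes_py.wrappers import JoypadSpace

# Create the environment for stage 1-1.
env = gym_super_mario_bros.make("SuperMarioBros-1-1-v0")

# Restrict the action space to two moves: walk right, and jump right.
env = JoypadSpace(env, [["right"], ["right", "A"]])

env.reset()
# Depending on the gym version, step() returns a 4- or 5-tuple;
# the first element is the next state, an RGB frame of the screen.
step_result = env.step(env.action_space.sample())
next_state = step_result[0]
print(next_state.shape)
```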

Preprocess Environment

Environment data is returned to the agent in next_state. As you saw above, each state is represented by a [3, 240, 256] size array. Often that is more information than our agent needs; for instance, Mario's actions do not depend on the color of the pipes or the sky!

We use Wrappers to preprocess environment data before sending it to the agent.

GrayScaleObservation is a common wrapper to transform an RGB image to grayscale; doing so reduces the size of the state representation without losing useful information. New size of each state: [1, 240, 256].

ResizeObservation downsamples each observation into a square image. New size: [1, 84, 84].

SkipFrame is a custom wrapper that inherits from gym.Wrapper and implements the step() function. Because consecutive frames don't vary much, we can skip n intermediate frames without losing much information. The n-th frame aggregates the rewards accumulated over each skipped frame.
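A sketch of what such a wrapper can look like, assuming the older 4-tuple gym step API:

```python
import gym

class SkipFrame(gym.Wrapper):
    """Repeat the same action for `skip` frames and sum the rewards."""

    def __init__(self, env, skip):
        super().__init__(env)
        self._skip = skip

    def step(self, action):
        total_reward = 0.0
        obs, done, info = None, False, {}
        for _ in range(self._skip):
            # Step the underlying environment with the same action.
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```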

FrameStack is a wrapper that allows us to squash consecutive frames of the environment into a single observation point to feed to our learning model. This way, we can identify if Mario was landing or jumping based on the direction of his movement in the previous several frames.

After applying the above wrappers to the environment, the final wrapped state consists of 4 gray-scaled consecutive frames stacked together, as shown above in the image on the left. Each time Mario makes an action, the environment responds with a state of this structure. The structure is represented by a 3-D array of size [4, 84, 84].
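One way to chain the wrappers (the tutorial ships its own GrayScaleObservation and ResizeObservation implementations; the gym.wrappers versions used here behave similarly, though exact output shapes can differ slightly):

```python
from gym.wrappers import FrameStack, GrayScaleObservation, ResizeObservation

# Apply the wrappers in order: skip frames, grayscale, downsample, stack.
env = SkipFrame(env, skip=4)
env = GrayScaleObservation(env)
env = ResizeObservation(env, shape=84)
env = FrameStack(env, num_stack=4)
```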


Agent

We create a class Mario to represent our agent in the game. Mario should be able to:

  • Act according to the optimal action policy based on the current state (of the environment).

  • Remember experiences. Experience = (current state, current action, reward, next state). Mario caches and later recalls his experiences to update his action policy.

  • Learn a better action policy over time.


In the following sections, we will populate Mario's parameters and define his functions.
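A rough skeleton of the agent class. The attribute names mirror the tutorial's, but the hyperparameter values here are illustrative defaults; MarioNet is only defined later, under the Neural Network section, and act(), cache(), recall() and learn() are sketched in the sections that follow:

```python
import torch
from collections import deque

class Mario:
    def __init__(self, state_dim, action_dim, save_dir):
        self.state_dim = state_dim            # e.g. (4, 84, 84)
        self.action_dim = action_dim          # number of discrete actions
        self.save_dir = save_dir              # pathlib.Path for checkpoints

        self.memory = deque(maxlen=100_000)   # replay buffer
        self.batch_size = 32

        self.exploration_rate = 1.0           # epsilon for epsilon-greedy acting
        self.exploration_rate_decay = 0.99999975
        self.exploration_rate_min = 0.1
        self.gamma = 0.9                      # discount factor
        self.curr_step = 0

        # MarioNet (a small CNN) is defined under the "Neural Network" section below.
        self.net = MarioNet(self.state_dim, self.action_dim).float()
        self.optimizer = torch.optim.Adam(self.net.parameters(), lr=0.00025)
        self.loss_fn = torch.nn.SmoothL1Loss()
```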

Act

For any given state, an agent can choose to do the most optimal action (exploit) or a random action (explore).

Mario randomly explores with a chance of self.exploration_rate; when he chooses to exploit, he relies on MarioNet (implemented in the Learn section) to provide the most optimal action.
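A sketch of the epsilon-greedy act() method, assuming the MarioNet forward pass accepts a model="online"/"target" switch as described in the Learn section:

```python
import numpy as np
import torch

class Mario(Mario):  # extend the skeleton above
    def act(self, state):
        """Epsilon-greedy action selection."""
        if np.random.rand() < self.exploration_rate:
            # EXPLORE: pick a random action index
            action_idx = np.random.randint(self.action_dim)
        else:
            # EXPLOIT: ask the online network for the highest-valued action
            state = torch.tensor(np.asarray(state), dtype=torch.float32).unsqueeze(0)
            action_values = self.net(state, model="online")
            action_idx = torch.argmax(action_values, dim=1).item()

        # Decay epsilon, but never below a floor
        self.exploration_rate *= self.exploration_rate_decay
        self.exploration_rate = max(self.exploration_rate_min, self.exploration_rate)

        self.curr_step += 1
        return action_idx
```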

Cache and Recall

These two functions serve as Mario's "memory" process.

cache(): Each time Mario performs an action, he stores the experience to his memory. His experience includes the current state, the action performed, the reward from the action, the next state, and whether the game is done.

recall(): Mario randomly samples a batch of experiences from his memory, and uses that to learn the game.
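A sketch of the two memory methods; batch_size comes from the agent skeleton above, and the tensor conversions are one reasonable choice rather than necessarily the tutorial's exact ones:

```python
import random
import numpy as np
import torch

class Mario(Mario):  # extend with memory methods
    def cache(self, state, next_state, action, reward, done):
        """Store a transition as tensors so it can be batched later."""
        state = torch.tensor(np.asarray(state), dtype=torch.float32)
        next_state = torch.tensor(np.asarray(next_state), dtype=torch.float32)
        self.memory.append(
            (state, next_state,
             torch.tensor([action]),
             torch.tensor([reward], dtype=torch.float32),
             torch.tensor([done]))
        )

    def recall(self):
        """Sample a random batch and stack it into batched tensors."""
        batch = random.sample(self.memory, self.batch_size)
        state, next_state, action, reward, done = map(torch.stack, zip(*batch))
        return state, next_state, action.squeeze(), reward.squeeze(), done.squeeze()
```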

Learn

Mario uses the DDQN algorithm under the hood. DDQN uses two ConvNets - Q_online and Q_target - that independently approximate the optimal action-value function.

In our implementation, we share the feature generator, features, across Q_online and Q_target, but maintain separate FC classifiers for each. θ_target (the parameters of Q_target) is frozen to prevent updates through backprop. Instead, it is periodically synced with θ_online (more on this later).

Neural Network
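A sketch of a network with a shared convolutional feature generator and separate FC heads, as described above; the layer sizes follow the classic DQN architecture and are an assumption rather than necessarily the tutorial's exact values:

```python
import copy
import torch.nn as nn

class MarioNet(nn.Module):
    """DDQN network: a shared conv feature extractor with separate
    fully-connected heads for the online and target Q-functions."""

    def __init__(self, input_dim, output_dim):
        super().__init__()
        c, h, w = input_dim  # e.g. (4, 84, 84)

        # Shared convolutional feature generator
        self.features = nn.Sequential(
            nn.Conv2d(c, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )

        # Separate FC heads for the online and target networks
        # (3136 = 64 * 7 * 7 for an 84x84 input)
        self.fc_online = nn.Sequential(
            nn.Linear(3136, 512), nn.ReLU(), nn.Linear(512, output_dim)
        )
        self.fc_target = copy.deepcopy(self.fc_online)

        # The target head is never updated by backprop, only by syncing.
        for p in self.fc_target.parameters():
            p.requires_grad = False

    def forward(self, x, model):
        x = self.features(x)
        if model == "online":
            return self.fc_online(x)
        return self.fc_target(x)
```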

TD Estimate & TD Target

Two values are involved in learning:

TD Estimate - the predicted optimal Q* for a given state s:

$\mathrm{TD}_e = Q^{*}_{\mathrm{online}}(s, a)$

TD Target - aggregation of the current reward and the estimated Q* in the next state s':

$a' = \mathrm{argmax}_{a}\, Q_{\mathrm{online}}(s', a)$

$\mathrm{TD}_t = r + \gamma\, Q^{*}_{\mathrm{target}}(s', a')$

Because we don't know what the next action a' will be, we use the action a' that maximizes Q_online in the next state s'.

Notice we use the @torch.no_grad() decorator on td_target() to disable gradient calculations here (because we don't need to backpropagate on θ_target).
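A sketch of the two quantities as methods, using the batch_size and gamma attributes from the agent skeleton (masking terminal states with (1 - done) is a common convention and an assumption here):

```python
import torch

class Mario(Mario):  # extend with the two TD quantities
    def td_estimate(self, state, action):
        """Q_online(s, a) for the actions that were actually taken."""
        batch_idx = torch.arange(0, self.batch_size)
        return self.net(state, model="online")[batch_idx, action]

    @torch.no_grad()
    def td_target(self, reward, next_state, done):
        """r + gamma * Q_target(s', a'), with a' chosen by the online network."""
        best_action = torch.argmax(self.net(next_state, model="online"), dim=1)
        next_q = self.net(next_state, model="target")[
            torch.arange(0, self.batch_size), best_action
        ]
        return (reward + (1 - done.float()) * self.gamma * next_q).float()
```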

Updating the model

As Mario samples inputs from his replay buffer, we compute TD_t and TD_e and backpropagate this loss down Q_online to update its parameters θ_online (α is the learning rate lr passed to the optimizer):

$\theta_{\mathrm{online}} \leftarrow \theta_{\mathrm{online}} + \alpha\, \nabla\!\left(\mathrm{TD}_e - \mathrm{TD}_t\right)$

θ_target does not update through backpropagation. Instead, we periodically copy θ_online to θ_target:

$\theta_{\mathrm{target}} \leftarrow \theta_{\mathrm{online}}$
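A sketch of the two update routines; loss_fn and optimizer are the ones set up in the agent skeleton (SmoothL1Loss and Adam there are illustrative choices):

```python
class Mario(Mario):  # extend with the parameter updates
    def update_Q_online(self, td_estimate, td_target):
        """One gradient step on theta_online using the loss between TD_e and TD_t."""
        loss = self.loss_fn(td_estimate, td_target)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()

    def sync_Q_target(self):
        """Copy the online FC head's weights into the (frozen) target head."""
        self.net.fc_target.load_state_dict(self.net.fc_online.state_dict())
```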

Save checkpoint
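A sketch of a checkpointing method, assuming save_dir is a pathlib.Path that already exists:

```python
import torch

class Mario(Mario):  # extend with checkpointing
    def save(self):
        """Write the network weights and current epsilon to disk."""
        save_path = self.save_dir / f"mario_net_{self.curr_step}.chkpt"
        torch.save(
            dict(model=self.net.state_dict(),
                 exploration_rate=self.exploration_rate),
            save_path,
        )
        print(f"MarioNet saved to {save_path} at step {self.curr_step}")
```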

Putting it all together
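A sketch of a learn() method that ties recall, the TD estimate/target, the gradient update, the periodic target sync and checkpointing together; the burn-in, learn, sync and save intervals are illustrative values, not necessarily the tutorial's:

```python
class Mario(Mario):  # tie the pieces together into one learning step
    # Illustrative schedule constants
    burnin = 1e4       # minimum steps before any training
    learn_every = 3    # gradient step every 3 environment steps
    sync_every = 1e4   # sync theta_target with theta_online
    save_every = 5e5   # write a checkpoint

    def learn(self):
        if self.curr_step % self.sync_every == 0:
            self.sync_Q_target()
        if self.curr_step % self.save_every == 0:
            self.save()
        if self.curr_step < self.burnin or self.curr_step % self.learn_every != 0:
            return None, None

        # Sample a batch from memory
        state, next_state, action, reward, done = self.recall()

        # Compute TD estimate and TD target, then take a gradient step
        td_est = self.td_estimate(state, action)
        td_tgt = self.td_target(reward, next_state, done)
        loss = self.update_Q_online(td_est, td_tgt)

        return td_est.mean().item(), loss
```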

Logging

Let's play!

In this example we run the training loop for 40 episodes, but for Mario to truly learn the ways of his world, we suggest running the loop for at least 40,000 episodes!
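A sketch of such a training loop, assuming the 4-tuple gym step API and the wrapped environment and agent defined above ("flag_get" is the gym-super-mario-bros info flag for reaching the end of a level):

```python
from pathlib import Path

save_dir = Path("checkpoints")
save_dir.mkdir(exist_ok=True)

mario = Mario(state_dim=(4, 84, 84), action_dim=env.action_space.n, save_dir=save_dir)

episodes = 40
for e in range(episodes):
    state = env.reset()

    while True:
        # 1. Pick an action for the current state
        action = mario.act(state)

        # 2. Step the environment
        next_state, reward, done, info = env.step(action)

        # 3. Remember the transition and (possibly) learn from a sampled batch
        mario.cache(state, next_state, action, reward, done)
        mario.learn()

        state = next_state

        # End the episode when Mario dies or reaches the flag
        if done or info.get("flag_get", False):
            break
```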

Conclusion

In this tutorial, we saw how we can use PyTorch to train a game-playing AI. You can use the same methods to train an AI to play any of the games at the OpenAI gym. Hope you enjoyed this tutorial, feel free to reach us at our github!

Total running time of the script: (1 minute 50.444 seconds)

Original: https://pytorch.org/tutorials/intermediate/mario_rl_tutorial.html

GitHub: https://github.com/yfeng997/MadMario
