Key Concepts

On this page, we'll cover the key concepts to help you understand how RLlib works and how to use it. In RLlib, you use Algorithms to learn how to solve problem environments. The algorithms use policies to select actions. Given a policy, rollouts throughout an environment produce sample batches (or trajectories) of experiences. You can also customize the training_step of your RL experiments.


Environments

Solving a problem in RL begins with an environment. In the simplest definition of RL:

An agent interacts with an environment and receives a reward.

An environment in RL is the agent's world; it is a simulation of the problem to be solved.

An RLlib environment consists of:

  1. all possible actions (action space)

  2. a complete description of the environment, nothing hidden (state space)

  3. an observation by the agent of certain parts of the state (observation space)

  4. reward, which is the only feedback the agent receives per action.

The model that tries to maximize the expected sum over all future rewards is called a policy. The policy is a function mapping the environment's observations to an action to take, usually written π(s(t)) -> a(t). Below is a diagram of the RL iterative learning process.

The RL simulation feedback loop repeatedly collects data, for one (single-agent case) or multiple (multi-agent case) policies, trains the policies on these collected data, and makes sure the policies' weights are kept in sync. Thereby, the collected environment data contains observations, taken actions, received rewards and so-called done flags, indicating the boundaries of different episodes the agents play through in the simulation.

The simulation iteration of action -> reward -> next state -> train -> repeat, until the end state, is called an episode, or in RLlib, a rollout. The most common API to define environments is the Farama-Foundation Gymnasium API, which we also use in most of our examples.

Algorithms

Algorithms bring all RLlib components together, making learning of different tasks accessible via RLlib's Python API and its command line interface (CLI). Each Algorithm class is managed by its respective AlgorithmConfig; for example, to configure a PPO instance, you should use the PPOConfig class. An Algorithm sets up its rollout workers and optimizers, and collects training metrics. Algorithms also implement the Tune Trainable API for easy experiment management.

You have three ways to interact with an algorithm. You can use the basic Python API or the command line to train it, or you can use Ray Tune to tune hyperparameters of your reinforcement learning algorithm. The following examples show two of these equivalent ways of interacting with PPO, which implements the proximal policy optimization algorithm in RLlib: through the basic Python API and through Ray Tune.

Basic RLlib Algorithm

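Below is a minimal sketch of the basic Python API, assuming a Ray 2.x-era RLlib (exact config method names can differ slightly between versions):

    from ray.rllib.algorithms.ppo import PPOConfig

    # Build a PPO Algorithm for CartPole and train it by calling train()
    # in a loop; each call runs one training_step() internally.
    config = PPOConfig().environment("CartPole-v1")
    algo = config.build()

    for _ in range(5):
        result = algo.train()
        print(result["episode_reward_mean"])

    algo.stop()
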
RLlib Algorithms and Tune

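And a sketch of the same experiment run through Ray Tune, which calls training_step() repeatedly until the stop criteria are met and can additionally sweep hyperparameters (assumes the Ray 2.x Tuner API with ray.air.RunConfig; newer releases expose RunConfig elsewhere):

    from ray import air, tune
    from ray.rllib.algorithms.ppo import PPOConfig

    # Let Tune sweep over two learning rates while managing the PPO trials.
    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .training(lr=tune.grid_search([1e-4, 1e-5]))
    )

    tuner = tune.Tuner(
        "PPO",
        param_space=config.to_dict(),
        run_config=air.RunConfig(stop={"training_iteration": 10}),
    )
    results = tuner.fit()
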
RLlib Algorithm classes coordinate the distributed workflow of running rollouts and optimizing policies. Algorithm classes leverage parallel iterators to implement the desired computation pattern. The following figure shows synchronous sampling, the simplest of these patterns:


Synchronous Sampling (e.g., A2C, PG, PPO)

RLlib uses Ray actors to scale training from a single core to many thousands of cores in a cluster. You can configure the parallelism used for training by changing the num_workers parameter. Check out our scaling guide for more details.
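As an example, the following sketch requests eight parallel rollout workers (in recent 2.x releases the config setting is named num_rollout_workers; num_workers is the legacy dict key):

    from ray.rllib.algorithms.ppo import PPOConfig

    # Eight remote rollout workers (Ray actors) collect samples in parallel.
    config = PPOConfig().environment("CartPole-v1").rollouts(num_rollout_workers=8)
    algo = config.build()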

RL Modules

RLModules are framework-specific neural network containers. In a nutshell, they carry the neural networks and define how to use them during the three phases that occur in reinforcement learning: exploration, inference, and training. A minimal RL Module can contain a single neural network and define its exploration, inference, and training logic to only map observations to actions. Since RL Modules can map observations to actions, they naturally implement reinforcement learning policies in RLlib and can therefore be found in the RolloutWorker, where their exploration and inference logic is used to sample from an environment. The second place in RLlib where RL Modules commonly occur is the Learner, where their training logic is used in training the neural network. RL Modules extend to the multi-agent case, where a single MultiAgentRLModule contains multiple RL Modules. The following figure is a rough sketch of how the above can look in practice:


Note

RL Modules are currently in alpha stage. They are wrapped in legacy Policy objects to be used in RolloutWorker for sampling. This should be transparent to the user, but the following Policy Evaluation section still refers to these legacy Policy objects.

Policy Evaluation

Given an environment and policy, policy evaluation produces batches of experiences. This is your classic "environment interaction loop". Efficient policy evaluation can be burdensome to get right, especially when leveraging vectorization, RNNs, or when operating in a multi-agent environment. RLlib provides a RolloutWorker class that manages all of this, and this class is used in most RLlib algorithms.

You can use rollout workers standalone to produce batches of experiences. This can be done by calling worker.sample() on a worker instance, or worker.sample.remote() in parallel on worker instances created as Ray actors (see WorkerSet).

Here is an example of creating a set of rollout workers and using them to gather experiences in parallel. The trajectories are concatenated, the policy learns on the trajectory batch, and then we broadcast the policy weights to the workers for the next round of rollouts:

Sample Batches

Whether running in a single process or a large cluster, all data in RLlib is interchanged in the form of sample batches. Sample batches encode one or more fragments of a trajectory. Typically, RLlib collects batches of size rollout_fragment_length from rollout workers, and concatenates one or more of these batches into a batch of size train_batch_size that is the input to SGD.

A typical sample batch looks something like the following when summarized. Since all values are kept in arrays, this allows for efficient encoding and transmission across the network:
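As a rough illustration, here is a tiny, hand-made single-agent batch with the typical per-timestep columns (the values are made up, not real rollout data; newer RLlib versions split the done flags into terminateds/truncateds):

    import numpy as np
    from ray.rllib.policy.sample_batch import SampleBatch

    batch = SampleBatch({
        SampleBatch.OBS: np.zeros((3, 4), dtype=np.float32),   # CartPole-like observations
        SampleBatch.ACTIONS: np.array([0, 1, 0]),
        SampleBatch.REWARDS: np.array([1.0, 1.0, 1.0], dtype=np.float32),
        SampleBatch.DONES: np.array([False, False, True]),     # episode-boundary flags
    })
    print(batch.count)  # -> 3 timesteps in this fragment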

In multi-agent mode, sample batches are collected separately for each individual policy. These batches are wrapped up together in a MultiAgentBatch, serving as a container for the individual agents' sample batches.


Training Step Method (Algorithm.training_step())

Note

It's important to have a good understanding of the basic ray core methods before reading this section. Furthermore, we utilize concepts such as the SampleBatch (and its more advanced sibling: the MultiAgentBatch), RolloutWorker, and Algorithm, which can be read about on this page and the rollout worker reference docs.

Finally, developers who are looking to implement custom algorithms should familiarize themselves with the Policy and Model classes.


What is it?

The training_step() method of the Algorithm class defines the repeatable execution logic that sits at the core of any algorithm. Think of it as the Python implementation of an algorithm's pseudocode that you can find in research papers. You can use training_step() to express how you want to coordinate the collection of samples from the environment(s), the movement of this data to other parts of the algorithm, and the updates and management of your policy's weights across the different distributed components.

In short, a developer will need to override/modify the training_step() method if they want to make custom changes to an existing algorithm, write their own algorithm from scratch, or implement some algorithm from a paper.

When is training_step() invoked?

The Algorithm's training_step() method is called:

  1. when the train() method of Algorithm is called (e.g. "manually" by a user that has constructed an Algorithm instance).

  2. when an RLlib Algorithm is being run by Ray Tune. training_step() will be continuously called until the Ray Tune stop criteria are met.

Key Subconcepts

In the following, using the example of VPG ("vanilla policy gradient"), we will try to illustrate how to use the training_step() method to implement this algorithm in RLlib. The "vanilla policy gradient" algorithm can be thought of as a sequence of repeating steps, or dataflow, of:

  1. Sampling (to collect data from an env)

  2. Updating the Policy (to learn a behavior)

  3. Broadcasting the updated Policy's weights (to make sure all distributed units have the same weights again)

  4. Metrics reporting (returning relevant stats from all the above operations with regards to performance and runtime)

An example implementation of VPG could look like the following:
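Here is a minimal sketch of such a training_step(), using the utilities discussed below (synchronous_parallel_sample, train_one_step, and WorkerSet.sync_weights); exact signatures vary across RLlib versions:

    from ray.rllib.algorithms.algorithm import Algorithm
    from ray.rllib.execution.rollout_ops import synchronous_parallel_sample
    from ray.rllib.execution.train_ops import train_one_step
    from ray.rllib.utils.annotations import override
    from ray.rllib.utils.typing import ResultDict


    class VPG(Algorithm):
        @override(Algorithm)
        def training_step(self) -> ResultDict:
            # 1. Sampling: collect experiences from all rollout workers.
            train_batch = synchronous_parallel_sample(
                worker_set=self.workers,
                max_env_steps=self.config["train_batch_size"],
            )

            # 2. Updating the Policy: one update on the local worker's policy.
            train_results = train_one_step(self, train_batch)

            # 3. Broadcasting: sync the updated weights to all remote workers.
            self.workers.sync_weights()

            # 4. Metrics reporting.
            return train_results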


Note

Note that the training_step method is deep learning framework agnostic. This means that you should not write PyTorch- or TensorFlow-specific code inside this module, allowing for a strict separation of concerns and enabling us to use the same training_step() method for both TF and PyTorch versions of your algorithms. DL-framework-specific code should only be added to the Policy (e.g. in its loss function(s)) and Model (e.g. tf.keras or torch.nn neural network code) classes.

Let's further break down our above training_step() code. In the first step, we collect trajectory data from the environment(s):


Here, self.workers is a set of RolloutWorkers that are created in the Algorithm's setup() method (prior to calling training_step()). This WorkerSet is covered in greater depth on the WorkerSet documentation page. The utility function synchronous_parallel_sample can be used for parallel sampling in a blocking fashion across multiple rollout workers (it returns once all rollout workers are done sampling). It returns one final MultiAgentBatch resulting from concatenating n smaller MultiAgentBatches (exactly one from each remote rollout worker).

The train_batch is then passed to another utility function: train_one_step.
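From the sketch above:

    train_results = train_one_step(self, train_batch)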

Methods like train_one_step and multi_gpu_train_one_step are used for training our Policy. Further documentation with examples can be found on the train ops documentation page.

The training updates on the policy are only applied to its version inside self.workers.local_worker. Note that each WorkerSet has n remote workers and exactly one "local worker", and that each worker (remote and local ones) holds a copy of the policy.

Now that we updated the local policy (the copy in self.workers.local_worker), we need to make sure that the copies in all remote workers (self.workers.remote_workers) have their weights synchronized (from the local one):

By calling self.workers.sync_weights(), weights are broadcasted from the local worker to the remote workers. See the rollout worker reference docs for further details.

A dictionary is expected to be returned that contains the results of the training update. It maps keys of type str to values that are of type float, or to dictionaries of the same form, allowing for a nested structure.

For example, a results dictionary could map policy_ids to learning and sampling statistics for that policy:
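An illustrative (made-up) result for a setup with a single default policy might look like this:

    results = {
        "default_policy": {
            "learner_stats": {
                "policy_loss": 0.25,
                "vf_loss": 0.51,
                "entropy": 1.05,
            },
            "num_agent_steps_trained": 4000,
        },
    }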

Training Step Method Utilities

RLlib provides a collection of utilities that abstract away common tasks in RL training. In particular, if you would like to work with the various training_step methods or implement your own, it's recommended to first familiarize yourself with the following concepts:

Sample Batch: SampleBatch and MultiAgentBatch are the two types that we use for storing trajectory data in RLlib. All of our RLlib abstractions (policies, replay buffers, etc.) operate on these two types.

Rollout Workers: Rollout workers are an abstraction that wraps a policy (or policies in the case of multi-agent) and an environment. From a high level, we can use rollout workers to collect experiences from the environment by calling their sample() method, and we can train their policies by calling their learn_on_batch() method. By default, in RLlib, we create a set of workers that can be used for sampling and training. We create a WorkerSet object inside of setup, which is called when an RLlib algorithm is created. The WorkerSet has a local_worker and remote_workers if num_workers > 0 in the experiment config. In RLlib we typically use local_worker for training and remote_workers for sampling.

Train Ops: These are methods that improve the policy and update workers. The most basic operator, train_one_step, takes in as input a batch of experiences and emits a ResultDict with metrics as output. For training with GPUs, use multi_gpu_train_one_step. These methods use the learn_on_batch method of rollout workers to complete the training update.

Replay Buffers: RLlib provides a collection of replay buffers that can be used for storing and sampling experiences.


Key Concepts 關(guān)鍵概念的評論 (共 條)

分享到微博請遵守國家法律
信阳市| 竹山县| 大名县| 资讯 | 定南县| 维西| 厦门市| 通州市| 福州市| 磴口县| 新乡市| 监利县| 泾源县| 当雄县| 北海市| 宁陕县| 夏邑县| 商洛市| 石门县| 昭平县| 永德县| 莱芜市| 淅川县| 蕉岭县| 舞阳县| 嘉祥县| 松滋市| 昌乐县| 屯门区| 偃师市| 景德镇市| 多伦县| 和林格尔县| 内丘县| 都兰县| 珲春市| 江阴市| 汝南县| 平江县| 阿拉善盟| 和顺县|