@@ -39,6 +39,10 @@ as well as multi-GPU training on multi-node (GPU) clusters when using the `Anysc
+-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+
| :ref:`BC (Behavior Cloning) <bc>` | |single_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| |discr_actions| |
+-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+
+ | :ref:`CQL (Conservative Q-Learning) <cql>` | |single_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| |
+ +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+
+ | :ref:`IQL (Implicit Q-Learning) <iql>` | |single_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| |
+ +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+
| :ref:`MARWIL (Monotonic Advantage Re-Weighted Imitation Learning) <marwil>` | |single_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| |discr_actions| |
+-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+
| **Algorithm Extensions and -Plugins** |
@@ -183,7 +187,7 @@ Asynchronous Proximal Policy Optimization (APPO)
In a training iteration, APPO requests samples from all EnvRunners asynchronously and the collected episode
samples are returned to the main algorithm process as Ray references rather than actual objects available on the local process.
APPO then passes these episode references to the Learners for asynchronous updates of the model.
- RLlib doesn't always synch back the weights to the EnvRunners right after a new model version is available.
+ RLlib doesn't always sync back the weights to the EnvRunners right after a new model version is available.
To account for the EnvRunners being off-policy, APPO uses a procedure called v-trace,
`described in the IMPALA paper <https://arxiv.org/abs/1802.01561>`__.
APPO scales out on both axes, supporting multiple EnvRunners for sample collection and multiple GPU- or CPU-based Learners
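As a rough sketch of those two scaling axes, an APPO setup on a recent Ray release (new API stack) could look like the following; the environment name, worker counts, and the printed metric key are placeholders rather than values taken from this change:

.. code-block:: python

    from ray.rllib.algorithms.appo import APPOConfig

    # Several EnvRunners collect episodes asynchronously, while separate Learner
    # processes receive the episode references and update the model.
    config = (
        APPOConfig()
        .environment("CartPole-v1")        # placeholder environment
        .env_runners(num_env_runners=4)    # parallel, asynchronous sample collection
        .learners(num_learners=1)          # CPU- or GPU-based Learner processes
    )

    algo = config.build()
    result = algo.train()
    # Metric layout differs between Ray versions, hence the defensive lookup.
    print(result.get("env_runners", {}).get("episode_return_mean"))
    algo.stop()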
@@ -363,6 +367,30 @@ Conservative Q-Learning (CQL)
    :members: training


+ .. _iql:
+
+ Implicit Q-Learning (IQL)
+ -------------------------
+ `[paper] <https://arxiv.org/abs/2110.06169>`__
+ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/iql/iql.py>`__
+
+ **IQL architecture:** IQL (Implicit Q-Learning) is an offline RL algorithm that never needs to evaluate actions outside of
+ the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through
+ generalization. Instead of standard TD-error minimization, it introduces a value function trained through expectile regression,
+ which yields a conservative estimate of returns. This allows policy improvement through advantage-weighted behavior cloning,
+ ensuring safer generalization without explicit exploration.
+
+ The `IQLLearner` replaces the usual TD-based value loss with an expectile regression loss and trains the policy to imitate
+ high-advantage actions, enabling substantial performance gains over the behavior policy using only in-dataset actions.
+
+ **Tuned examples:**
+ `Pendulum-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/iql/pendulum_iql.py>`__
+
+ **IQL-specific configs** (and :ref:`generic algorithm settings <rllib-algo-configuration-generic-settings>`):
+
+ .. autoclass:: ray.rllib.algorithms.iql.iql.IQLConfig
+     :members: training
+
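The expectile regression mentioned above is straightforward to illustrate outside of RLlib. The following is a minimal PyTorch sketch of the asymmetric loss shape only, not the actual `IQLLearner` code; ``tau=0.7`` is a typical value from the paper:

.. code-block:: python

    import torch

    def expectile_loss(q_values, v_values, tau=0.7):
        # Asymmetric squared loss |tau - 1(u < 0)| * u^2 with u = Q(s, a) - V(s).
        # For tau > 0.5, errors where Q exceeds V get the larger weight, so V is
        # pulled toward an upper expectile of the in-dataset action values.
        diff = q_values - v_values
        weight = torch.abs(tau - (diff < 0).float())
        return (weight * diff.pow(2)).mean()

    # Toy usage with random tensors standing in for critic outputs.
    q = torch.randn(32)
    v = torch.randn(32, requires_grad=True)
    loss = expectile_loss(q, v)
    loss.backward()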

.. _marwil:

Monotonic Advantage Re-Weighted Imitation Learning (MARWIL)
@@ -376,7 +404,7 @@ Monotonic Advantage Re-Weighted Imitation Learning (MARWIL)

**MARWIL architecture:** MARWIL is a hybrid imitation learning and policy gradient algorithm suitable for training on
batched historical data. When the ``beta`` hyperparameter is set to zero, the MARWIL objective reduces to plain
- imitation learning (see `BC`_). MARWIL uses Ray.Data to tap into its parallel data
+ imitation learning (see `BC`_). MARWIL uses Ray Data to tap into its parallel data
processing capabilities. In one training iteration, MARWIL reads episodes in parallel from offline files,
for example `parquet <https://parquet.apache.org/>`__, by the n DataWorkers. Connector pipelines preprocess these
episodes into train batches and send these as data iterators directly to the n Learners for updating the model.
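To make the ``beta`` behavior concrete, a MARWIL setup that reads previously recorded episodes could look like the sketch below; the environment name and the offline data path are placeholders, and the exact offline-data options vary between Ray versions:

.. code-block:: python

    from ray.rllib.algorithms.marwil import MARWILConfig

    config = (
        MARWILConfig()
        .environment("CartPole-v1")                      # placeholder env for spaces and evaluation
        .offline_data(input_="/tmp/cartpole-episodes/")  # placeholder path to recorded episodes
        .training(beta=1.0)                              # beta=0.0 reduces the objective to plain BC
    )

    # With real episodes at the input_ path, building and training would read and
    # preprocess the offline data in parallel and update the model:
    # algo = config.build()
    # result = algo.train()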