
Commit 10aaf03

simonsays1980, dioptre
authored and committed
[RLlib] Add docs for Implicit Q-Learning. (ray-project#55422)
Signed-off-by: Andrew Grosser <[email protected]>
1 parent 045a0e5 commit 10aaf03

File tree: 1 file changed (+30, -2 lines)


doc/source/rllib/rllib-algorithms.rst

Lines changed: 30 additions & 2 deletions
@@ -39,6 +39,10 @@ as well as multi-GPU training on multi-node (GPU) clusters when using the `Anysc
 +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+
 | :ref:`BC (Behavior Cloning) <bc>` | |single_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| |discr_actions| |
 +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+
+| :ref:`CQL (Conservative Q-Learning) <cql>` | |single_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| |
++-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+
+| :ref:`IQL (Implicit Q-Learning) <iql>` | |single_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| |
++-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+
 | :ref:`MARWIL (Monotonic Advantage Re-Weighted Imitation Learning) <marwil>` | |single_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| |discr_actions| |
 +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+
 | **Algorithm Extensions and -Plugins** |
@@ -183,7 +187,7 @@ Asynchronous Proximal Policy Optimization (APPO)
 In a training iteration, APPO requests samples from all EnvRunners asynchronously and the collected episode
 samples are returned to the main algorithm process as Ray references rather than actual objects available on the local process.
 APPO then passes these episode references to the Learners for asynchronous updates of the model.
-RLlib doesn't always synch back the weights to the EnvRunners right after a new model version is available.
+RLlib doesn't always sync back the weights to the EnvRunners right after a new model version is available.
 To account for the EnvRunners being off-policy, APPO uses a procedure called v-trace,
 `described in the IMPALA paper <https://arxiv.org/abs/1802.01561>`__.
 APPO scales out on both axes, supporting multiple EnvRunners for sample collection and multiple GPU- or CPU-based Learners
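A note on the v-trace procedure that the context above references: the following is a minimal, self-contained NumPy sketch of the truncated importance-sampling value targets from the IMPALA paper, not RLlib's implementation. It assumes a single trajectory with no episode boundaries, and the function name and clipping defaults are illustrative only.

.. code-block:: python

    import numpy as np

    def vtrace_targets(rewards, values, bootstrap_value, rhos,
                       gamma=0.99, rho_bar=1.0, c_bar=1.0):
        """Compute v-trace value targets for one trajectory of length T.

        ``rhos`` are the importance ratios pi(a_t|s_t) / mu(a_t|s_t) between
        the learner policy pi and the (possibly stale) behavior policy mu
        that collected the samples on the EnvRunners.
        """
        rewards = np.asarray(rewards, dtype=float)
        values = np.asarray(values, dtype=float)
        rhos = np.asarray(rhos, dtype=float)

        clipped_rhos = np.minimum(rho_bar, rhos)  # truncated IS weights
        clipped_cs = np.minimum(c_bar, rhos)      # trace-cutting coefficients
        values_tp1 = np.append(values[1:], bootstrap_value)
        # Importance-weighted one-step TD errors.
        deltas = clipped_rhos * (rewards + gamma * values_tp1 - values)

        vs_minus_v = np.zeros(len(rewards))
        acc = 0.0
        # Backward recursion: (vs - V)_t = delta_t + gamma * c_t * (vs - V)_{t+1}
        for t in reversed(range(len(rewards))):
            acc = deltas[t] + gamma * clipped_cs[t] * acc
            vs_minus_v[t] = acc
        return values + vs_minus_v

The resulting targets replace plain discounted returns in the critic update, which is how APPO stays stable even when EnvRunner samples lag behind the latest model weights.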
@@ -363,6 +367,30 @@ Conservative Q-Learning (CQL)
 :members: training


+.. _iql:
+
+Implicit Q-Learning (IQL)
+-------------------------
+`[paper] <https://arxiv.org/abs/2110.06169>`__
+`[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/iql/iql.py>`__
+
+**IQL architecture:** IQL (Implicit Q-Learning) is an offline RL algorithm that never needs to evaluate actions outside of
+the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through
+generalization. Instead of standard TD-error minimization, it introduces a value function trained through expectile regression,
+which yields a conservative estimate of returns. This allows policy improvement through advantage-weighted behavior cloning,
+ensuring safer generalization without explicit exploration.
+
+The `IQLLearner` replaces the usual TD-based value loss with an expectile regression loss, and trains the policy to imitate
+high-advantage actions—enabling substantial performance gains over the behavior policy using only in-dataset actions.
+
+**Tuned examples:**
+`Pendulum-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/iql/pendulum_iql.py>`__
+
+**IQL-specific configs** (and :ref:`generic algorithm settings <rllib-algo-configuration-generic-settings>`):
+
+.. autoclass:: ray.rllib.algorithms.iql.iql.IQLConfig
+:members: training
+
 .. _marwil:

 Monotonic Advantage Re-Weighted Imitation Learning (MARWIL)
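To make the expectile regression and advantage-weighted behavior cloning described in the new section concrete, here is a minimal NumPy sketch of the two loss ingredients from the IQL paper. This is not the ``IQLLearner`` code; the function names are illustrative and the ``expectile``/``temperature`` defaults are typical paper-style values rather than RLlib's defaults.

.. code-block:: python

    import numpy as np

    def expectile_value_loss(q_values, v_values, expectile=0.7):
        """Expectile regression loss that fits V(s) toward Q(s, a).

        For expectile > 0.5, positive residuals (Q above V) are weighted more
        heavily than negative ones, so V approaches an upper expectile of Q
        over dataset actions only: an implicit max that never queries
        out-of-distribution actions.
        """
        u = np.asarray(q_values) - np.asarray(v_values)
        weight = np.where(u > 0.0, expectile, 1.0 - expectile)
        return np.mean(weight * u ** 2)

    def advantage_weights(q_values, v_values, temperature=3.0, clip=100.0):
        """Advantage-weighted behavior-cloning weights exp(temperature * A).

        The policy loss is then -mean(weights * log_pi_of_dataset_actions),
        so the policy imitates high-advantage dataset actions only.
        """
        advantages = np.asarray(q_values) - np.asarray(v_values)
        return np.minimum(np.exp(temperature * advantages), clip)

The Q-function itself is still trained with a standard TD target that bootstraps from the separately learned V, so no action sampled from the learned policy ever enters a Bellman backup.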
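For trying the new algorithm end to end, the following is a rough usage sketch that assumes ``IQLConfig`` follows the standard ``AlgorithmConfig`` builder pattern on a recent Ray master. The offline-data path and the training settings shown are placeholders; the linked Pendulum-v1 tuned example has the settings RLlib actually tests.

.. code-block:: python

    from ray.rllib.algorithms.iql.iql import IQLConfig

    config = (
        IQLConfig()
        .environment("Pendulum-v1")
        # Placeholder path to previously recorded offline episodes.
        .offline_data(input_="/path/to/offline/episodes")
        .training(train_batch_size_per_learner=1024)
    )

    # On older Ray versions this is config.build().
    algo = config.build_algo()
    for _ in range(5):
        print(algo.train())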
@@ -376,7 +404,7 @@ Monotonic Advantage Re-Weighted Imitation Learning (MARWIL)

 **MARWIL architecture:** MARWIL is a hybrid imitation learning and policy gradient algorithm suitable for training on
 batched historical data. When the ``beta`` hyperparameter is set to zero, the MARWIL objective reduces to plain
-imitation learning (see `BC`_). MARWIL uses Ray.Data to tap into its parallel data
+imitation learning (see `BC`_). MARWIL uses Ray Data to tap into its parallel data
 processing capabilities. In one training iteration, MARWIL reads episodes in parallel from offline files,
 for example `parquet <https://parquet.apache.org/>`__, by the n DataWorkers. Connector pipelines preprocess these
 episodes into train batches and send these as data iterators directly to the n Learners for updating the model.
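On the ``beta`` remark in the context above: a tiny sketch, not RLlib's MARWIL loss code, of how the advantage weighting collapses to plain behavior cloning when ``beta`` is zero.

.. code-block:: python

    import numpy as np

    def marwil_policy_loss(log_probs, advantages, beta):
        """Sketch of the MARWIL objective -E[exp(beta * A) * log pi(a|s)].

        With beta == 0 every weight is exp(0) = 1, so the loss reduces to the
        plain negative log-likelihood of the dataset actions, that is, BC.
        """
        weights = np.exp(beta * np.asarray(advantages))
        return -np.mean(weights * np.asarray(log_probs))

    # beta == 0.0 -> pure imitation learning (BC)
    # beta > 0.0  -> up-weights actions with positive advantage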
