[RLlib] Docs do-over (new API stack): Add scaling guide rst page. #49528

Merged
merged 28 commits on Jan 4, 2025

Commits
3b82172
wip
sven1977 Dec 27, 2024
ab5c579
Merge branch 'master' of https://github.com/ray-project/ray into docs…
sven1977 Dec 31, 2024
9741f40
wip
sven1977 Dec 31, 2024
1f32693
wip
sven1977 Jan 1, 2025
8cf5b79
example script working well (with planned CI options)
sven1977 Jan 1, 2025
ee499e2
Merge branch 'master' of https://github.com/ray-project/ray into docs…
sven1977 Jan 1, 2025
c483b3a
wip
sven1977 Jan 1, 2025
85534cd
wip
sven1977 Jan 1, 2025
aa13fb7
wip
sven1977 Jan 1, 2025
74db8f1
wip
sven1977 Jan 1, 2025
8245fc6
Merge branch 'cleanup_examples_folder_41_async_vector_env' into docs_…
sven1977 Jan 1, 2025
e706ede
k
sven1977 Jan 1, 2025
2714297
Merge branch 'cleanup_examples_folder_41_async_vector_env' into docs_…
sven1977 Jan 1, 2025
226b6c2
Merge branch 'master' of https://github.com/ray-project/ray into docs…
sven1977 Jan 1, 2025
730f514
wip
sven1977 Jan 1, 2025
8cba077
wip
sven1977 Jan 1, 2025
58cc407
wip
sven1977 Jan 1, 2025
8a26f20
merge
sven1977 Jan 3, 2025
5c45c55
fix
sven1977 Jan 3, 2025
3329af7
Merge branch 'master' of https://github.com/ray-project/ray into docs…
sven1977 Jan 4, 2025
38d986d
fix
sven1977 Jan 4, 2025
3c326f1
wip
sven1977 Jan 4, 2025
3efaa47
Apply suggestions from code review
sven1977 Jan 4, 2025
aa6a216
fix
sven1977 Jan 4, 2025
9a3cf27
Merge remote-tracking branch 'origin/docs_redo_scaling_guide' into do…
sven1977 Jan 4, 2025
add15c5
fix
sven1977 Jan 4, 2025
9f808da
Merge branch 'master' of https://github.com/ray-project/ray into docs…
sven1977 Jan 4, 2025
c22c792
wip
sven1977 Jan 4, 2025
2 changes: 2 additions & 0 deletions .vale/styles/config/vocabularies/RLlib/accept.txt
@@ -2,6 +2,7 @@
# Add new terms alphabetically.
[Aa]lgos?
(APPO|appo)
[Aa]utoscal(e|ing)
boolean
[Cc]allables?
coeff
@@ -28,3 +29,4 @@ rollout
SGD
[Tt]ensor[Ff]low
timesteps?
vectorizes?
6 changes: 1 addition & 5 deletions doc/BUILD
@@ -442,16 +442,12 @@ doctest(

doctest(
name = "doctest[rllib]",
size = "enormous",
size = "large",
files = glob(
include = [
"source/rllib/**/*.rst",
"source/rllib/**/*.md",
],
exclude = [
"source/rllib/rllib-env.rst",
"source/rllib/rllib-sample-collection.rst",
],
),
data = ["//rllib:cartpole-v1_large"],
tags = ["team:rllib"],
1 change: 0 additions & 1 deletion doc/source/rllib/images/rllib-config.svg

This file was deleted.

1 change: 1 addition & 0 deletions doc/source/rllib/images/scaling_axes_overview.svg
31 changes: 31 additions & 0 deletions doc/source/rllib/index.rst
@@ -14,6 +14,37 @@ RLlib: Industry-Grade, Scalable Reinforcement Learning

.. sphinx_rllib_readme_end

.. todo (sven): redo toctree:
suggestion:
getting-started (replaces rllib-training)
key-concepts
rllib-env (single-agent)
... <- multi-agent
... <- external
... <- hierarchical
algorithm-configs
rllib-algorithms (overview of all available algos)
dev-guide (replaces user-guides)
debugging
scaling-guide
fault-tolerance
checkpoints
callbacks
metrics-logger
rllib-advanced-api
algorithm (general description of how algos work)
rllib-rlmodule
rllib-offline
single-agent-episode
multi-agent-episode
connector-v2
rllib-learner
env-runners
rllib-examples
rllib-new-api-stack <- remove?
new-api-stack-migration-guide
package_ref/index

.. toctree::
:hidden:

2 changes: 1 addition & 1 deletion doc/source/rllib/key-concepts.rst
@@ -59,7 +59,7 @@ This way, you can scale up sample collection and training, respectively, from a

.. todo: Separate out our scaling guide into its own page in new PR

See this `scaling guide <rllib-training.html#scaling-guide>`__ for more details here.
See this :ref:`scaling guide <rllib-scaling-guide>` for more details here.

You have two ways to interact with and run an :py:class:`~ray.rllib.algorithms.algorithm.Algorithm`:

5 changes: 4 additions & 1 deletion doc/source/rllib/rllib-advanced-api.rst
@@ -318,8 +318,11 @@ evaluation (for example you have an environment that sometimes crashes or stalls
you should use the following combination of settings, minimizing the negative effects
of such environment behavior:

.. todo (sven): Add link here to new fault-tolerance page, once done.
:ref:`fault tolerance settings <rllib-fault-tolerance-docs>`, such as

Note that with or without parallel evaluation, all
:ref:`fault tolerance settings <rllib-scaling-guide>`, such as
fault tolerance settings, such as
``ignore_env_runner_failures`` or ``restart_failed_env_runners`` are respected and applied
to the failed evaluation workers.

2 changes: 1 addition & 1 deletion doc/source/rllib/rllib-env.rst
@@ -309,7 +309,7 @@ in combination.

.. tip::

See the `scaling guide <rllib-training.html#scaling-guide>`__ for more on RLlib training at scale.
See the :ref:`scaling guide <rllib-scaling-guide>` for more on RLlib training at scale.


Expensive Environments
31 changes: 0 additions & 31 deletions doc/source/rllib/rllib-training.rst
@@ -183,37 +183,6 @@ an algorithm.
`custom model classes <rllib-models.html>`__.


.. _rllib-scaling-guide:

RLlib Scaling Guide
-------------------

Here are some rules of thumb for scaling training with RLlib.

1. If the environment is slow and cannot be replicated (e.g., since it requires interaction with physical systems), then you should use a sample-efficient off-policy algorithm such as :ref:`DQN <dqn>` or :ref:`SAC <sac>`. These algorithms default to ``num_env_runners: 0`` for single-process operation. Make sure to set ``num_gpus: 1`` if you want to use a GPU. Consider also batch RL training with the `offline data <rllib-offline.html>`__ API.

2. If the environment is fast and the model is small (most models for RL are), use time-efficient algorithms such as :ref:`PPO <ppo>`, or :ref:`IMPALA <impala>`.
These can be scaled by increasing ``num_env_runners`` to add rollout workers. It may also make sense to enable `vectorization <rllib-env.html#vectorized>`__ for
inference. Make sure to set ``num_gpus: 1`` if you want to use a GPU. If the learner becomes a bottleneck, you can use multiple GPUs for learning by setting
``num_gpus > 1``.

3. If the model is compute intensive (e.g., a large deep residual network) and inference is the bottleneck, consider allocating GPUs to workers by setting ``num_gpus_per_env_runner: 1``. If you only have a single GPU, consider ``num_env_runners: 0`` to use the learner GPU for inference. For efficient use of GPU time, use a small number of GPU workers and a large number of `envs per worker <rllib-env.html#vectorized>`__.

4. Finally, if both model and environment are compute intensive, then enable `remote worker envs <rllib-env.html#vectorized>`__ with `async batching <rllib-env.html#vectorized>`__ by setting ``remote_worker_envs: True`` and optionally ``remote_env_batch_wait_ms``. This batches inference on GPUs in the rollout workers while letting envs run asynchronously in separate actors, similar to the `SEED <https://ai.googleblog.com/2020/03/massively-scaling-reinforcement.html>`__ architecture. The number of workers and number of envs per worker should be tuned to maximize GPU utilization.

In case you are using lots of workers (``num_env_runners >> 10``) and you observe worker failures for whatever reasons, which normally interrupt your RLlib training runs, consider using
the config settings ``ignore_env_runner_failures=True``, ``restart_failed_env_runners=True``, or ``restart_failed_sub_environments=True``:

``restart_failed_env_runners``: When set to True (default), your Algorithm will attempt to restart any failed EnvRunner and replace it with a newly created one. This way, your number of workers will never decrease, even if some of them fail from time to time.
``ignore_env_runner_failures``: When set to True, your Algorithm will not crash due to an EnvRunner error, but continue for as long as there is at least one functional worker remaining. This setting is ignored when ``restart_failed_env_runners=True``.
``restart_failed_sub_environments``: When set to True and there is a failure in one of the vectorized sub-environments in one of your EnvRunners, RLlib tries to recreate only the failed sub-environment and re-integrate the newly created one into your vectorized env stack on that EnvRunner.

Note that only one of ``ignore_env_runner_failures`` or ``restart_failed_env_runners`` should be set to True (they are mutually exclusive settings). However,
you can combine each of these with the ``restart_failed_sub_environments=True`` setting.
Using these options will make your training runs much more stable and more robust against occasional OOM or other similar "once in a while" errors on the EnvRunners
themselves or inside your custom environments.


Debugging RLlib Experiments
---------------------------

194 changes: 194 additions & 0 deletions doc/source/rllib/scaling-guide.rst
@@ -0,0 +1,194 @@
.. include:: /_includes/rllib/we_are_hiring.rst

.. include:: /_includes/rllib/new_api_stack.rst

.. _rllib-scaling-guide:

RLlib scaling guide
===================

RLlib is a distributed and scalable RL library, based on `Ray <https://www.ray.io/>`__. An RLlib :py:class:`~ray.rllib.algorithms.algorithm.Algorithm`
uses `Ray actors <https://docs.ray.io/en/latest/ray-core/actors.html>`__ wherever parallelization of
its sub-components can speed up sample and learning throughput.

.. figure:: images/scaling_axes_overview.svg
:width: 600
:align: left

**Scalable axes in RLlib**: Three scaling axes are available across all RLlib :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` classes:

- The number of :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors in the :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup`,
settable through ``config.env_runners(num_env_runners=n)``.

- The number of vectorized sub-environments on each
:py:class:`~ray.rllib.env.env_runner.EnvRunner` actor, settable through ``config.env_runners(num_envs_per_env_runner=p)``.

- The number of :py:class:`~ray.rllib.core.learner.learner.Learner` actors in the
:py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup`, settable through ``config.learners(num_learners=m)``.
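
To see where each setting lives, here's a minimal sketch (example values only, not recommendations) that touches all three axes in one config:

.. code-block:: python

from ray.rllib.algorithms.ppo import PPOConfig

config = (
PPOConfig()
.environment("CartPole-v1")
# Axis 1: number of remote EnvRunner actors for sampling.
.env_runners(
num_env_runners=4,
# Axis 2: number of vectorized sub-environments per EnvRunner.
num_envs_per_env_runner=10,
)
# Axis 3: number of remote Learner actors for training.
.learners(num_learners=2)
)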


Scaling the number of EnvRunner actors
--------------------------------------

You can control the degree of parallelism for the sampling machinery of the
:py:class:`~ray.rllib.algorithms.algorithm.Algorithm` by increasing the number of remote
:py:class:`~ray.rllib.env.env_runner.EnvRunner` actors in the :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup`
through the config as follows.

.. testcode::

from ray.rllib.algorithms.ppo import PPOConfig

config = (
PPOConfig()
# Use 4 EnvRunner actors (default is 2).
.env_runners(num_env_runners=4)
)

To assign resources to each :py:class:`~ray.rllib.env.env_runner.EnvRunner`, use these config settings:

.. code-block:: python

config.env_runners(
num_cpus_per_env_runner=..,
num_gpus_per_env_runner=..,
)

See this
`example of an EnvRunner and RL environment requiring a GPU resource <https://github.com/ray-project/ray/blob/master/rllib/examples/gpus/gpus_on_env_runners.py>`__.

The number of GPUs may be fractional quantities, for example 0.5, to allocate only a fraction of a GPU per
:py:class:`~ray.rllib.env.env_runner.EnvRunner`.
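
For example, a sketch with illustrative resource amounts (the values here are assumptions, not recommendations):

.. code-block:: python

config.env_runners(
num_cpus_per_env_runner=1,    # example: 1 CPU per EnvRunner actor
num_gpus_per_env_runner=0.5,  # example: half a GPU per EnvRunner actor
)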

Note that there's always one "local" :py:class:`~ray.rllib.env.env_runner.EnvRunner` in the
:py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup`.
If you only want to sample using this local :py:class:`~ray.rllib.env.env_runner.EnvRunner`,
set ``num_env_runners=0``. This local :py:class:`~ray.rllib.env.env_runner.EnvRunner` directly sits in the main
:py:class:`~ray.rllib.algorithms.algorithm.Algorithm` process.
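
For example, a minimal sketch of a purely local setup, such as for debugging:

.. code-block:: python

# Sample only on the local EnvRunner inside the main Algorithm process.
config.env_runners(num_env_runners=0)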

.. hint::
The Ray team may decide to deprecate the local :py:class:`~ray.rllib.env.env_runner.EnvRunner` at some point in the future.
It still exists for historical reasons, and whether it's useful to keep is still under debate.


Scaling the number of envs per EnvRunner actor
----------------------------------------------

RLlib vectorizes :ref:`RL environments <rllib-key-concepts-environments>` on :py:class:`~ray.rllib.env.env_runner.EnvRunner`
actors through gymnasium's `VectorEnv <https://gymnasium.farama.org/api/vector/>`__ API.
To create more than one environment copy per :py:class:`~ray.rllib.env.env_runner.EnvRunner`, set the following in your config:

.. testcode::

from ray.rllib.algorithms.ppo import PPOConfig

config = (
PPOConfig()
# Use 10 sub-environments (vector) per EnvRunner.
.env_runners(num_envs_per_env_runner=10)
)

.. note::
Unlike single-agent environments, RLlib can't vectorize multi-agent setups yet.
The Ray team is working on a solution to this restriction by utilizing the custom
vectorization feature of `gymnasium >= 1.x`.

Doing so allows the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` on the
:py:class:`~ray.rllib.env.env_runner.EnvRunner` to run inference on a batch of data and
thus compute actions for all sub-environments in parallel.

By default, the individual sub-environments in a vector ``step`` and ``reset`` in sequence; only
the action computation of the RL environment loop happens in parallel, because observations can move through the model
in a batch.
However, `gymnasium <https://gymnasium.farama.org/>`__ supports an asynchronous
vectorization setting, in which each sub-environment receives its own Python process.
This way, the vector environment can ``step`` or ``reset`` in parallel. Activate
this asynchronous vectorization behavior through:

.. testcode::

import gymnasium as gym

config.env_runners(
gym_env_vectorize_mode=gym.envs.registration.VectorizeMode.ASYNC, # default is `SYNC`
)

This setting can speed up the sampling process significantly in combination with ``num_envs_per_env_runner > 1``,
especially when your RL environment's stepping process is time consuming.

See this `example script <https://github.com/ray-project/ray/blob/master/rllib/examples/envs/async_gym_env_vectorization.py>`__ that demonstrates a massive speedup with async vectorization.
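
Putting the two settings together, a sketch (example values) of an EnvRunner configuration that both vectorizes and steps its sub-environments asynchronously:

.. code-block:: python

import gymnasium as gym

config.env_runners(
num_envs_per_env_runner=10,  # example: 10 sub-environments per EnvRunner
gym_env_vectorize_mode=gym.envs.registration.VectorizeMode.ASYNC,
)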


Scaling the number of Learner actors
------------------------------------

Learning updates happen in the :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup`, which manages either a single,
local :py:class:`~ray.rllib.core.learner.learner.Learner` instance or any number of remote
:py:class:`~ray.rllib.core.learner.learner.Learner` actors.

Set the number of remote :py:class:`~ray.rllib.core.learner.learner.Learner` actors through:

.. testcode::

from ray.rllib.algorithms.ppo import PPOConfig

config = (
PPOConfig()
# Use 2 remote Learner actors (default is 0) for distributed data parallelism.
# Choosing 0 creates a local Learner instance on the main Algorithm process.
.learners(num_learners=2)
)

Typically, you use as many :py:class:`~ray.rllib.core.learner.learner.Learner` actors as you have GPUs available for training.
Make sure to set the number of GPUs per :py:class:`~ray.rllib.core.learner.learner.Learner` to 1:

.. testcode::

config.learners(num_gpus_per_learner=1)

.. warning::
For some algorithms, such as IMPALA and APPO, the performance of a single remote
:py:class:`~ray.rllib.core.learner.learner.Learner` actor (``num_learners=1``) compared to a
single local :py:class:`~ray.rllib.core.learner.learner.Learner` instance (``num_learners=0``)
depends on whether you have a GPU available or not.
If exactly one GPU is available, run these two algorithms with ``num_learners=0, num_gpus_per_learner=1``;
if no GPU is available, set ``num_learners=1, num_gpus_per_learner=0``. If more than one GPU is available,
set ``num_learners=.., num_gpus_per_learner=1``.
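
As a sketch only, this rule of thumb could be automated; detecting the GPU count through PyTorch is an assumption, not an RLlib API:

.. code-block:: python

import torch  # assumption: PyTorch backend is installed

num_gpus = torch.cuda.device_count()
if num_gpus == 0:
    # No GPU: use one remote CPU-based Learner.
    config.learners(num_learners=1, num_gpus_per_learner=0)
elif num_gpus == 1:
    # Exactly one GPU: use the local Learner with that GPU.
    config.learners(num_learners=0, num_gpus_per_learner=1)
else:
    # More than one GPU: one remote Learner per GPU.
    config.learners(num_learners=num_gpus, num_gpus_per_learner=1)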

The number of GPUs may be fractional quantities, for example 0.5, to allocate only a fraction of a GPU per
:py:class:`~ray.rllib.core.learner.learner.Learner`. For example, you can pack five :py:class:`~ray.rllib.algorithms.algorithm.Algorithm`
instances onto one GPU by setting ``num_learners=1, num_gpus_per_learner=0.2``.
See this `fractional GPU example <https://github.com/ray-project/ray/blob/master/rllib/examples/gpus/fractional_gpus.py>`__
for details.
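
A sketch of that packing example, as seen from each individual Algorithm instance:

.. code-block:: python

# Five Algorithms, each configured like this, fit on a single GPU (5 x 0.2 = 1.0).
config.learners(num_learners=1, num_gpus_per_learner=0.2)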

.. note::
If you specify ``num_gpus_per_learner > 0`` and your machine doesn't have the required number of GPUs
available, the experiment may stall until the Ray autoscaler brings up enough machines to fulfill the resource request.
If your cluster has autoscaling turned off, this setting then results in a seemingly hanging experiment run.

On the other hand, if you set ``num_gpus_per_learner=0``, RLlib builds the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`
instances solely on CPUs, even if GPUs are available on the cluster.
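
For example, a CPU-only sketch (example values) that keeps all RLModule copies on CPUs:

.. code-block:: python

# No GPUs requested; RLlib builds the RLModules on CPUs.
config.learners(num_learners=2, num_gpus_per_learner=0)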


Outlook: More RLlib elements that should scale
----------------------------------------------

There are other components and aspects in RLlib that should be able to scale up.

For example, the model size is limited to whatever fits on a single GPU, due to
"distributed data parallel" (DDP) being the only way in which RLlib scales :py:class:`~ray.rllib.core.learner.learner.Learner`
actors.

The Ray team is working on closing these gaps. In particular, future areas of improvement are:

- Enable **training very large models**, such as a "large language model" (LLM). The team is actively working on a
"Reinforcement Learning from Human Feedback" (RLHF) prototype setup. The main problems to solve are the
model-parallel and tensor-parallel distribution across multiple GPUs, as well as a reasonably fast transfer of
weights between Ray actors.

- Enable training with **thousands of multi-agent policies**. A possible solution for this scaling problem
could be to split up the :py:class:`~ray.rllib.core.rl_module.multi_rl_module.MultiRLModule` into
manageable groups of individual policies across the various :py:class:`~ray.rllib.env.env_runner.EnvRunner`
and :py:class:`~ray.rllib.core.learner.learner.Learner` actors.

- Enable **vector envs for multi-agent**.
9 changes: 9 additions & 0 deletions doc/source/rllib/user-guides.rst
@@ -23,6 +23,7 @@ User Guides
rllib-torch2x
rllib-fault-tolerance
rllib-dev
scaling-guide


.. _rllib-feature-guide:
@@ -97,3 +98,11 @@ RLlib Feature Guides
.. button-ref:: rllib-dev

How to contribute to RLlib?

.. grid-item-card::
:img-top: /rllib/images/rllib-logo.svg
:class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img

.. button-ref:: scaling-guide

How to run RLlib experiments at scale