Update ray.rllib to 2.5 #2067

Merged · 17 commits · Jun 22, 2023
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -10,9 +10,17 @@ Copying and pasting the git commit messages is __NOT__ enough.

## [Unreleased]
### Added
- Added `rllib/pg_example.py` to demonstrate a simple integration with `RLlib` and `tensorflow` for policy training.
- Added `rllib/pg_pbt_example.py` to demonstrate integration with `ray.RLlib`, `tensorflow`, and `ray.tune` for scheduled policy training.
### Changed
- Updated `smarts[ray]` (`ray==2.2`) and `smarts[rllib]` (`ray[rllib]==1.4`) to use `ray~=2.5`.
- Introduced `tensorflow-probability` to `smarts[rllib]`.
- Updated `RLlibHiWayEnv` to use the `gymnasium` interface.
- Renamed `rllib/rllib.py` to `rllib/pg_pbt_example.py`.
- Loosened constraint of `gymnasium` from `==0.27.0` to `>=0.26.3`.
### Deprecated
### Fixed
- The previously missing neighborhood vehicle state `'lane_id'` is now included in the `hiway-v1` formatted observations.
- Fixed a regression where `pybullet` build-time messages reappeared.
### Removed
### Security
2 changes: 2 additions & 0 deletions README.md
@@ -44,6 +44,8 @@ Several agent control policies and agent [action types](smarts/core/controllers/
### RL Model
1. [Drive](examples/rl/drive). See [Driving SMARTS 2023.1 & 2023.2](https://smarts.readthedocs.io/en/latest/benchmarks/driving_smarts_2023_1.html) for more info.
1. [VehicleFollowing](examples/rl/platoon). See [Driving SMARTS 2023.3](https://smarts.readthedocs.io/en/latest/benchmarks/driving_smarts_2023_3.html) for more info.
1. [PG](examples/rl/rllib/pg_example.py). See [RLlib](https://smarts.readthedocs.io/en/latest/docs/ecosystem/rllib.html) for more info.
1. [PG Population Based Training](examples/rl/rllib/pg_pbt_example.py). See [RLlib](https://smarts.readthedocs.io/en/latest/docs/ecosystem/rllib.html) for more info.

### RL Environment
1. [ULTRA](https://github.com/smarts-project/smarts-project.rl/blob/master/ultra) provides a gym-based environment built upon SMARTS to tackle intersection navigation, specifically the unprotected left turn.
11 changes: 9 additions & 2 deletions docs/ecosystem/rllib.rst
@@ -8,6 +8,13 @@ RLlib
of applications. ``RLlib`` natively supports ``TensorFlow``, ``TensorFlow Eager``, and ``PyTorch``. Most of its internals are agnostic to such
deep learning frameworks.

SMARTS contains two examples using `Policy Gradients (PG) <https://docs.ray.io/en/latest/rllib-algorithms.html#policy-gradients-pg>`_.

1. ``rllib/pg_example.py``
   This example shows the basics of using RLlib with SMARTS through :class:`~smarts.env.rllib_hiway_env.RLlibHiWayEnv`.
2. ``rllib/pg_pbt_example.py``
   This example combines Policy Gradients with `Population Based Training (PBT) <https://docs.ray.io/en/latest/tune/api/doc/ray.tune.schedulers.PopulationBasedTraining.html>`_ scheduling, sketched below.
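The exact search space and scheduler settings used by ``rllib/pg_pbt_example.py`` are not reproduced here; the following is only a minimal, illustrative sketch of wiring a PBT scheduler into a Ray Tune run. The mutation values, ``num_samples``, stopping criterion, and the mostly empty ``env_config`` are assumptions for illustration.

.. code-block:: python

   from ray import tune
   from ray.air import RunConfig
   from ray.tune.schedulers import PopulationBasedTraining

   from smarts.env.rllib_hiway_env import RLlibHiWayEnv

   # Illustrative scheduler: every 300 s of training time, a trial may clone a
   # better-performing trial and perturb these hyperparameters.
   pbt = PopulationBasedTraining(
       time_attr="time_total_s",
       perturbation_interval=300,
       hyperparam_mutations={
           "lr": [1e-3, 5e-4, 1e-4],
           "train_batch_size": [1000, 2000, 4000],
       },
   )

   tuner = tune.Tuner(
       "PG",
       param_space={
           "env": RLlibHiWayEnv,
           "env_config": {},  # scenarios, agent_specs, etc. go here (placeholder)
           "lr": 1e-3,
           "train_batch_size": 2000,
       },
       tune_config=tune.TuneConfig(
           scheduler=pbt,
           num_samples=4,  # population size
           metric="episode_reward_mean",
           mode="max",
       ),
       run_config=RunConfig(stop={"time_total_s": 3600}),
   )
   results = tuner.fit()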

Recommended reads
-----------------

@@ -28,7 +35,7 @@ many docs about ``Ray`` and ``RLlib``. We recommend reading the following pages
Resume training
---------------

With respect to ``SMARTS/examples/rl/rllib`` example, if you want to continue an aborted experiment, you can set ``resume=True`` in ``tune.run``. But note that ``resume=True`` will continue to use the same configuration as was set in the original experiment.
For the ``SMARTS/examples/rl/rllib`` examples, if you want to continue an aborted experiment, you can set ``resume_training=True``. Note that ``resume_training=True`` will continue to use the same configuration that was set in the original experiment.
To make changes to a started experiment, you can edit the latest experiment file in ``./results``.

Or if you want to start a new experiment but train from an existing checkpoint, you can set ``restore=checkpoint_path`` in ``tune.run``.
Or if you want to start a new experiment but train from an existing checkpoint, you will need to look into `How to Save and Load Trial Checkpoints <https://docs.ray.io/en/latest/tune/tutorials/tune-trial-checkpoints>`_.
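For orientation only, a minimal sketch of restoring from a checkpoint in the style of ``pg_example.py`` from this PR is shown below; ``algo_config`` is assumed to be an ``AlgorithmConfig`` like the one that script builds, and the checkpoint path is a placeholder.

.. code-block:: python

   # Rebuild the algorithm from its config, then load a previously saved checkpoint.
   algo = algo_config.build()
   algo.load_checkpoint("./results/pg_results/checkpoint_3/checkpoint-3")  # placeholder path
   result = algo.train()  # training continues from the restored weights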
9 changes: 6 additions & 3 deletions docs/sim/env.rst
@@ -9,19 +9,20 @@ Base environments
The SMARTS environment module is defined in the :mod:`~smarts.env` package. Currently, SMARTS provides two kinds of training
environments, namely:

+ ``HiWayEnv`` utilizing ``gym.env`` style interface
+ ``HiWayEnv`` utilizing a ``gymnasium.Env`` interface
+ ``RLlibHiwayEnv`` customized for `RLlib <https://docs.ray.io/en/latest/rllib/index.html>`_ training

.. image:: ../_static/env.png

HiWayEnv
^^^^^^^^

``HiWayEnv`` inherits class ``gym.Env`` and supports gym APIs like ``reset``, ``step``, ``close``. An usage example is shown below.
``HiWayEnv`` inherits the ``gymnasium.Env`` class and supports gym APIs like ``reset``, ``step``, and ``close``. A usage example is shown below.
Refer to :class:`~smarts.env.hiway_env.HiWayEnv` for more details.

.. code-block:: python

import gymnasium as gym
# Make env
env = gym.make(
"smarts.env:hiway-v0", # Env entry name.
@@ -53,6 +54,7 @@ exactly matches the `env.observation_space`, and `ObservationOptions.multi_agent`

.. code-block:: python

import gymnasium as gym
# Make env
env = gym.make(
"smarts.env:hiway-v1", # Env entry name.
@@ -81,6 +83,7 @@ This can be done with :class:`~smarts.env.gymnasium.wrappers.api_reversion.Api02

.. code-block:: python

import gymnasium as gym
# Make env
env = gym.make(
"smarts.env:hiway-v1", # Env entry name.
@@ -91,7 +94,7 @@ This can be done with :class:`~smarts.env.gymnasium.wrappers.api_reversion.Api02
RLlibHiwayEnv
^^^^^^^^^^^^^

``RLlibHiwayEnv`` inherits class ``MultiAgentEnv``, which is defined in `RLlib <https://docs.ray.io/en/latest/rllib/index.html>`_. It also supports common env APIs like ``reset``,
``RLlibHiwayEnv`` inherits class ``MultiAgentEnv``, which is defined in `RLlib <https://docs.ray.io/en/latest/rllib/index.html>`_. It also supports common environment APIs like ``reset``,
``step``, and ``close``. A usage example is shown below. Refer to :class:`~smarts.env.rllib_hiway_env.RLlibHiWayEnv` for more details.

(The ``RLlibHiwayEnv`` usage code block is truncated in this diff view.)
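Since that snippet is truncated, here is a rough, non-authoritative sketch of constructing the environment directly. It assumes the same ``env_config`` keys that ``pg_example.py`` in this PR passes (``scenarios``, ``agent_specs``, ``headless``, ``seed``) and a gymnasium-style ``reset`` signature; the scenario path and ``agent_specs`` are placeholders.

.. code-block:: python

   from smarts.env.rllib_hiway_env import RLlibHiWayEnv

   # `agent_specs` is assumed to be a dict of AgentSpec objects keyed by agent id,
   # e.g. as built by the example's `rllib_agent.py` (not shown in this diff).
   env = RLlibHiWayEnv(
       {
           "scenarios": ["scenarios/sumo/loop"],  # placeholder scenario path
           "agent_specs": agent_specs,
           "headless": True,
           "seed": 42,
       }
   )
   observations, infos = env.reset()  # gymnasium-style reset, per this PR
   env.close()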
69 changes: 69 additions & 0 deletions examples/rl/rllib/configs.py
@@ -0,0 +1,69 @@
import argparse
import multiprocessing
from pathlib import Path


def gen_parser(prog: str, default_result_dir: str) -> argparse.ArgumentParser:
    """Generate the command line argument parser shared by the RLlib examples."""
    parser = argparse.ArgumentParser(prog)
    parser.add_argument(
        "scenarios",
        help="A list of scenarios. Each element can be either the scenario to "
        "run or a directory of scenarios to sample from. See `scenarios/` "
        "folder for some samples you can use.",
        type=str,
        nargs="*",
    )
    parser.add_argument(
        "--envision",
        action="store_true",
        help="Run simulation with Envision display.",
    )
    parser.add_argument(
        "--train_batch_size",
        type=int,
        default=2000,
        help="The training batch size. This value must be > 0.",
    )
    parser.add_argument(
        "--time_total_s",
        type=int,
        default=1 * 60 * 60,  # 1 hour
        help="Total time in seconds to run the simulation for. This is a rough end time as it will be checked per training batch.",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=42,
        help="The base random seed to use, intended to be mixed with --num_samples",
    )
    parser.add_argument(
        "--num_agents", type=int, default=2, help="Number of agents (one per policy)"
    )
    parser.add_argument(
        "--num_workers",
        type=int,
        default=(multiprocessing.cpu_count() // 2 + 1),
        help="Number of workers (defaults to use all system cores)",
    )
    parser.add_argument(
        "--resume_training",
        default=False,
        action="store_true",
        help="Resume an errored or 'ctrl+c' cancelled training. This does not extend a fully run original experiment.",
    )
    parser.add_argument(
        "--result_dir",
        type=str,
        default=default_result_dir,
        help="Directory containing results",
    )
    parser.add_argument(
        "--log_level",
        type=str,
        default="ERROR",
        help="Log level (DEBUG|INFO|WARN|ERROR)",
    )
    parser.add_argument(
        "--checkpoint_freq", type=int, default=3, help="Checkpoint frequency"
    )
    return parser
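For reference, the example scripts below consume this shared parser roughly as follows. This is a condensed sketch of what `pg_example.py` does, not an additional file in this PR.

```python
from pathlib import Path

from configs import gen_parser

# Build the shared parser, then extend it with script-specific flags.
default_result_dir = str(Path(__file__).resolve().parent / "results" / "pg_results")
parser = gen_parser("rllib-example", default_result_dir)
parser.add_argument("--checkpoint_num", type=int, default=None)
args = parser.parse_args()
```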
2 changes: 1 addition & 1 deletion examples/rl/rllib/model/README.md
@@ -1,3 +1,3 @@
## Model Binaries

The binaries located in this directory are the components of a trained rllib model. These are related to the `examples/rl/rllib/rllib.py` example script. Results from `examples/rl/rllib/rllib.py` are loaded and written to this directory.
The binaries located in this directory are the components of a trained rllib model. These are related to the `examples/rl/rllib/pg_pbt_example.py` example script. Results from `examples/rl/rllib/pg_pbt_example.py` are loaded and written to this directory.
Binary file removed examples/rl/rllib/model/saved_model.pb
Binary file removed examples/rl/rllib/model/variables/variables.index
218 changes: 218 additions & 0 deletions examples/rl/rllib/pg_example.py
@@ -0,0 +1,218 @@
from pathlib import Path
from pprint import pprint as print
from typing import Dict, Literal, Optional, Union

import numpy as np

try:
    from ray.rllib.algorithms.algorithm import Algorithm, AlgorithmConfig
    from ray.rllib.algorithms.callbacks import DefaultCallbacks
    from ray.rllib.algorithms.pg import PGConfig
    from ray.rllib.env.base_env import BaseEnv
    from ray.rllib.evaluation.episode import Episode
    from ray.rllib.evaluation.episode_v2 import EpisodeV2
    from ray.rllib.evaluation.rollout_worker import RolloutWorker
    from ray.rllib.policy.policy import Policy
    from ray.rllib.utils.typing import PolicyID
except Exception as e:
    from smarts.core.utils.custom_exceptions import RayException

    raise RayException.required_to("pg_example.py") from e

import smarts
from smarts.env.rllib_hiway_env import RLlibHiWayEnv
from smarts.sstudio.scenario_construction import build_scenarios

if __name__ == "__main__":
    from configs import gen_parser
    from rllib_agent import TrainingModel, rllib_agent
else:
    from .configs import gen_parser
    from .rllib_agent import TrainingModel, rllib_agent


# Add custom metrics to your tensorboard using these callbacks
# See: https://ray.readthedocs.io/en/latest/rllib-training.html#callbacks-and-custom-metrics
class Callbacks(DefaultCallbacks):
    @staticmethod
    def on_episode_start(
        worker: RolloutWorker,
        base_env: BaseEnv,
        policies: Dict[PolicyID, Policy],
        episode: Union[Episode, EpisodeV2],
        env_index: int,
        **kwargs,
    ):
        episode.user_data["ego_reward"] = []

    @staticmethod
    def on_episode_step(
        worker: RolloutWorker,
        base_env: BaseEnv,
        episode: Union[Episode, EpisodeV2],
        env_index: int,
        **kwargs,
    ):
        single_agent_id = list(episode.get_agents())[0]
        infos = episode._last_infos.get(single_agent_id)
        if infos is not None:
            episode.user_data["ego_reward"].append(infos["reward"])

    @staticmethod
    def on_episode_end(
        worker: RolloutWorker,
        base_env: BaseEnv,
        policies: Dict[PolicyID, Policy],
        episode: Union[Episode, EpisodeV2],
        env_index: int,
        **kwargs,
    ):
        mean_ego_reward = np.mean(episode.user_data["ego_reward"])
        print(
            f"ep. {episode.episode_id:<12} ended;"
            f" length={episode.length:<6}"
            f" mean_ego_reward={mean_ego_reward:.2f}"
        )
        episode.custom_metrics["mean_ego_reward"] = mean_ego_reward


def main(
    scenarios,
    envision,
    time_total_s,
    rollout_fragment_length,
    train_batch_size,
    seed,
    num_agents,
    num_workers,
    resume_training,
    result_dir,
    checkpoint_freq: int,
    checkpoint_num: Optional[int],
    log_level: Literal["DEBUG", "INFO", "WARN", "ERROR"],
):
    # One policy per agent; each agent uses the observation/action spaces and
    # custom model defined in `rllib_agent.py`.
    rllib_policies = {
        f"AGENT-{i}": (
            None,
            rllib_agent["observation_space"],
            rllib_agent["action_space"],
            {"model": {"custom_model": TrainingModel.NAME}},
        )
        for i in range(num_agents)
    }
    agent_specs = {f"AGENT-{i}": rllib_agent["agent_spec"] for i in range(num_agents)}

    smarts.core.seed(seed)
    assert len(set(rllib_policies.keys()).difference(agent_specs)) == 0
    algo_config: AlgorithmConfig = (
        PGConfig()
        .environment(
            env=RLlibHiWayEnv,
            env_config={
                "seed": seed,
                "scenarios": [
                    str(Path(scenario).expanduser().resolve().absolute())
                    for scenario in scenarios
                ],
                "headless": not envision,
                "agent_specs": agent_specs,
                "observation_options": "multi_agent",
            },
            disable_env_checking=True,
        )
        .framework(framework="tf2", eager_tracing=True)
        .rollouts(
            rollout_fragment_length=rollout_fragment_length,
            num_rollout_workers=num_workers,
            num_envs_per_worker=1,
            enable_tf1_exec_eagerly=True,
        )
        .training(
            lr_schedule=[(0, 1e-3), (1e3, 5e-4), (1e5, 1e-4), (1e7, 5e-5), (1e8, 1e-5)],
            train_batch_size=train_batch_size,
        )
        .multi_agent(
            policies=rllib_policies,
            policy_mapping_fn=lambda agent_id, episode, worker, **kwargs: f"{agent_id}",
        )
        .callbacks(callbacks_class=Callbacks)
        .debugging(log_level=log_level)
    )

    def get_checkpoint_dir(num):
        checkpoint_dir = Path(result_dir) / f"checkpoint_{num}" / f"checkpoint-{num}"
        checkpoint_dir.mkdir(parents=True, exist_ok=True)
        return checkpoint_dir

    # Resume from a specific checkpoint if requested, otherwise from the latest one.
    if resume_training:
        checkpoint = str(get_checkpoint_dir("latest"))
        if checkpoint_num:
            checkpoint = str(get_checkpoint_dir(checkpoint_num))
    else:
        checkpoint = None

    print(f"======= Checkpointing at {str(result_dir)} =======")

    algo = algo_config.build()
    if checkpoint is not None:
        algo.load_checkpoint(checkpoint=checkpoint)
    result = {}
    current_iteration = 0
    checkpoint_iteration = checkpoint_num or 0

    try:
        while result.get("time_total_s", 0) < time_total_s:
            result = algo.train()
            print(f"======== Iteration {result['training_iteration']} ========")
            print(result, depth=1)

            if current_iteration % checkpoint_freq == 0:
                checkpoint_dir = get_checkpoint_dir(checkpoint_iteration)
                print(f"======= Saving checkpoint {checkpoint_iteration} =======")
                algo.save_checkpoint(checkpoint_dir)
                checkpoint_iteration += 1
            current_iteration += 1
        algo.save_checkpoint(get_checkpoint_dir(checkpoint_iteration))
    finally:
        algo.save_checkpoint(get_checkpoint_dir("latest"))
        algo.stop()


if __name__ == "__main__":
    default_result_dir = str(Path(__file__).resolve().parent / "results" / "pg_results")
    parser = gen_parser("rllib-example", default_result_dir)
    parser.add_argument(
        "--checkpoint_num",
        type=int,
        default=None,
        help="The checkpoint number to restart from.",
    )
    parser.add_argument(
        "--rollout_fragment_length",
        type=str,
        default="auto",
        help="Episodes are divided into fragments of this many steps for each rollout. In this example this will be ensured to be `1=<rollout_fragment_length<=train_batch_size`",
    )
    args = parser.parse_args()
    if not args.scenarios:
        args.scenarios = [
            str(Path(__file__).absolute().parents[3] / "scenarios" / "sumo" / "loop"),
        ]
    build_scenarios(scenarios=args.scenarios, clean=False, seed=args.seed)

    main(
        scenarios=args.scenarios,
        envision=args.envision,
        time_total_s=args.time_total_s,
        rollout_fragment_length=args.rollout_fragment_length,
        train_batch_size=args.train_batch_size,
        seed=args.seed,
        num_agents=args.num_agents,
        num_workers=args.num_workers,
        resume_training=args.resume_training,
        result_dir=args.result_dir,
        checkpoint_freq=max(args.checkpoint_freq, 1),
        checkpoint_num=args.checkpoint_num,
        log_level=args.log_level,
    )
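Usage note: running `python examples/rl/rllib/pg_example.py` with no positional arguments falls back to the bundled `scenarios/sumo/loop` scenario (built via `build_scenarios` above); scenario paths and the flags defined in `configs.py` (for example `--envision`, `--train_batch_size`, `--resume_training`) can be supplied on the command line. This is inferred from the argument parser above rather than from separate run instructions in this PR.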