Update ray.rllib to 2.5 #2067
Conversation
examples/rl/rllib/rllib.py (outdated)
@Adaickalavan @saulfield If possible, I would like some input on the approaches I tried here.
examples/rl/rllib/rllib.py (outdated)
## Approach 2
from pprint import pprint
from ray.rllib.algorithms.algorithm import Algorithm

algo = algo_config.build()
if checkpoint is not None:
    Algorithm.load_checkpoint(algo, checkpoint=checkpoint)
result = {}
current_iteration = 0
checkpoint_iteration = checkpoint_num or 0

try:
    while result.get("time_total_s", 0) < time_total_s:
        result = algo.train()
        print(f"======== Iteration {result['training_iteration']} ========")
        pprint(result, depth=1)

        if current_iteration % checkpoint_freq == 0:
            checkpoint_dir = get_checkpoint_dir(checkpoint_iteration)
            print(f"======= Saving checkpoint {checkpoint_iteration} =======")
            algo.save_checkpoint(checkpoint_dir)
            checkpoint_iteration += 1
        current_iteration += 1
    algo.save_checkpoint(get_checkpoint_dir(checkpoint_iteration))
finally:
    algo.save(get_checkpoint_dir("latest"))

algo.stop()
This approach could be helpful because it makes very evident what is happening during training: very little scheduling is automated, since it does not use Tune. The downside is that it loses the trialling and hyperparameter manipulation of schedulers like the PopulationBasedTraining that was originally used.
I am unsure whether we should completely abandon this approach, since it is obvious how it works. I am considering splitting it out into a separate example.
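For context, here is a minimal sketch of the kind of PopulationBasedTraining scheduler that the Tune-based approaches can attach; the metric name, perturbation interval, and mutated hyperparameters are illustrative, not the exact values from this example.

# Sketch only: hyperparameter names and ranges are illustrative.
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

pbt = PopulationBasedTraining(
    time_attr="time_total_s",
    metric="episode_reward_mean",
    mode="max",
    perturbation_interval=300,  # seconds between hyperparameter perturbations
    hyperparam_mutations={
        "lr": tune.loguniform(1e-5, 1e-3),
        "train_batch_size": [1000, 2000, 4000],
    },
)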
examples/rl/rllib/rllib.py (outdated)
## Approach 3
# from ray import air
# run_config = air.RunConfig(
#     name=experiment_name,
#     stop={"time_total_s": time_total_s},
#     callbacks=[Callbacks],
#     storage_path=result_dir,
#     checkpoint_config=air.CheckpointConfig(
#         num_to_keep=3,
#         checkpoint_frequency=checkpoint_freq,
#         checkpoint_at_end=True,
#     ),
#     failure_config=air.FailureConfig(
#         max_failures=3,
#         fail_fast=False,
#     ),
#     local_dir=str(result_dir),
# )
# tune_config = tune.TuneConfig(
#     metric="episode_reward_mean",
#     mode="max",
#     num_samples=num_samples,
#     scheduler=pbt,
# )
# tuner = tune.Tuner(
#     "PPO",
#     param_space=algo_config,
#     tune_config=tune_config,
#     run_config=run_config,
# )

best_logdir = Path(analysis.get_best_logdir("episode_reward_max", mode="max"))
model_path = best_logdir / "model"
# results = tuner.fit()
# # Get the best result based on a particular metric.
# best_result = results.get_best_result(metric="episode_reward_mean", mode="max")

copy_tree(str(model_path), save_model_path, overwrite=True)
print(f"Wrote model to: {save_model_path}")
# # Get the best checkpoint corresponding to the best result.
# best_checkpoint = best_result.checkpoint
I have not finished testing this approach, but it appears to be the modern way of working with Tune and more flexible than the previous two; it drives Tune through the Tuner interface. I had some difficulty figuring out where the algorithm config goes. The examples appear to show that it goes in param_space, but that argument is not clearly documented.
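As a working reference, here is a minimal, self-contained sketch of the Tuner interface, assuming the typed config is converted to a plain dict via AlgorithmConfig.to_dict(); the PPO/CartPole setup, metric name, and stop condition are placeholders rather than this example's actual configuration.

# Sketch: placeholder env and algorithm; shows one way to feed a typed config to Tuner.
from ray import air, tune
from ray.rllib.algorithms.ppo import PPOConfig

algo_config = PPOConfig().environment("CartPole-v1").rollouts(num_rollout_workers=1)

tuner = tune.Tuner(
    "PPO",
    # param_space takes the (searchable) hyperparameter dict; converting the
    # typed config with to_dict() is one way to supply it.
    param_space=algo_config.to_dict(),
    tune_config=tune.TuneConfig(metric="episode_reward_mean", mode="max"),
    run_config=air.RunConfig(stop={"time_total_s": 60}),
)
results = tuner.fit()
best_result = results.get_best_result(metric="episode_reward_mean", mode="max")
print(best_result.checkpoint)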
examples/rl/rllib/rllib.py (outdated)
# analysis = tune.run(
#     "PG",
#     name=experiment_name,
#     stop={"time_total_s": time_total_s},
#     checkpoint_freq=checkpoint_freq,
#     checkpoint_at_end=True,
#     local_dir=str(result_dir),
#     resume=resume_training,
#     restore=checkpoint,
#     max_failures=3,
#     num_samples=num_samples,
#     export_formats=["model", "checkpoint"],
#     config=algo_config,
#     scheduler=pbt,
# )

# XXX: There is a bug in Ray where we can only export a trained model if
# the policy it's attached to is named 'default_policy'.
# See: https://github.com/ray-project/ray/issues/5339
rllib_policies = {
    "default_policy": (
        None,
        rllib_agent["observation_space"],
        rllib_agent["action_space"],
        {"model": {"custom_model": TrainingModel.NAME}},
    )
}
# print(analysis.dataframe().head())

smarts.core.seed(seed)
tune_config = {
    "env": RLlibHiWayEnv,
    "log_level": "WARN",
    "num_workers": num_workers,
    "env_config": {
        "seed": tune.sample_from(lambda spec: random.randint(0, 300)),
        "scenarios": [str(Path(scenario).expanduser().resolve().absolute())],
        "headless": not envision,
        "agent_specs": {
            f"AGENT-{i}": rllib_agent["agent_spec"] for i in range(num_agents)
        },
    },
    "multiagent": {"policies": rllib_policies},
    "callbacks": Callbacks,
}
# best_logdir = Path(analysis.get_best_logdir("episode_reward_max", mode="max"))
# model_path = best_logdir / "model"

experiment_name = "rllib_example_multi"
result_dir = Path(result_dir).expanduser().resolve().absolute()
if checkpoint_num:
    checkpoint = str(
        result_dir / f"checkpoint_{checkpoint_num}" / f"checkpoint-{checkpoint_num}"
    )
else:
    checkpoint = None
# copy_tree(str(model_path), save_model_path, overwrite=True)
# print(f"Wrote model to: {save_model_path}")
The original approach still works after updating the configuration, although it does not appear to be advocated anywhere in the ray documentation. Its flow is a bit different, though, which makes it harder to reproduce with the other approaches.
examples/rl/rllib/rllib.py (outdated)
smarts.core.seed(seed)
algo_config = (
    PGConfig()
The typed config appears to help with generating the example from the code side of things, though I wonder about approaching it differently.
It also appears that, as long as we register the algorithm, it should be possible to use the rllib train CLI to run our examples with a custom configuration.
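For reference, a rough sketch of the registrations that would let the environment and custom model be looked up by name (e.g. from a CLI-supplied config). The registered environment name is hypothetical, and TrainingModel and RLlibHiWayEnv are assumed to be importable from this example's own modules.

# Sketch only: "rllib_hiway-v0" is a hypothetical name; TrainingModel and
# RLlibHiWayEnv are assumed to come from this example's modules.
from ray import tune
from ray.rllib.models import ModelCatalog

# Custom model, looked up by name from the algorithm config.
ModelCatalog.register_custom_model(TrainingModel.NAME, TrainingModel)
# Environment, looked up by registered string name instead of by class.
tune.register_env("rllib_hiway-v0", lambda env_config: RLlibHiWayEnv(env_config))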
    ):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)

    def forward(self, input_dict, state, seq_lens):
One of the issues here is that the observation and action spaces are expected to already be formed at this point (they are given at the start).
I could inject the action and observation space adaptors here, but this is already after the tensors have been generated, and I am unsure whether there is a way to intercept them earlier. I found an experimental Config.multiagent.observation_fn, but it does not appear to inject between the observation and the policy.
We may have to keep the action space and observation space adaptors for RLlibHiWayEnv because the configuration is complicated without them.
I still think those adaptors should be removed from the agent specification, since they are only applicable to configuring multi-policy environments.
Looks good to me. I don't have strong opinions about which approach to use. I would probably vote for option 3 if I had to choose.
examples/rl/rllib/rllib_agent.py (outdated)
        super().__init__(obs_space, action_space, num_outputs, model_config, name)

    def forward(self, input_dict, state, seq_lens):
        # return super().forward(input_dict, state, seq_lens)
Remove?
# return super().forward(input_dict, state, seq_lens)
smarts/env/custom_observations.py (outdated)
    neighborhood_vehicle_states = obs.neighborhood_vehicle_states

    # distance of vehicle from center of lane
Looks like this comment should be moved down.
@saulfield OK, I will pursue this option. I think I will still keep option 2 around, but as a separate, more primitive example.
Since the model files were removed, the SMARTS/examples/rl/rllib/model/README.md
file should be removed too.
We could go with the latest approach 3, and keep approach 2 as a separate example with its own test case.
Consider adding the rllib example
(i) to the docs at https://smarts.readthedocs.io/en/latest/examples/rl_model.html, and
(ii) to the main readme page at https://github.com/huawei-noah/SMARTS/blob/master/README.md#rl-model
I have most things working; the main issue left is that the parallel environments do not appear to be getting unique seeds.
(force-pushed from 02da98e to b05f224)
# Cantor pairing of the worker and vector indices gives each parallel
# environment a distinct offset from the base seed.
a = config.worker_index
b = config.vector_index
c = (a + b) * (a + b + 1) // 2 + b
smarts.core.seed(seed + c)
self._seed = seed + c
smarts.core.seed(self._seed + c)
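As a quick sanity check, the pairing above is the Cantor pairing function, which is injective, so every (worker_index, vector_index) combination maps to a distinct seed offset:

# Standalone check that the Cantor pairing yields distinct offsets for a
# small grid of (worker_index, vector_index) pairs.
def cantor_pair(a: int, b: int) -> int:
    return (a + b) * (a + b + 1) // 2 + b

offsets = {(a, b): cantor_pair(a, b) for a in range(8) for b in range(8)}
assert len(set(offsets.values())) == len(offsets)  # no collisions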
One thing to note from this PR: the original behaviour in ray[rllib]==1.4.0, which resulted in environment seed diversity within the same trial, no longer works.
The seed used to vary across all environments of a trial but is now the same within a trial; I am unsure whether this has to do with how the environment configuration is distributed.
The implication is that this slows down experience gain.
examples/rl/rllib/pg_pbt_example.py (outdated)
help="Destination path of where to copy the model when training is over", | ||
) | ||
args = parser.parse_args() | ||
build_scenario(scenario=args.scenario, clean=False, seed=42) |
Here, using a seed, we first build the scenarios and generate a fixed traffic set. Then we use the same scenarios in each of the parallel environments to collect experience. Although the parallel environments might have different seeds and the scenario order might be shuffled, the underlying traffic set appears to be the same since it was generated from the same seed. Should we be building the scenarios with different seeds within each parallel environment instance?
That approach will only work partially because, unless we use a true-random seed, the seed is managed by tune in the PBT scheduling. From my investigation, all of the environments are identical within a trial. In that case, if we were building scenarios during the experiment, I would be worried about race conditions between the environments, because it is not clear how to differentiate between them.
The second issue is that the config is specified at the start of the configuration. To do what you ask would require copying and building the scenario at the start of each trial. I think that might be possible through the callbacks, but from what I can tell the config is intended to be constant, and I would need to investigate whether it is possible.
What I am slightly surprised about (see the comment above) is that the config.vector_index and config.worker_index values do not appear to differ across workers, and reset(seed=None) seems to be the constant case.
The easier way to deal with this is to pre-generate a set of scenario variations at the start; I will expand the number of scenarios.
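A rough sketch of that pre-generation step, assuming the build_scenario helper used in this example and that building a copied scenario directory with a different seed produces different traffic; the helper below is hypothetical.

# Sketch only: build_scenario_variations is a hypothetical helper. It copies the
# scenario directory and rebuilds each copy with a different seed before training.
import shutil
from pathlib import Path

def build_scenario_variations(scenario: str, num_variations: int, base_seed: int = 42):
    scenario_dir = Path(scenario).expanduser().resolve()
    variation_paths = []
    for i in range(num_variations):
        variation_dir = scenario_dir.parent / f"{scenario_dir.name}_v{i}"
        if not variation_dir.exists():
            shutil.copytree(scenario_dir, variation_dir)
        build_scenario(scenario=str(variation_dir), clean=True, seed=base_seed + i)
        variation_paths.append(str(variation_dir))
    return variation_paths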
(force-pushed from b05f224 to 4c2a0ef)