Update ray.rllib to 2.5 #2067

Merged: 17 commits merged into master on Jun 22, 2023
Conversation

@Gamenot (Collaborator) commented Jun 15, 2023:

No description provided.

@Gamenot marked this pull request as ready for review June 15, 2023 15:53

@Gamenot (Collaborator, Author) commented:

@Adaickalavan @saulfield If possible I would like some input on the approaches I tried here.

Comment on lines 223 to 249
## Approach 2
from ray.rllib.algorithms.algorithm import Algorithm

algo = algo_config.build()
if checkpoint is not None:
    Algorithm.load_checkpoint(algo, checkpoint=checkpoint)
result = {}
current_iteration = 0
checkpoint_iteration = checkpoint_num or 0

try:
    while result.get("time_total_s", 0) < time_total_s:
        result = algo.train()
        print(f"======== Iteration {result['training_iteration']} ========")
        print(result, depth=1)

        if current_iteration % checkpoint_freq == 0:
            checkpoint_dir = get_checkpoint_dir(checkpoint_iteration)
            print(f"======= Saving checkpoint {checkpoint_iteration} =======")
            algo.save_checkpoint(checkpoint_dir)
            checkpoint_iteration += 1
        current_iteration += 1
    algo.save_checkpoint(get_checkpoint_dir(checkpoint_iteration))
finally:
    algo.save(get_checkpoint_dir("latest"))

algo.stop()
@Gamenot (Collaborator, Author) — Jun 15, 2023:

This approach could be helpful because it makes what happens during training very evident: since it does not use Tune, very little of the scheduling is automated. The downside is that it loses the trialing and hyperparameter manipulation of schedulers such as the PopulationBasedTraining that was originally used.

I am unsure whether we should completely abandon this approach, since it is obvious how it works. I am considering splitting it out into its own example.
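
For reference, this is roughly what the PopulationBasedTraining setup looks like on the Tune side, i.e. what Approach 2 gives up (a minimal sketch; the hyperparameter names and ranges below are illustrative placeholders, not the exact values from this example):

import random

from ray.tune.schedulers import PopulationBasedTraining

pbt = PopulationBasedTraining(
    time_attr="time_total_s",
    perturbation_interval=300,
    # Mutation keys must correspond to entries in the algorithm config that
    # Tune samples trials from.
    hyperparam_mutations={
        "lr": [1e-3, 5e-4, 1e-4],
        "train_batch_size": lambda: random.randint(2000, 8000),
    },
)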

Comment on lines 251 to 287
## Approach 3
# from ray import air
# run_config = air.RunConfig(
#     name=experiment_name,
#     stop={"time_total_s": time_total_s},
#     callbacks=[Callbacks],
#     storage_path=result_dir,
#     checkpoint_config=air.CheckpointConfig(
#         num_to_keep=3,
#         checkpoint_frequency=checkpoint_freq,
#         checkpoint_at_end=True,
#     ),
#     failure_config=air.FailureConfig(
#         max_failures=3,
#         fail_fast=False,
#     ),
#     local_dir=str(result_dir),
# )
# tune_config = tune.TuneConfig(
#     metric="episode_reward_mean",
#     mode="max",
#     num_samples=num_samples,
#     scheduler=pbt,
# )
# tuner = tune.Tuner(
#     "PPO",
#     param_space=algo_config,
#     tune_config=tune_config,
#     run_config=run_config,
# )

best_logdir = Path(analysis.get_best_logdir("episode_reward_max", mode="max"))
model_path = best_logdir / "model"
# results = tuner.fit()
# # Get the best result based on a particular metric.
# best_result = results.get_best_result(metric="episode_reward_mean", mode="max")

copy_tree(str(model_path), save_model_path, overwrite=True)
print(f"Wrote model to: {save_model_path}")
# # Get the best checkpoint corresponding to the best result.
# best_checkpoint = best_result.checkpoint
@Gamenot (Collaborator, Author):

I have not finished testing this approach, but it appears to be the modern way of working with Tune and more flexible than the previous two. It uses Tune through the Tuner interface. I had some difficulty figuring out where the algorithm config goes.

The examples appear to show that it goes in param_space, but that argument is not clearly documented.
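
My current understanding of the flow is sketched below, hedged since I have not finished testing it; it reuses run_config and tune_config from the snippet above, and the exact handling of param_space may differ across Ray 2.x versions:

from ray import tune

tuner = tune.Tuner(
    "PPO",
    # to_dict() flattens the typed AlgorithmConfig into the plain dict that
    # Tune samples trials from; tune.* search objects can override entries.
    param_space=algo_config.to_dict(),
    tune_config=tune_config,
    run_config=run_config,
)
results = tuner.fit()
best_result = results.get_best_result(metric="episode_reward_mean", mode="max")
best_checkpoint = best_result.checkpoint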

Comment on lines 199 to 221
# analysis = tune.run(
#     "PG",
#     name=experiment_name,
#     stop={"time_total_s": time_total_s},
#     checkpoint_freq=checkpoint_freq,
#     checkpoint_at_end=True,
#     local_dir=str(result_dir),
#     resume=resume_training,
#     restore=checkpoint,
#     max_failures=3,
#     num_samples=num_samples,
#     export_formats=["model", "checkpoint"],
#     config=algo_config,
#     scheduler=pbt,
# )

# XXX: There is a bug in Ray where we can only export a trained model if
# the policy it's attached to is named 'default_policy'.
# See: https://github.com/ray-project/ray/issues/5339
rllib_policies = {
    "default_policy": (
        None,
        rllib_agent["observation_space"],
        rllib_agent["action_space"],
        {"model": {"custom_model": TrainingModel.NAME}},
    )
}
# print(analysis.dataframe().head())

smarts.core.seed(seed)
tune_config = {
    "env": RLlibHiWayEnv,
    "log_level": "WARN",
    "num_workers": num_workers,
    "env_config": {
        "seed": tune.sample_from(lambda spec: random.randint(0, 300)),
        "scenarios": [str(Path(scenario).expanduser().resolve().absolute())],
        "headless": not envision,
        "agent_specs": {
            f"AGENT-{i}": rllib_agent["agent_spec"] for i in range(num_agents)
        },
    },
    "multiagent": {"policies": rllib_policies},
    "callbacks": Callbacks,
}
# best_logdir = Path(analysis.get_best_logdir("episode_reward_max", mode="max"))
# model_path = best_logdir / "model"

experiment_name = "rllib_example_multi"
result_dir = Path(result_dir).expanduser().resolve().absolute()
if checkpoint_num:
    checkpoint = str(
        result_dir / f"checkpoint_{checkpoint_num}" / f"checkpoint-{checkpoint_num}"
    )
else:
    checkpoint = None
# copy_tree(str(model_path), save_model_path, overwrite=True)
# print(f"Wrote model to: {save_model_path}")
@Gamenot (Collaborator, Author):

The original approach still works after updating the configuration, although the Ray documentation no longer appears to advocate it. Its flow is a bit different, though, which makes it harder to reproduce with the other approaches.


smarts.core.seed(seed)
algo_config = (
    PGConfig()
@Gamenot (Collaborator, Author) — Jun 15, 2023:

The typed config appears to help with generating the example from the code side of things. I do wonder, though, whether there is a better way to work with it.

It also appears that, as long as we register the algorithm, it should be possible to use the rllib train CLI to run our examples with a custom configuration. A rough sketch of the registration side is below.
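
A hedged sketch of that registration, using the env and model already present in this example (the string name for the env is arbitrary; whether the rllib train CLI then picks this up cleanly is something I would still need to verify):

from ray.rllib.models import ModelCatalog
from ray.tune.registry import register_env

# RLlibHiWayEnv and TrainingModel come from this example's existing imports.
# Register them under string names so that a config file (or the CLI) can
# refer to them by name instead of by Python object.
register_env("rllib_hiway-v0", lambda env_config: RLlibHiWayEnv(env_config))
ModelCatalog.register_custom_model(TrainingModel.NAME, TrainingModel)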

    ):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)

    def forward(self, input_dict, state, seq_lens):
@Gamenot (Collaborator, Author) — Jun 15, 2023:

One of the issues here is that the observation and action spaces are expected to already be formed at this point (they are given at the start).

I could inject the action and observation space adaptors here, but this is already after the tensors have been generated, and I am unsure whether there is a way to intercept them earlier. I found an experimental Config.multiagent.observation_fn, but it does not appear to inject between the observation and the policy.

We may have to keep the action space and observation space adaptors for RLlibHiWayEnv because the configuration is complicated without them.

I think those adaptors should still be removed from the agent specification, since they are only applicable to configuring multi-policy environments.
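
A minimal sketch of what keeping the adaptors at the env level could look like: a thin subclass of RLlibHiWayEnv that runs a per-agent observation adapter before observations reach the policy. Both the "observation_adapter" key in env_config and the gymnasium-style return shapes below are assumptions for illustration, not existing code:

class AdaptedHiWayEnv(RLlibHiWayEnv):
    """Applies a per-agent observation adapter before observations reach the policy."""

    def __init__(self, env_config):
        super().__init__(env_config)
        # Hypothetical config key; the adapter maps a raw agent observation
        # to whatever the policy's observation space expects.
        self._obs_adapter = env_config["observation_adapter"]

    def reset(self, *, seed=None, options=None):
        obs, infos = super().reset(seed=seed, options=options)
        return {k: self._obs_adapter(v) for k, v in obs.items()}, infos

    def step(self, actions):
        obs, rewards, terminateds, truncateds, infos = super().step(actions)
        obs = {k: self._obs_adapter(v) for k, v in obs.items()}
        return obs, rewards, terminateds, truncateds, infos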

@saulfield (Contributor) left a comment:

Looks good to me. I don't have strong opinions about which approach to use. I would probably vote for option 3 if I had to choose.

        super().__init__(obs_space, action_space, num_outputs, model_config, name)

    def forward(self, input_dict, state, seq_lens):
        # return super().forward(input_dict, state, seq_lens)
@saulfield (Contributor):

Remove?

Suggested change
# return super().forward(input_dict, state, seq_lens)

neighborhood_vehicle_states = obs.neighborhood_vehicle_states

# distance of vehicle from center of lane

@saulfield (Contributor):

Looks like this comment should be moved down.

@Gamenot (Collaborator, Author) commented Jun 16, 2023:

> Looks good to me. I don't have strong opinions about which approach to use. I would probably vote for option 3 if I had to choose.

@saulfield OK, I will pursue this option.

I think I will still keep option 2 around, but as a separate, more primitive example:

rllib
\ tune_pbt_pg_example.py # Approach 3
\ pg_example.py # Approach 2

@Adaickalavan (Member):

Since the model files were removed, the SMARTS/examples/rl/rllib/model/README.md file should be removed too.

@Adaickalavan (Member) left a comment:

We could go with the latest approach 3, and keep approach 2 as a separate example with its own test case.

Consider adding the rllib example
(i) to the docs at https://smarts.readthedocs.io/en/latest/examples/rl_model.html, and
(ii) to the main readme page at https://github.com/huawei-noah/SMARTS/blob/master/README.md#rl-model

@Gamenot (Collaborator, Author) commented Jun 16, 2023:

I have most things working. The main issue left is that the parallel environments do not appear to be getting unique worker_index and vector_index values, which we originally used for environment diversity within a trial.

Comment on lines 116 to +120
a = config.worker_index
b = config.vector_index
c = (a + b) * (a + b + 1) // 2 + b
smarts.core.seed(seed + c)
self._seed = seed + c
smarts.core.seed(self._seed + c)
@Gamenot (Collaborator, Author) — Jun 20, 2023:

One thing to note from this PR: the original behaviour in ray[rllib]==1.4.0, which used to produce environment seed diversity within the same trial, no longer works.

The seed used to vary across all environments during a trial but is now the same within a trial. I am unsure whether this has to do with how the environment configuration is distributed.

The implication is that this slows down experience gain.
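
For context, the old per-environment seeding in the diff above boils down to a Cantor pairing of the two indices, which gave every (worker, vector) pair a distinct offset; under ray[rllib]==2.5 both indices appear to be identical across the parallel environments in a trial, so the offset no longer varies:

def env_seed(base_seed: int, worker_index: int, vector_index: int) -> int:
    # Cantor pairing: maps each (worker_index, vector_index) pair to a unique
    # non-negative integer, so every parallel environment gets a distinct seed.
    a, b = worker_index, vector_index
    return base_seed + (a + b) * (a + b + 1) // 2 + b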

help="Destination path of where to copy the model when training is over",
)
args = parser.parse_args()
build_scenario(scenario=args.scenario, clean=False, seed=42)
@Adaickalavan (Member):

Here, using a seed, we first build the scenarios and generate a fixed traffic set. Then we use the same scenarios in each of the parallel environments to collect experience. Although the parallel environments might have different seeds and the scenario order might be shuffled, the underlying traffic set appears to be the same since it was generated with the same seed. Should we be building the scenarios with different seeds within each parallel environment instance?

@Gamenot (Collaborator, Author):

That approach would only partially work, because unless we use a truly random seed, the seed is managed by Tune as part of the PBT scheduling. From my investigation, all of the environments are identical within a trial. In that case, if we build scenarios during the experiment, I am worried about race conditions between the environments, because it is not clear how to differentiate between them.

The second issue is that the config is specified at the start of the experiment. To do what you ask would require copying and building the scenario at the start of each trial. That might be possible through the callbacks, but from what I can tell the config is intended to be constant, and I would need to investigate whether it is possible.

What slightly surprises me (see the comment above) is that the config.vector_index and config.worker_index values do not appear to differ across workers, and reset(seed=None) seems to be the constant case.

The easier way to work with this is to pre-generate a set of scenario variations at the start; I will expand the number of scenarios. A rough sketch of that idea is below.
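
Something along these lines, where each variation is built into its own copy of the scenario directory. The copy step and directory naming are illustrative, not existing helpers; build_scenario is the same entry point used at the bottom of this example:

import shutil
from pathlib import Path


def build_scenario_variations(scenario: str, num_variations: int, base_seed: int = 42):
    variation_dirs = []
    for i in range(num_variations):
        src = Path(scenario)
        variant_dir = src.with_name(f"{src.name}_seed{base_seed + i}")
        shutil.copytree(src, variant_dir, dirs_exist_ok=True)
        # Same build entry point as in this example, but with a distinct seed
        # per copy so the generated traffic differs between variations.
        build_scenario(scenario=str(variant_dir), clean=True, seed=base_seed + i)
        variation_dirs.append(str(variant_dir))
    # The resulting list can then be passed as env_config["scenarios"].
    return variation_dirs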

@Gamenot merged commit b6c9d64 into master Jun 22, 2023
@Adaickalavan deleted the tucker/upgrade_ray_rllib branch October 12, 2023 14:11