Update ray.rllib to 2.5 #2067
Conversation
examples/rl/rllib/rllib.py (outdated)
@Adaickalavan @saulfield If possible, I would like some input on the approaches I tried here.
examples/rl/rllib/rllib.py (outdated)
## Approach 2
from pprint import pprint
from ray.rllib.algorithms.algorithm import Algorithm

algo = algo_config.build()
if checkpoint is not None:
    Algorithm.load_checkpoint(algo, checkpoint=checkpoint)
result = {}
current_iteration = 0
checkpoint_iteration = checkpoint_num or 0

try:
    while result.get("time_total_s", 0) < time_total_s:
        result = algo.train()
        print(f"======== Iteration {result['training_iteration']} ========")
        pprint(result, depth=1)

        if current_iteration % checkpoint_freq == 0:
            checkpoint_dir = get_checkpoint_dir(checkpoint_iteration)
            print(f"======= Saving checkpoint {checkpoint_iteration} =======")
            algo.save_checkpoint(checkpoint_dir)
            checkpoint_iteration += 1
        current_iteration += 1
    algo.save_checkpoint(get_checkpoint_dir(checkpoint_iteration))
finally:
    algo.save(get_checkpoint_dir("latest"))

algo.stop()
This approach could be helpful because it makes very evident what is happening during training: very little scheduling is automated, since it does not use Tune. The downside is that it loses the trialling and hyperparameter manipulation of schedulers like the PopulationBasedTraining that was originally used.
I am unsure whether we should completely abandon this approach, since it is obvious how it works. I am considering splitting it out into a separate example.
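For context, here is a minimal sketch of the kind of PopulationBasedTraining scheduler that the Tune-based approaches can attach; the metric name, perturbation interval, and mutated hyperparameters are illustrative, not the exact values from this example.

# Sketch only: hyperparameter names and ranges are illustrative.
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

pbt = PopulationBasedTraining(
    time_attr="time_total_s",
    metric="episode_reward_mean",
    mode="max",
    perturbation_interval=300,  # seconds between hyperparameter perturbations
    hyperparam_mutations={
        "lr": tune.loguniform(1e-5, 1e-3),
        "train_batch_size": [1000, 2000, 4000],
    },
)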
examples/rl/rllib/rllib.py (outdated)
## Approach 3
# from ray import air
# run_config = air.RunConfig(
#     name=experiment_name,
#     stop={"time_total_s": time_total_s},
#     callbacks=[Callbacks],
#     storage_path=result_dir,
#     checkpoint_config=air.CheckpointConfig(
#         num_to_keep=3,
#         checkpoint_frequency=checkpoint_freq,
#         checkpoint_at_end=True,
#     ),
#     failure_config=air.FailureConfig(
#         max_failures=3,
#         fail_fast=False,
#     ),
#     local_dir=str(result_dir),
# )
# tune_config = tune.TuneConfig(
#     metric="episode_reward_mean",
#     mode="max",
#     num_samples=num_samples,
#     scheduler=pbt,
# )
# tuner = tune.Tuner(
#     "PPO",
#     param_space=algo_config,
#     tune_config=tune_config,
#     run_config=run_config,
# )

best_logdir = Path(analysis.get_best_logdir("episode_reward_max", mode="max"))
model_path = best_logdir / "model"
# results = tuner.fit()
# # Get the best result based on a particular metric.
# best_result = results.get_best_result(metric="episode_reward_mean", mode="max")

copy_tree(str(model_path), save_model_path, overwrite=True)
print(f"Wrote model to: {save_model_path}")
# # Get the best checkpoint corresponding to the best result.
# best_checkpoint = best_result.checkpoint
I have not finished testing this approach, but it appears to be the modern way of working with Tune and more flexible than the previous two; it drives Tune through the Tuner interface. I had some difficulty figuring out where the algorithm config goes. The examples appear to show that it goes in param_space, but that argument is not clearly documented.
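As a working reference, here is a minimal, self-contained sketch of the Tuner interface, assuming the typed config is converted to a plain dict via AlgorithmConfig.to_dict(); the PPO/CartPole setup, metric name, and stop condition are placeholders rather than this example's actual configuration.

# Sketch: placeholder env and algorithm; shows one way to feed a typed config to Tuner.
from ray import air, tune
from ray.rllib.algorithms.ppo import PPOConfig

algo_config = PPOConfig().environment("CartPole-v1").rollouts(num_rollout_workers=1)

tuner = tune.Tuner(
    "PPO",
    # param_space takes the (searchable) hyperparameter dict; converting the
    # typed config with to_dict() is one way to supply it.
    param_space=algo_config.to_dict(),
    tune_config=tune.TuneConfig(metric="episode_reward_mean", mode="max"),
    run_config=air.RunConfig(stop={"time_total_s": 60}),
)
results = tuner.fit()
best_result = results.get_best_result(metric="episode_reward_mean", mode="max")
print(best_result.checkpoint)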
examples/rl/rllib/rllib.py (outdated)
# analysis = tune.run(
#     "PG",
#     name=experiment_name,
#     stop={"time_total_s": time_total_s},
#     checkpoint_freq=checkpoint_freq,
#     checkpoint_at_end=True,
#     local_dir=str(result_dir),
#     resume=resume_training,
#     restore=checkpoint,
#     max_failures=3,
#     num_samples=num_samples,
#     export_formats=["model", "checkpoint"],
#     config=algo_config,
#     scheduler=pbt,
# )

# XXX: There is a bug in Ray where we can only export a trained model if
# the policy it's attached to is named 'default_policy'.
# See: https://github.com/ray-project/ray/issues/5339
rllib_policies = {
    "default_policy": (
        None,
        rllib_agent["observation_space"],
        rllib_agent["action_space"],
        {"model": {"custom_model": TrainingModel.NAME}},
    )
}
# print(analysis.dataframe().head())

smarts.core.seed(seed)
tune_config = {
    "env": RLlibHiWayEnv,
    "log_level": "WARN",
    "num_workers": num_workers,
    "env_config": {
        "seed": tune.sample_from(lambda spec: random.randint(0, 300)),
        "scenarios": [str(Path(scenario).expanduser().resolve().absolute())],
        "headless": not envision,
        "agent_specs": {
            f"AGENT-{i}": rllib_agent["agent_spec"] for i in range(num_agents)
        },
    },
    "multiagent": {"policies": rllib_policies},
    "callbacks": Callbacks,
}
# best_logdir = Path(analysis.get_best_logdir("episode_reward_max", mode="max"))
# model_path = best_logdir / "model"

experiment_name = "rllib_example_multi"
result_dir = Path(result_dir).expanduser().resolve().absolute()
if checkpoint_num:
    checkpoint = str(
        result_dir / f"checkpoint_{checkpoint_num}" / f"checkpoint-{checkpoint_num}"
    )
else:
    checkpoint = None
# copy_tree(str(model_path), save_model_path, overwrite=True)
# print(f"Wrote model to: {save_model_path}")
The original approach still works after updating the configuration, although it does not appear to be advocated anywhere in the ray documentation. Its flow is a bit different, though, which makes it harder to reproduce with the other approaches.
examples/rl/rllib/rllib.py (outdated)
smarts.core.seed(seed)
algo_config = (
    PGConfig()
The typed config appears to help with generating the example from the code side of things, though I wonder about approaching it differently.
It also appears that, as long as we register the algorithm, it should be possible to use the rllib train CLI to run our examples with a custom configuration.
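For reference, a rough sketch of the registrations that would let the environment and custom model be looked up by name (e.g. from a CLI-supplied config). The registered environment name is hypothetical, and TrainingModel and RLlibHiWayEnv are assumed to be importable from this example's own modules.

# Sketch only: "rllib_hiway-v0" is a hypothetical name; TrainingModel and
# RLlibHiWayEnv are assumed to come from this example's modules.
from ray import tune
from ray.rllib.models import ModelCatalog

# Custom model, looked up by name from the algorithm config.
ModelCatalog.register_custom_model(TrainingModel.NAME, TrainingModel)
# Environment, looked up by registered string name instead of by class.
tune.register_env("rllib_hiway-v0", lambda env_config: RLlibHiWayEnv(env_config))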
    ):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)

    def forward(self, input_dict, state, seq_lens):
One of the issues here is that the observation and action spaces are expected to already be formed at this point (they are given at the start).
I could inject the action and observation space adaptors here, but this is already after the tensors have been generated, and I am unsure whether there is a way to intercept them earlier. I found an experimental Config.multiagent.observation_fn, but it does not appear to inject between the observation and the policy.
We may have to keep the action space and observation space adaptors for RLlibHiWayEnv because the configuration is complicated without them.
I still think those adaptors should be removed from the agent specification, since they are only applicable to configuring multi-policy environments.
Looks good to me. I don't have strong opinions about which approach to use. I would probably vote for option 3 if I had to choose.
examples/rl/rllib/rllib_agent.py (outdated)
        super().__init__(obs_space, action_space, num_outputs, model_config, name)

    def forward(self, input_dict, state, seq_lens):
        # return super().forward(input_dict, state, seq_lens)
Remove?
# return super().forward(input_dict, state, seq_lens)
smarts/env/custom_observations.py (outdated)
    neighborhood_vehicle_states = obs.neighborhood_vehicle_states

    # distance of vehicle from center of lane
Looks like this comment should be moved down.
@saulfield OK, I will pursue this option. I think I will still keep option 2 around, but as a separate, more primitive example.
Since the model files were removed, the SMARTS/examples/rl/rllib/model/README.md
file should be removed too.
We could go with the latest approach 3, and keep approach 2 as a separate example with its own test case.
Consider adding the rllib example
(i) to the docs at https://smarts.readthedocs.io/en/latest/examples/rl_model.html, and
(ii) to the main readme page at https://github.com/huawei-noah/SMARTS/blob/master/README.md#rl-model
I have most things working; the main issue left is that the parallel environments do not appear to be getting unique seeds.
(force-pushed from 02da98e to b05f224)
# Cantor pairing of the worker and vector indices gives each parallel
# environment a distinct offset from the base seed.
a = config.worker_index
b = config.vector_index
c = (a + b) * (a + b + 1) // 2 + b
smarts.core.seed(seed + c)
self._seed = seed + c
smarts.core.seed(self._seed + c)
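As a quick sanity check, the pairing above is the Cantor pairing function, which is injective, so every (worker_index, vector_index) combination maps to a distinct seed offset:

# Standalone check that the Cantor pairing yields distinct offsets for a
# small grid of (worker_index, vector_index) pairs.
def cantor_pair(a: int, b: int) -> int:
    return (a + b) * (a + b + 1) // 2 + b

offsets = {(a, b): cantor_pair(a, b) for a in range(8) for b in range(8)}
assert len(set(offsets.values())) == len(offsets)  # no collisions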
One thing to note from this PR: the original behaviour in ray[rllib]==1.4.0, which resulted in environment seed diversity within the same trial, no longer works.
The seed used to vary across all environments of a trial but is now the same within a trial; I am unsure whether this has to do with how the environment configuration is distributed.
The implication is that this slows down experience gain.
examples/rl/rllib/pg_pbt_example.py (outdated)
help="Destination path of where to copy the model when training is over", | ||
) | ||
args = parser.parse_args() | ||
build_scenario(scenario=args.scenario, clean=False, seed=42) |
Here, using a seed, we first build the scenarios and generate a fixed traffic set. Then we use the same scenarios in each of the parallel environments to collect experience. Although the parallel environments might have different seeds and the scenario order might be shuffled, the underlying traffic set appears to be the same since it was generated from the same seed. Should we be building the scenarios with different seeds within each parallel environment instance?
That approach will only work partially because, unless we use a true-random seed, the seed is managed by tune in the PBT scheduling. From my investigation, all of the environments are identical within a trial. In that case, if we were building scenarios during the experiment, I would be worried about race conditions between the environments, because it is not clear how to differentiate between them.
The second issue is that the config is specified at the start of the configuration. To do what you ask would require copying and building the scenario at the start of each trial. I think that might be possible through the callbacks, but from what I can tell the config is intended to be constant, and I would need to investigate whether it is possible.
What I am slightly surprised about (see the comment above) is that the config.vector_index and config.worker_index values do not appear to differ across workers, and reset(seed=None) seems to be the constant case.
The easier way to deal with this is to pre-generate a set of scenario variations at the start; I will expand the number of scenarios.
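A rough sketch of that pre-generation step, assuming the build_scenario helper used in this example and that building a copied scenario directory with a different seed produces different traffic; the helper below is hypothetical.

# Sketch only: build_scenario_variations is a hypothetical helper. It copies the
# scenario directory and rebuilds each copy with a different seed before training.
import shutil
from pathlib import Path

def build_scenario_variations(scenario: str, num_variations: int, base_seed: int = 42):
    scenario_dir = Path(scenario).expanduser().resolve()
    variation_paths = []
    for i in range(num_variations):
        variation_dir = scenario_dir.parent / f"{scenario_dir.name}_v{i}"
        if not variation_dir.exists():
            shutil.copytree(scenario_dir, variation_dir)
        build_scenario(scenario=str(variation_dir), clean=True, seed=base_seed + i)
        variation_paths.append(str(variation_dir))
    return variation_paths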
(force-pushed from b05f224 to 4c2a0ef)