[RLlib] Cleanup examples folder #14: Add example script for how to resume a tune.Tuner.fit() experiment from a checkpoint. #45681
Conversation
Commits:
- Signed-off-by: sven1977 <[email protected]>
- …nup_examples_folder_14_continue_training_from_checkpoint
  Signed-off-by: sven1977 <[email protected]>
- Signed-off-by: sven1977 <[email protected]>
simonsays1980 left a comment
LGTM. Invaluable example for users!
```python
tuner = tune.Tuner(
    trainable=config.algo_class,
    param_space=config,
    run_config=air.RunConfig(
```
In regard to the future deprecation of air: Can we use ray.train.RunConfig here instead?
done
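For reference, a minimal sketch of the suggested substitution, assuming `config` (the example's AlgorithmConfig) and `tune_callbacks` are the objects from the example script:

```python
# Sketch: import RunConfig from ray.train instead of the
# (to-be-deprecated) ray.air namespace.
from ray import train, tune

tuner = tune.Tuner(
    trainable=config.algo_class,  # `config` from the example script
    param_space=config,
    run_config=train.RunConfig(   # formerly air.RunConfig
        callbacks=tune_callbacks,
    ),
)
```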
```python
    param_space=config,
    run_config=air.RunConfig(
        callbacks=tune_callbacks,
        checkpoint_config=air.CheckpointConfig(
```
Same here: can we use ray.train.CheckpointConfig?
done
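Analogously, a sketch of the same substitution for the checkpoint config; the frequency of 1 matches the setup discussed below, while `checkpoint_at_end=True` is an illustrative assumption:

```python
# Sketch: ray.train.CheckpointConfig replaces air.CheckpointConfig.
from ray import train

checkpoint_config = train.CheckpointConfig(
    checkpoint_frequency=1,   # checkpoint every iteration, as in this setup
    checkpoint_at_end=True,   # assumption for illustration
)
```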
```python
results = tuner.fit()
experiment_name = Path(results.experiment_path).name

# Extract the latest checkpoint from the results and confirm it's the right one.
```
Let's phrase this comment differently. get_best_result gives us the latest checkpoint only in this specific setup, and only with a checkpoint frequency of 1; otherwise we get the checkpoint with the highest episode_return_mean from whenever that happened.
Ah, good catch! Yes, in this example, we should probably just use the last checkpoint, not necessarily the best. ...
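A hedged sketch of the distinction being discussed, assuming `results` is the ResultGrid returned by `tuner.fit()`:

```python
# Metric name constants as used in the example script.
from ray.rllib.utils.metrics import ENV_RUNNER_RESULTS, EPISODE_RETURN_MEAN

# get_best_result() picks the trial that maximized the given metric ...
best_result = results.get_best_result(
    metric=f"{ENV_RUNNER_RESULTS}/{EPISODE_RETURN_MEAN}", mode="max"
)
# ... whereas Result.checkpoint is always that trial's *latest*
# checkpoint, which is what this example actually wants to continue from.
latest_checkpoint = best_result.checkpoint
```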
```diff
- # TODO (simon): Change to -800 once the metrics are fixed. Currently
- # the combined return is not correctly computed.
- f"{ENV_RUNNER_RESULTS}/episode_return_mean": -400,
+ f"{ENV_RUNNER_RESULTS}/{EPISODE_RETURN_MEAN}": -800,
```
Great catch!
Commits:
- Signed-off-by: sven1977 <[email protected]>
- …nup_examples_folder_14_continue_training_from_checkpoint
  Signed-off-by: sven1977 <[email protected]>
- …r how to resume a tune.Tuner.fit() experiment from a checkpoint. (ray-project#45681)
  Signed-off-by: Richard Liu <[email protected]>
Cleanup examples folder #14: Add example script for how to resume a tune.Tuner.fit() experiment from a checkpoint.

Why are these changes needed?
Adds an example script for resuming a tune.Tuner.fit() experiment from a checkpoint. Also changes MetricsLogger.peek() to take key instead of *key, to unify its signature with all the other methods of MetricsLogger.

Related issue number

Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
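For context, a minimal sketch of the resume pattern this PR's example script demonstrates. The `experiment_path` variable is a hypothetical placeholder for wherever the earlier run wrote its results, and `config` stands in for the same AlgorithmConfig used in the original run:

```python
# Sketch: resume a previous tune.Tuner.fit() run from its on-disk state.
from ray import tune

if tune.Tuner.can_restore(experiment_path):
    # Pick up the unfinished experiment where it left off.
    tuner = tune.Tuner.restore(
        experiment_path,
        trainable=config.algo_class,
        resume_errored=True,  # also resume trials that previously errored
    )
else:
    # No restorable state found: start a fresh experiment instead.
    tuner = tune.Tuner(trainable=config.algo_class, param_space=config)

results = tuner.fit()
```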