forked from ray-project/ray
Kiko/fix stopiteration handling #178
Open
Kiko-Aumond wants to merge 102 commits into releases/1.3.0 from kiko/fix_stopiteration_handling
Commits
5db8b9a dmlyubim: syncing to 1.7.2
55bc018 dmlyubim: common public rllib cql renames
31b77f5 dmlyubim: patching sac dist class get
844dba4 dmlyubim: retrofitting rllib/offline package to 1.7.2
ace3f85 dmlyubim: retrofit space_utils 1.7.2
6640af3 dmlyubim: retrofit ray.tune.registry to 1.7.2 (add input registry)
a9b7a56 dmlyubim: test changes
e09c33d dmlyubim: cql test pendulum data
313f88a dmlyubim: in 1.3, replay buffer isn't reworked to track capacity vs. current size
1e5159a dmlyubim: Updating metrics to 1.7.2 (update sampled count on request to enable …
5f12afa dmlyubim: slight test refactoring to enable intermediate debugging
8598a97 dmlyubim: fixing bazel test //rllib:test_cql
0c20a1b dmlyubim: additional cql_sac cleanup
2dbfb9d dmlyubim: removing cql apex sac tests
016bde6 dmlyubim: rolling back non-existent policy call signature in offline component
e099a1d dmlyubim: trying to fix macos python verison at 3.8.15
625bf4b dmlyubim: changing bazel definition for test_cql.
d5abccb dmlyubim: parity with BUILD for test_cql in 1.7.2 (removing data glob) -- does …
d56abda dmlyubim: fixes -- this now runs with the benchmark
fb7ef1a dmlyubim: Rolling back cql_dqn cleanup
90660d0 dmlyubim: trying to add data label to test
057eebf Kiko-Aumond: set recursive mod 777 on /home/vsts/work/_temp/_bazel_vsts directory …
aa2eae4 Kiko-Aumond: use $TEST_TMPDIR env variable instead of literal directory name
e2f9e7f Kiko-Aumond: Kiko/cql 1.7.2 port (#172)
f809b8f dmlyubim: brining more changes from 1.13.0 to update timesteps_total metric cor…
65ddbce dmlyubim: Merge branch 'dmlyubim/cql-1.7.2-port' of github.com:BonsaiAI/ray int…
bf7c81d dmlyubim: REVERTING TO PYTHON 3.8 FOR MAC
1f8c889 Kiko-Aumond: explicitly set MACOSX_DEPLOYMENT_TARGET env variable
3008348 Kiko-Aumond: Merge remote-tracking branch 'origin/dmlyubim/cql-1.7.2-port' into ki…
fcadf4b Kiko-Aumond: removed minor version of Python; renamed steps to relect correct Pyth…
e19f432 Kiko-Aumond: get latest pip version to test MacOs wheels
2dfff9d Kiko-Aumond: updated hash
0752873 Kiko-Aumond: undid changes to info,yml
278c16c Kiko-Aumond: unbounded setuptools
9326393 Kiko-Aumond: undid change
1473ead Kiko-Aumond: Fix MacOs version if bdist_wheel generates incorrect MacOS version ta…
af5edfe Kiko-Aumond: undid changes
4c89d1e Kiko-Aumond: undid changes
e0f89f8 Kiko-Aumond: undid changes
53fb100 Kiko-Aumond: force reinstall tune and upstream requirements
a6543c9 Kiko-Aumond: updatd CI hash
62c1e59 Kiko-Aumond: updated dependencies
e8e2720 Kiko-Aumond: updated requirements
dc74f94 Kiko-Aumond: updated requirements
25c4738 Kiko-Aumond: updated requirements
4ffb1fe Kiko-Aumond: updated requirements
3cd0f4d Kiko-Aumond: updated requirements
9075ae5 Kiko-Aumond: updated ci folder hash
46972ee Kiko-Aumond: updated requirements
0d91855 Kiko-Aumond: updated requirements
d99f6d4 Kiko-Aumond: updates CI hash
3d7bfcb Kiko-Aumond: updated requirements
8853407 Kiko-Aumond: updated requirements
6eeac08 Kiko-Aumond: updated requirements
9b42244 Kiko-Aumond: updated requirements
fa6a757 Kiko-Aumond: undid requirement changes
c6c1561 Kiko-Aumond: updated ci folder hash
ed41b13 Kiko-Aumond: updated requirements
7e076a0 Kiko-Aumond: updated requirements
6316d3b Kiko-Aumond: updated requirements
9b943ca Kiko-Aumond: updated requirements
7fa7440 Kiko-Aumond: updated requirements
a6a468f Kiko-Aumond: updated requirements
c880a18 Kiko-Aumond: updated requirements
1f024d1 Kiko-Aumond: updated dependencies
68d169f Kiko-Aumond: updated requirements
f2b8c73 Kiko-Aumond: updated dependencies
ec88717 Kiko-Aumond: apt update
3e4af46 Kiko-Aumond: fixed GCC download, set Ubuntu 20.04 as default OS for pipeline
3fd1ad1 Kiko-Aumond: updated requirements
5cb8de8 Kiko-Aumond: updated requirements
17f2a7a Kiko-Aumond: fixed setup.py
3e1e2b4 Kiko-Aumond: updated ci hash
9604db6 Kiko-Aumond: fixed setup.py
bc9c7b7 Kiko-Aumond: fixed setup.py
42b0881 Kiko-Aumond: fixed setup.py
0b7d070 Kiko-Aumond: updated requirements
5124b5d Kiko-Aumond: fixed setup.py
91df95f Kiko-Aumond: force reintall of torch and torchvision
8041c80 Kiko-Aumond: updated ci hash
6cf662b Kiko-Aumond: fixed rllib requirements
1cf1948 Kiko-Aumond: updated requirements
52683ff Kiko-Aumond: updated requirements
b90db59 Kiko-Aumond: updated requirements
fb6bb23 Kiko-Aumond: updated requirements
0e7ba4e Kiko-Aumond: updated requirements
feea4f5 Kiko-Aumond: updated requirements
a6836af Kiko-Aumond: updated dependencies
1f33ba2 Kiko-Aumond: updated dependencies
f965da8 Kiko-Aumond: updated requirements
c268d44 Kiko-Aumond: updated requirements
fe4a03f Kiko-Aumond: updated requirements
5ecffd4 Kiko-Aumond: explicitly set locale in MacOS to fix test_signal
e780cf1 Kiko-Aumond: set build Ubuntu OS version to focal
1402b2f Kiko-Aumond: do not bubble up StopIterations in ParallelIteratr
4489a5c Kiko-Aumond: fixed memory_monitor import
ceab8cd Kiko-Aumond: undid non-build changes
50dc53e Kiko-Aumond: Merge remote-tracking branch 'origin/releases/1.3.0' into kiko/focal_…
4a09ad5 Kiko-Aumond: Merge remote-tracking branch 'origin/kiko/focal_build' into kiko/fix_…
4043d11 Kiko-Aumond: added new NoSamplesAvailable exception
bed44d2 Kiko-Aumond: fixed exception handling
8ddc5e9 Kiko-Aumond: fixed tests
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -22,6 +22,7 @@ |
| | |  from ray.rllib.policy.sample_batch import DEFAULT_POLICY_ID |
| | |  from ray.rllib.evaluation.metrics import collect_metrics |
| | |  from ray.rllib.evaluation.worker_set import WorkerSet |
| | | +from ray.util.iter import NoSamplesAvailable |
| | |  from ray.rllib.utils import FilterManager, deep_update, merge_dicts |
| | |  from ray.rllib.utils.spaces import space_utils |
| | |  from ray.rllib.utils.framework import try_import_tf, TensorStructType |
| | | @@ -461,6 +462,26 @@ |
| | |  # yapf: enable |
| | |  |
| | |  |
| | | +def is_memory_error(e: Exception) -> bool: |
| | | +    """Check if an exception occurred due to a process running out of memory.""" |
| | | +    memory_error_names = [ |
| | | +        "ray.memory_monitor.RayOutOfMemoryError", |
| | | +        "RayOutOfMemoryError", |
| | | +    ] |
| | | +    ename = type(e).__name__ |
| | | + |
| | | +    if ename in memory_error_names: |
| | | +        return True |
| | | + |
| | | +    msg_list = list(filter(lambda s: len(s) > 0, str(e).split("\n"))) |
| | | + |
| | | +    if ename.startswith("RayTaskError"): |
| | | +        return any( |
| | | +            any(ename in msg for msg in msg_list) for ename in memory_error_names |
| | | +        ) |
| | | +    return False |
| | | + |
| | | + |
| | |  @DeveloperAPI |
| | |  def with_common_config( |
| | |          extra_config: PartialTrainerConfigDict) -> TrainerConfigDict: |
| | | @@ -601,20 +622,40 @@ def train(self) -> ResultDict: |
| | |          for _ in range(1 + MAX_WORKER_FAILURE_RETRIES): |
| | |              try: |
| | |                  result = Trainable.train(self) |
| | | -            except RayError as e: |
| | | -                if self.config["ignore_worker_failures"]: |
| | | -                    logger.exception( |
| | | -                        "Error in train call, attempting to recover") |
| | | -                    self._try_recover() |
| | | -                else: |
| | | -                    logger.info( |
| | | -                        "Worker crashed during call to train(). To attempt to " |
| | | -                        "continue training without the failed worker, set " |
| | | -                        "`'ignore_worker_failures': True`.") |
| | | -                    raise e |
| | |              except Exception as e: |
| | | -                time.sleep(0.5)  # allow logs messages to propagate |
| | | -                raise e |
| | | +                if isinstance(e, RayError): |
| | | +                    # do not retry in case of OOM errors |
| | | +                    if is_memory_error(e): |
| | | +                        logger.exception("Not attempting to recover from error in train call " |
| | | +                                         "since it was caused by an OOM error", |
| | | +                                         exc_info=e) |
| | | +                        time.sleep(0.5)  # allow logs messages to propagate |
| | | +                        raise e |
| | | +                    else: |
| | | +                        # always retry on NoSamplesAvailable as this is by definition |
| | | +                        # a retryable situation |
| | | +                        if isinstance(e, NoSamplesAvailable): |
| | | +                            logger.info("No samples available yet, retrying.") |
| | | +                            self._try_recover() |
| | | +                        elif self.config["ignore_worker_failures"]: |
Collaborator: Same here.
| | | +                            logger.exception("Error in train call and ignore_worker_failures==True, " |
| | | +                                             "attempting to recover", |
| | | +                                             exc_info=e) |
| | | +                            self._try_recover() |
| | | +                        else: |
| | | +                            logger.info( |
| | | +                                "Worker crashed during call to train(). To attempt to " |
| | | +                                "continue training without the failed worker, set " |
| | | +                                "`'ignore_worker_failures': True`.") |
| | | +                            raise e |
| | | +                else: |
| | | +                    if isinstance(e, StopIteration): |
Collaborator: We don't need to handle it here.
| | | +                        pass |
| | | +                    else: |
| | | +                        logger.exception("Not attempting to recover from error in train call", |
| | | +                                         exc_info=e) |
| | | +                        time.sleep(0.5)  # allow logs messages to propagate |
| | | +                        raise e |
| | |              else: |
| | |                  break |
| | |          if result is None: |
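The exception-classification logic in the hunks above can be exercised without Ray installed. The sketch below is an illustration under assumptions, not the PR's code: `RayError`, `RayTaskError`, `RayOutOfMemoryError`, and `NoSamplesAvailable` are local stand-in classes rather than the real Ray types (`NoSamplesAvailable` is assumed to derive from `RayError` because the diff checks for it inside the `RayError` branch), and `classify` is a hypothetical helper that only names the action the `train()` loop would take. It uses `isinstance` for the `RayError` check, since `issubclass` raises `TypeError` when given an exception instance. The `StopIteration` branch is left out, in line with the review comment that it need not be handled there.

```python
class RayError(Exception):
    """Stand-in for Ray's base error class (assumption, not the real type)."""


class RayTaskError(RayError):
    """Stand-in for a task error whose message carries the remote traceback."""


class RayOutOfMemoryError(Exception):
    """Stand-in for the out-of-memory error named in the diff."""


class NoSamplesAvailable(RayError):
    """Stand-in for the new exception; assumed to subclass RayError."""


MEMORY_ERROR_NAMES = (
    "ray.memory_monitor.RayOutOfMemoryError",
    "RayOutOfMemoryError",
)


def is_memory_error(e: Exception) -> bool:
    """Return True if `e` is, or wraps, an out-of-memory error."""
    ename = type(e).__name__
    if ename in MEMORY_ERROR_NAMES:
        return True
    if ename.startswith("RayTaskError"):
        # A task error embeds the remote traceback in its message text.
        lines = [s for s in str(e).split("\n") if s]
        return any(name in line for name in MEMORY_ERROR_NAMES for line in lines)
    return False


def classify(e: Exception, ignore_worker_failures: bool) -> str:
    """Hypothetical helper: name the action the retry loop would take."""
    if isinstance(e, RayError):
        if is_memory_error(e):
            return "raise"  # never retry after an OOM
        if isinstance(e, NoSamplesAvailable):
            return "retry"  # retryable by definition
        return "retry" if ignore_worker_failures else "raise"
    return "raise"  # non-Ray errors are not recovered from
```

For example, `classify(NoSamplesAvailable(), ignore_worker_failures=False)` yields `"retry"`, while a `RayTaskError` whose message mentions `RayOutOfMemoryError` yields `"raise"` even when `ignore_worker_failures` is set.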
Rather than catching it here and calling `_try_recover()`, we need to handle it in our bonsai code (I'll show you).
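One way to read this suggestion is that the trainer lets `NoSamplesAvailable` propagate and the calling code decides whether to retry. The sketch below is only a guess at that shape: `NoSamplesAvailable` is a local stand-in class, and `train_with_retry`, `max_attempts`, and `backoff_s` are hypothetical names that come neither from the PR nor from the bonsai code.

```python
import time
from typing import Callable, Dict


class NoSamplesAvailable(Exception):
    """Stand-in for the exception added in this PR (assumption)."""


def train_with_retry(train_fn: Callable[[], Dict],
                     max_attempts: int = 3,
                     backoff_s: float = 0.0) -> Dict:
    """Call `train_fn`, retrying only when no samples are available yet.

    Any other exception propagates immediately; the last
    NoSamplesAvailable is re-raised once the attempts are used up.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return train_fn()
        except NoSamplesAvailable:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_s)  # give workers time to produce samples
    raise AssertionError("unreachable")
```

A caller would pass something like `trainer.train` as `train_fn`; keeping the retry outside the trainer leaves `_try_recover()` reserved for genuine worker failures.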