Bugfix/cuda oom detection and handling #6934
Codecov Report
@@ Coverage Diff @@
## master #6934 +/- ##
=======================================
- Coverage 92% 87% -5%
=======================================
Files 194 194
Lines 12410 12414 +4
=======================================
- Hits 11428 10777 -651
- Misses 982 1637 +655
This PR fixes #6819, right? If so, could you update the description of this PR?
    return isinstance(exception, RuntimeError) \
        and len(exception.args) == 1 \
-       and "CUDA out of memory." in exception.args[0]
+       and "CUDA" in exception.args[0] \
+       and "out of memory" in exception.args[0]
It would be easier to read if the result were stored in a variable and then just returned...
I think so, but I was trying to be as non-invasive as possible :)
good invasion is always welcome :]
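For reference, the suggestion in this thread could look roughly like the sketch below. It assumes the two substring checks from the diff above and only assigns the result to a local variable before returning it; the variable name is_oom is an illustrative choice, not taken from the PR.

```python
def is_cuda_out_of_memory(exception: BaseException) -> bool:
    """Return True if the exception looks like a CUDA out-of-memory error."""
    # Same conditions as in the diff, collected in one named result.
    # Note: like the diff, this assumes args[0] is a string.
    is_oom = (
        isinstance(exception, RuntimeError)
        and len(exception.args) == 1
        and "CUDA" in exception.args[0]
        and "out of memory" in exception.args[0]
    )
    return is_oom
```

With the two separate checks, a message that contains both "CUDA" and "out of memory" is matched even when the two phrases are not adjacent, which the single fixed string would have missed.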
What does this PR do?
This PR addresses the is_cuda_out_of_memory function missing some CUDA OOM errors, as it only looks for a single fixed string. The problem in my case turned out to be a little more complex, as garbage_collection_cuda was itself generating OOM errors, so I had to handle those too.
Fixes #958
Fixes #6819
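A minimal sketch of the second part of the fix, under stated assumptions: the cleanup step is passed in as a hypothetical empty_cache callable (standing in for something like torch.cuda.empty_cache) so the example runs without a GPU, and the body of garbage_collection_cuda shown here is illustrative rather than the code merged in this PR. The idea is that an OOM error raised during cleanup is swallowed, while any other error still propagates.

```python
import gc
from typing import Callable


def is_cuda_out_of_memory(exception: BaseException) -> bool:
    """Return True if the exception looks like a CUDA out-of-memory error."""
    return (
        isinstance(exception, RuntimeError)
        and len(exception.args) == 1
        and "CUDA" in exception.args[0]
        and "out of memory" in exception.args[0]
    )


def garbage_collection_cuda(empty_cache: Callable[[], None]) -> None:
    """Run garbage collection, tolerating an OOM raised by the cleanup itself."""
    gc.collect()
    try:
        # Hypothetical stand-in for the real cache-clearing call.
        empty_cache()
    except RuntimeError as exception:
        # Only OOM errors are ignored; everything else is re-raised.
        if not is_cuda_out_of_memory(exception):
            raise
```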
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃