[Fix] Move log value to cpu. #4592
Conversation
@@ -136,6 +136,10 @@ def log(
         if sync_dist and isinstance(value, (torch.Tensor, numbers.Number)):
             value = sync_fn(value, group=sync_dist_group, reduce_op=sync_dist_op)

+        # no need to keep on gpu
+        if isinstance(value, torch.Tensor) and value.is_cuda:
+            value = value.cpu()
Shouldn't there also be a detach()?
# no metrics should be logged with graphs
if not enable_graph and isinstance(value, torch.Tensor):
    value = value.detach()

# sync across workers when using distributed training
sync_fn = sync_fn or sync_ddp_if_available
if sync_dist and isinstance(value, (torch.Tensor, numbers.Number)):
    value = sync_fn(value, group=sync_dist_group, reduce_op=sync_dist_op)

# no need to keep on gpu
if isinstance(value, torch.Tensor) and value.is_cuda:
    value = value.cpu()
detach() is called just before.
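For context, a minimal standalone sketch (not part of the PR) of why the detach-then-cpu order matters when logging: keeping the raw CUDA tensor retains both its autograd graph and its GPU storage, so accumulating logged values across steps grows GPU memory. The helper name to_loggable is made up for illustration.

import torch

def to_loggable(value):
    # Detach a logged value from the graph and move it off the GPU,
    # mirroring the order in the snippet above: detach first, then cpu().
    if isinstance(value, torch.Tensor):
        value = value.detach()      # drop the autograd graph
        if value.is_cuda:
            value = value.cpu()     # release the GPU copy once logged
    return value

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    logged = []
    for _ in range(3):
        loss = (torch.randn(8, device=device, requires_grad=True) ** 2).mean()
        logged.append(to_loggable(loss))  # safe to accumulate: no graph, no GPU memory held
    print([v.device for v in logged])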
pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py (outdated review thread, resolved)
Have you also checked the same thing with the validation loop?
Hey @Vozf, did you join Slack? If not, please do, so we can quickly resolve this bug :) Best,
Joined now as Vozf. I don't have any more questions, except whether it works for validation and testing and whether it fixes the bug in general :)
Codecov Report

@@            Coverage Diff            @@
##           master   #4592     +/-   ##
========================================
- Coverage      93%      93%      -0%
========================================
  Files         116      116
  Lines        8883     8907     +24
========================================
+ Hits         8278     8291     +13
- Misses        605      616     +11
Do we need a test for this as well? (assert the tensor device is CPU)
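A rough sketch of what such a test could look like, purely as an illustration: it assumes a CUDA device, the Trainer flags of this release (gpus, fast_dev_run, and the new move_metrics_to_cpu), and that the logged value surfaces in trainer.callback_metrics; the model and test names are hypothetical.

import pytest
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class CpuMetricModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        loss = self.layer(x).mean()
        self.log("train_loss", loss)  # starts on the GPU; should end up on the CPU
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


@pytest.mark.skipif(not torch.cuda.is_available(), reason="needs a GPU")
def test_logged_metric_is_moved_to_cpu():
    model = CpuMetricModel()
    loader = DataLoader(TensorDataset(torch.randn(8, 32)), batch_size=4)
    trainer = pl.Trainer(gpus=1, fast_dev_run=True, move_metrics_to_cpu=True)
    trainer.fit(model, loader)
    assert trainer.callback_metrics["train_loss"].device == torch.device("cpu")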
…chLightning/pytorch-lightning into bugfix/4556_gpu_memory_log
Let's try the
This is fine by me, just some minor questions.
Co-authored-by: Justus Schock <[email protected]>
pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py (outdated review thread, resolved)
Co-authored-by: Jirka Borovec <[email protected]>
…onnector.py Co-authored-by: Jirka Borovec <[email protected]>
@edenlightning swapping this to 1.1 since it's tied to the logging refactor.
* move value to cpu to save memory
* update
* move to cpu
* try something
* update
* update
* add back out_dict.update({k: v})
* add move_metrics_to_cpu
* update
* Update pytorch_lightning/utilities/memory.py
  Co-authored-by: Justus Schock <[email protected]>
* resolve comments
* Update pytorch_lightning/core/step_result.py
  Co-authored-by: Jirka Borovec <[email protected]>
* Update pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py
  Co-authored-by: Jirka Borovec <[email protected]>

Co-authored-by: Justus Schock <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Sean Naren <[email protected]>
What does this PR do?
This PR fixes a GPU memory leak. It adds a move_metrics_to_cpu parameter to the Trainer to force all result objects to be moved to the CPU.

Fixes #4556
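For illustration, a minimal usage sketch of the new flag, assuming a single-GPU setup and the Trainer arguments of this release:

import pytorch_lightning as pl

# Keep logged/result tensors on the CPU so they do not pin GPU memory across steps.
trainer = pl.Trainer(
    max_epochs=1,
    gpus=1,                    # train on the GPU as usual
    move_metrics_to_cpu=True,  # flag introduced by this PR
)
# trainer.fit(model, train_dataloader)  # model/dataloader defined elsewhere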
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:
Did you have fun?
Make sure you had fun coding 🙃