
Fix logging with log_gpu_memory='min_max' #9013

Merged
merged 6 commits into master on Aug 24, 2021

Conversation

SkafteNicki
Member

@SkafteNicki SkafteNicki commented Aug 20, 2021

What does this PR do?

Fixes #9010
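
For reference, a minimal, hypothetical reproduction of the configuration that hits the parsing error; the tiny model and random data below are illustrative only and not taken from issue #9010.

```python
# Hypothetical repro sketch: the model and data are placeholders, not from #9010.
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


train_data = DataLoader(
    TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))), batch_size=8
)

# With log_gpu_memory="min_max", only the min/max GPU memory across devices is
# logged each step; this is the code path that raised the metrics-key error.
trainer = pl.Trainer(gpus=1, log_gpu_memory="min_max", max_epochs=1)
trainer.fit(TinyModel(), train_data)
```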

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@codecov

codecov bot commented Aug 20, 2021

Codecov Report

Merging #9013 (1b90c77) into master (ad3f183) will decrease coverage by 4%.
The diff coverage is 0%.

@@           Coverage Diff           @@
##           master   #9013    +/-   ##
=======================================
- Coverage      93%     89%    -4%     
=======================================
  Files         175     175            
  Lines       14399   14381    -18     
=======================================
- Hits        13339   12738   -601     
- Misses       1060    1643   +583     

@carmocca
Contributor

I think we should consider reverting the changes in #8174 and exploring other options.

See the open review comments in the PR.

Additionally, calling lightning_module.log from the LoggerConnector seems like a very bad design pattern which we should avoid, as that function is meant to be used by users.

Since it's quite possible that log_gpu_memory gets removed in favor of the GPUStatsMonitor callback, I think we should instead call lightning_module.log within that callback. cc @ananthsub

These comments are not strictly related to the changes you did here @SkafteNicki. Just commenting here because reverting it would make this patch unnecessary.
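
For illustration, a rough, hypothetical sketch of the callback approach suggested above, with the min/max aggregation done inside a callback and logged through pl_module.log. The class name GPUMemoryLogger, the metric keys, and the use of torch.cuda instead of nvidia-smi are assumptions for this sketch, not the actual GPUStatsMonitor implementation.

```python
# Hypothetical callback sketch; not the real GPUStatsMonitor.
import torch
from pytorch_lightning import Callback


class GPUMemoryLogger(Callback):
    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        if not torch.cuda.is_available():
            return
        # Memory currently allocated on each visible device, in MiB.
        mem = [
            torch.cuda.memory_allocated(i) / 2**20
            for i in range(torch.cuda.device_count())
        ]
        # Logging through the LightningModule keeps the user-facing log() API
        # as the single entry point, instead of the LoggerConnector.
        pl_module.log("min_gpu_mem", min(mem), on_step=True, on_epoch=False)
        pl_module.log("max_gpu_mem", max(mem), on_step=True, on_epoch=False)
```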

@ananthsub
Contributor

ananthsub commented Aug 20, 2021

@carmocca I agree, I don't think this logic needs to sit in the trainer this way. Calling the lightning module's log from within the trainer, which under the hood actually calls back into the trainer, is too hard to keep track of :(

mode=min_max is specific to nvidia-smi. There's another open issue around using torch.cuda.memory directly for logging, especially for pytorch 1.8 and above: #8780
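
For context, torch.cuda already exposes per-device memory counters; a rough sketch (illustrative only, not the proposal tracked in #8780) of reading them without shelling out to nvidia-smi:

```python
# Sketch of querying GPU memory via torch.cuda instead of parsing nvidia-smi.
import torch

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    stats = {
        # Bytes currently allocated by tensors on the device.
        "allocated_bytes": torch.cuda.memory_allocated(device),
        # Peak allocated bytes since the last reset_peak_memory_stats() call.
        "max_allocated_bytes": torch.cuda.max_memory_allocated(device),
        # Bytes reserved by the caching allocator.
        "reserved_bytes": torch.cuda.memory_reserved(device),
    }
    print(stats)
```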

The more places that rely on mode=min_max, the harder this will be to transition, so I'm in favor of corralling this into one place.

cc @edward-io

@ananthsub added the logging label (Related to the `LoggerConnector` and `log()`) on Aug 20, 2021
Contributor

@awaelchli awaelchli left a comment


@carmocca @ananthsub while I agree #8174 was not optimal, it broke what is being fixed here, so I think we should still go ahead with Nicki's fix, and we have the adjustments to the test as well.

@mergify mergify bot added the ready label (PRs ready to be merged) on Aug 22, 2021
@ananthsub
Contributor

> @carmocca @ananthsub while I agree #8174 was not optimal, it broke what is being fixed here, so I think we should still go ahead with Nicki's fix, and we have the adjustments to the test as well.

I agree, we have to fix forward here. I filed #9032 to see how we can better formalize this.

@carmocca carmocca added this to the v1.4.x milestone Aug 22, 2021
@awaelchli awaelchli merged commit 81145ca into Lightning-AI:master Aug 24, 2021
Labels
  • bug: Something isn't working
  • logging: Related to the `LoggerConnector` and `log()`
  • ready: PRs ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

log_gpu_memory='min_max' leads to error in parsing metrics keys
5 participants