
[1/4] Add get_device_stats to accelerator interface #9586

Merged · 31 commits · Sep 27, 2021

Conversation

@daniellepintz (Contributor) commented Sep 17, 2021

What does this PR do?

Adds a get_device_stats function to the accelerator class and implements it for the GPU and TPU accelerators. It will be called in later PRs.
Tested with daniellepintz#2

Part of #9032

Please also see subsequent PRs:
daniellepintz#2
daniellepintz#3
daniellepintz#4
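For readers skimming the thread, a minimal sketch of the shape of such an interface follows. All names and return keys here are illustrative assumptions, not the PR's actual code; the real implementation queries CUDA/XLA runtime counters.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class Accelerator(ABC):
    """Base accelerator: each backend reports its own device statistics."""

    @abstractmethod
    def get_device_stats(self, device: str) -> Dict[str, Any]:
        """Return a mapping of stat names to values for the given device."""


class CPUAccelerator(Accelerator):
    def get_device_stats(self, device: str) -> Dict[str, Any]:
        # No device-specific counters in this sketch; return an empty dict.
        return {}


class GPUAccelerator(Accelerator):
    def get_device_stats(self, device: str) -> Dict[str, Any]:
        # A real implementation would query something like
        # torch.cuda.memory_stats(device) or nvidia-smi; these placeholder
        # keys merely illustrate the dict-of-stats contract.
        return {"allocated_bytes.all.current": 0, "reserved_bytes.all.current": 0}
```

The key design point is that each accelerator owns its stats collection, so callbacks can stay device-agnostic.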

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@mergify bot removed the "has conflicts" label · Sep 17, 2021
@codecov bot commented Sep 17, 2021

Codecov Report

Merging #9586 (ccadca5) into master (c7451b3) will decrease coverage by 4%.
The diff coverage is 28%.

❗ Current head ccadca5 differs from pull request most recent head 07bc597. Consider uploading reports for the commit 07bc597 to get more accurate results

@@           Coverage Diff           @@
##           master   #9586    +/-   ##
=======================================
- Coverage      93%     89%    -4%     
=======================================
  Files         181     179     -2     
  Lines       15349   15350     +1     
=======================================
- Hits        14246   13591   -655     
- Misses       1103    1759   +656     

@mergify bot added the "ready" (PRs ready to be merged) label · Sep 21, 2021
@tchaton (Contributor) commented Sep 21, 2021

Hey @daniellepintz, thanks for linking all the following PRs :)

Noob question for PR 2/4: should we differentiate between on_train_batch_start and on_train_batch_end?

Best,
T.C

@ananthsub (Contributor) left a comment


@daniellepintz it'd be good to also add tests for the GPU and TPU accelerator to confirm that get_device_stats returns something

tests/accelerators/test_accelerator_connector.py Outdated Show resolved Hide resolved
@daniellepintz (Contributor, Author) commented Sep 21, 2021

Noob question for PR 2/4: should we differentiate between on_train_batch_start and on_train_batch_end?

@tchaton this was modeled after GPUStatsMonitor where we log in both places. I myself am not sure what is best so please lmk if you think I should change this
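For context, the pattern being discussed logs device stats from both hooks with distinguishing prefixes. The following is only a sketch of that idea, not the actual GPUStatsMonitor code; the accelerator, logger, and hook signatures here are simplified assumptions.

```python
class DeviceStatsMonitor:
    """Sketch: log device stats at both batch start and batch end.

    `accelerator` is any object exposing get_device_stats(device) -> dict;
    `log_fn` is any callable accepting a dict of metrics (both hypothetical).
    """

    def __init__(self, accelerator, log_fn) -> None:
        self.accelerator = accelerator
        self.log_fn = log_fn

    def _log(self, device: str, prefix: str) -> None:
        stats = self.accelerator.get_device_stats(device)
        # Prefix each key so start and end measurements stay distinguishable.
        self.log_fn({f"{prefix}/{k}": v for k, v in stats.items()})

    def on_train_batch_start(self, device: str) -> None:
        self._log(device, "on_train_batch_start")

    def on_train_batch_end(self, device: str) -> None:
        self._log(device, "on_train_batch_end")
```

Logging at both points lets users see memory growth attributable to the batch itself, which is one argument for keeping the two hooks separate.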

@daniellepintz (Contributor, Author) commented

@ananthsub I believe any tests I would write would be similar to the ones in https://github.com/daniellepintz/pytorch-lightning/pull/2/files, or did you have something else in mind?

@ananthsub (Contributor) commented

@ananthsub I believe any tests I would write would be similar to the ones in https://github.com/daniellepintz/pytorch-lightning/pull/2/files, or did you have something else in mind?

The tests there are end-to-end tests relying on multiple components: the accelerator, callback, and trainer.

The accelerator tests would cover similar functionality but should be more self-contained, as a unit test specifically for get_device_stats. For example, we could instantiate some large tensors on GPU and ensure that we get corresponding keys back from the GPU accelerator as a sanity check.

Having the tests both here and in the callback will help detect whether something breaks in one place vs. another.
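A self-contained unit test along those lines might look like the sketch below. The fake accelerator and its stat values are stand-ins so the sketch runs without CUDA; a real test would allocate tensors on an actual GPU and call the real GPUAccelerator.

```python
from typing import Any, Dict


class FakeGPUAccelerator:
    """Stand-in for the GPU accelerator so this sketch runs without a GPU."""

    def get_device_stats(self, device: str) -> Dict[str, Any]:
        # Placeholder values; a real accelerator would report live counters.
        return {"allocated_bytes.all.current": 1024, "reserved_bytes.all.current": 2048}


def test_get_device_stats_returns_nonempty_dict() -> None:
    # On real hardware one would first allocate a large tensor, e.g.
    #   torch.randn(1000, 1000, device="cuda:0")
    # and then query the real accelerator.
    stats = FakeGPUAccelerator().get_device_stats("cuda:0")
    assert isinstance(stats, dict) and stats, "expected a non-empty stats dict"
    assert all(isinstance(k, str) for k in stats), "stat keys should be strings"
```

The point of the unit test is isolation: it exercises only get_device_stats, so a failure pinpoints the accelerator rather than the callback or trainer wiring.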

@daniellepintz (Contributor, Author) commented

Looking at the output of the CI TPU tests I see two odd things:

  1. There are a bunch of RuntimeErrors but it says at the bottom that all the tests passed (this is the case on other PRs as well)
  2. I can't find test_device_stats_tpu (added in this PR) in the output; it doesn't seem to be running.

Does anyone have any insight into these issues?

@mergify bot removed the "has conflicts" label · Sep 23, 2021
@tchaton (Contributor) commented Sep 23, 2021

Hey @kaushikb11, mind looking into @daniellepintz's TPU testing?

@daniellepintz merged commit ab06987 into Lightning-AI:master on Sep 27, 2021.
Labels: design (Includes a design discussion), feature (Is an improvement or enhancement), ready (PRs ready to be merged)

9 participants