Hardware specific parts of Accelerator Refactoring #5719
Conversation
Co-Authored with @awaelchi
Co-authored-by: Adrian Wälchli <[email protected]>
Codecov Report
@@              Coverage Diff               @@
##    release/1.2-dev    #5719    +/- ##
================================================
- Coverage          89%      86%       -4%
================================================
  Files             173      181        +8
  Lines           12495    13110      +615
================================================
+ Hits            11175    11226       +51
- Misses           1320     1884      +564
Seems to be the last remaining issue:
@justusschock Really good work there! Not all comments should be addressed in this PR.
if self.is_global_zero:
    # load weights saved in ddp
    path = os.path.join(original_model.trainer.default_root_dir, "__temp_weight_distributed_end.ckpt")
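For context, this is the usual pattern with spawn-based distributed backends: the spawned rank-zero process writes its weights to a well-known temp file under the trainer's root dir, and the parent process reloads them afterwards. A minimal sketch of that pattern, where the helper names are hypothetical and only the filename comes from the snippet above:

```python
import os

# Single source of truth for the temp checkpoint filename used by the
# spawned rank-zero process (filename taken from the snippet above).
TEMP_CKPT_NAME = "__temp_weight_distributed_end.ckpt"


def temp_checkpoint_path(default_root_dir):
    # Build the path where rank zero dumps its weights.
    return os.path.join(default_root_dir, TEMP_CKPT_NAME)


def weights_path_if_rank_zero(is_global_zero, default_root_dir):
    # Only the global-zero process deals with the temp checkpoint;
    # all other ranks get None and skip the reload.
    if is_global_zero:
        return temp_checkpoint_path(default_root_dir)
    return None
```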
I feel like __temp_weight_distributed_end.ckpt should be a GLOBAL property.
what do you mean by global property?
class ...:
    TEMPORARY_WEIGHT_PATH = "__temp_weight_distributed_end.ckpt"

    def __init__(...):
        ...

    def ...:
        path = os.path.join(original_model.trainer.default_root_dir, self.TEMPORARY_WEIGHT_PATH)
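The suggestion above, expanded into a small runnable sketch. The class name and constructor here are placeholders, not the actual Lightning plugin; the point is that hoisting the magic filename into a class-level constant gives every call site one shared definition:

```python
import os


class SomeTrainingTypePlugin:
    # Class-level constant, as the reviewer suggests: the filename is
    # defined once instead of being repeated at each call site.
    TEMPORARY_WEIGHT_PATH = "__temp_weight_distributed_end.ckpt"

    def __init__(self, default_root_dir):
        # Placeholder constructor; the real plugin gets this from the trainer.
        self.default_root_dir = default_root_dir

    def temp_weight_path(self):
        # Builds the same path as the snippet, via the shared constant.
        return os.path.join(self.default_root_dir, self.TEMPORARY_WEIGHT_PATH)
```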
# load weights if not interrupted
# TODO: check for trainer reference
if self.on_colab_kaggle and not model.trainer.testing:
Could we have a ColabEnv for TPU and move some logic there? In another PR :)
Not sure I completely understand this... You mean a ClusterEnvironment for that? We can certainly look into that.
Not sure Colab is a ClusterEnvironment. I meant something more like a TPURunningEnvironment -> Colab, Kaggle_colab, etc...
Best,
T.C
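The idea floated above could look roughly like the sketch below. The env-var names COLAB_GPU and KAGGLE_URL_BASE are common detection heuristics for these notebook environments; treat them, and the whole class, as assumptions rather than a confirmed Lightning API:

```python
import os


class TPURunningEnvironment:
    # Hypothetical environment abstraction: detect which notebook
    # environment (if any) the TPU job is running in.

    @staticmethod
    def detect():
        # Heuristic env-var checks; these variable names are assumptions.
        if os.environ.get("COLAB_GPU"):
            return "colab"
        if os.environ.get("KAGGLE_URL_BASE"):
            return "kaggle"
        return "other"

    @classmethod
    def on_colab_kaggle(cls):
        # Mirrors the role of the self.on_colab_kaggle check in the
        # earlier snippet, but behind one environment class.
        return cls.detect() in ("colab", "kaggle")
```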
Co-authored-by: chaton <[email protected]>
What does this PR do?
Adds the hardware-specific parts of the refactoring (#5616) as Accelerators for CPU, GPU and TPU, plus the accelerator connector and the specific plugins for single-device training, single-TPU training and multi-TPU training (TPUSpawn).
Only files with the new classes are added, and some imports are changed. There is no integration into the Trainer yet.
Should be merged after #5715, #5718 and #5714.
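The split the PR description outlines can be sketched as a minimal class layout: hardware-specific Accelerators delegating to training-type plugins. The class names mirror the description above; every method body here is an illustrative assumption, not the real implementation:

```python
# Illustrative sketch only: Accelerators own the hardware, training-type
# plugins own how training is distributed/placed.


class TrainingTypePlugin:
    """Owns how training is run (single device, TPU spawn, ...)."""

    def setup(self, model):
        # A real plugin would move/wrap the model here.
        return model


class SingleDevicePlugin(TrainingTypePlugin):
    def __init__(self, device):
        self.device = device


class Accelerator:
    """Pairs one hardware backend with a training-type plugin."""

    def __init__(self, training_type_plugin):
        self.training_type_plugin = training_type_plugin

    def setup(self, model):
        # Hardware setup would happen here; then delegate to the plugin.
        return self.training_type_plugin.setup(model)


class CPUAccelerator(Accelerator):
    pass


class GPUAccelerator(Accelerator):
    pass


class TPUAccelerator(Accelerator):
    pass
```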