[bugfix] Apex never instantiated. #7274
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master   #7274    +/-   ##
========================================
- Coverage      91%     91%      -0%
========================================
  Files         200     200
  Lines       12850   12854       +4
========================================
- Hits        11730   11722       -8
- Misses       1120    1132      +12
Just to ensure I understand from a high level: the model is only moved to the device after calling ... EDIT: after speaking to @tchaton, he confirmed this, and also mentioned that this now works for ddp_spawn, which is pretty neat :)
@@ -107,6 +107,11 @@ def pre_dispatch(self, trainer: 'pl.Trainer') -> None:
        self.setup_optimizers(trainer)
        self.precision_plugin.pre_dispatch()

    def dispatch(self, trainer: 'pl.Trainer') -> None:
        """Hook to do something before the training/evaluation/prediction starts."""
It might be a good idea to clarify somewhere here that this happens after accelerator setup? Otherwise this looks the same as pre_dispatch.
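For context, here is a rough outline of the order in which these hooks appear to run; this sequence is inferred from the diff and the surrounding discussion, not from documented behaviour:

```python
# Assumed hook ordering (illustrative outline only, not the actual Trainer code):
#
#   1. accelerator / plugin setup          -> plugins are connected and the model
#                                             is wrapped
#   2. accelerator.pre_dispatch(trainer)   -> setup_optimizers(trainer) and
#                                             precision_plugin.pre_dispatch()
#   3. accelerator.dispatch(trainer)       -> the new hook added in this PR, the
#                                             last stop before the loop starts
#   4. the training / evaluation / prediction loop runs
#   5. accelerator post-dispatch / teardown
```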
@tchaton, as the order in which these hooks are executed could be confusing.
I find the pre/dispatch/post naming confusing now :/
Yes, we should think about the naming of these hooks. But more importantly, I think we can do a better job at formally defining what these hooks are supposed to do. Maybe another action item for 1.3 is to do a full pass over the plugins and improve all these docs. That would help everyone: 1. implementing plugins, 2. fixing them, and 3. reviewing plugin PRs.
n00b question: would this be easier if the precision plugin was owned by the training type plugin instead of the accelerator?
Yes, I think that could make it easier to interleave these operations between one plugin and the other. Here in this PR we see that the precision plugin needs to configure the model before it is wrapped, and needs to overwrite the reference in the training plugin. This really breaks the contract that these plugins currently have with each other.
Yes, we should definitely refactor this and move optimizers and lr_schedulers to the training_type_plugin.
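Purely as an illustration of that direction, a hypothetical ownership structure could look like the sketch below. The class and method names are illustrative, not the real Lightning classes: the point is that if the training type plugin owns the precision plugin and the optimizers, precision configuration can happen before distributed wrapping without any plugin overwriting another plugin's references.

```python
# Hypothetical sketch of the discussed ownership change; names are illustrative.
from typing import Any, List, Tuple


class PrecisionPlugin:
    def connect(self, model: Any, optimizers: List[Any]) -> Tuple[Any, List[Any]]:
        # e.g. apex-style initialization; returns possibly-patched objects
        return model, optimizers


class TrainingTypePlugin:
    def __init__(self, precision_plugin: PrecisionPlugin):
        self.precision_plugin = precision_plugin  # owned here, not by the accelerator
        self.model: Any = None
        self.optimizers: List[Any] = []

    def setup(self, model: Any, optimizers: List[Any]) -> None:
        # precision configuration runs before distributed wrapping, in one place
        self.model, self.optimizers = self.precision_plugin.connect(model, optimizers)
        # ... wrap self.model for DDP / spawn here ...
```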
Approved, but I would definitely like to see @justusschock's or @awaelchli's comments!
def dispatch(self, trainer: "pl.Trainer") -> None:
    # Lazily configure apex on the first dispatch call, once the accelerator
    # has set up the model and optimizers.
    if not self._connected:
        accelerator = trainer.accelerator
        model, optimizers = self.configure_apex(accelerator.lightning_module, accelerator.optimizers)
        self.reinit_scheduler_properties(optimizers, accelerator.lr_schedulers)
        # keep references to the apex-patched model and optimizers
        self.model = model
        self.optimizers = optimizers
        self._connected = True
    return super().dispatch(trainer)
It's not specific to apex. Any optimizer that defines state which isn't lazily instantiated needs to handle the device move?
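To make that concrete, here is an illustrative example (not code from this PR or from apex) of an optimizer whose eagerly created state would not follow the parameters to the GPU:

```python
import torch

# Illustrative only: an optimizer that allocates its state eagerly at
# construction time, before the model is moved to its target device.
class EagerStateSGD(torch.optim.SGD):
    def __init__(self, params, lr=0.01, momentum=0.9):
        super().__init__(params, lr=lr, momentum=momentum)
        # allocate momentum buffers on whatever device the params live on *now*
        for group in self.param_groups:
            for p in group["params"]:
                self.state[p]["momentum_buffer"] = torch.zeros_like(p)

model = torch.nn.Linear(4, 4)
optimizer = EagerStateSGD(model.parameters())
if torch.cuda.is_available():
    model.cuda()  # parameters move to the GPU in place...
# ...but the momentum buffers in optimizer.state are still on the CPU, so the
# device move has to be handled somewhere (plugin, accelerator, or user code).
```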
Hey @ananthsub, I'm not sure I fully follow you :)
@ananthsub I think this is handled by @awaelchli in #7277
Hey @ananthsub, I will make another PR to rename them. Best,
Cautious approval; please see my comments :))
@pytest.mark.parametrize("amp_level", ['O2'])
def test_amp_apex_ddp_fit(amp_level, tmpdir):
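The body of the test is not shown above. Purely as an illustration, a test of this shape might look roughly like the sketch below; the Trainer arguments, the skip condition, and the BoringModel helper are assumptions for the sketch, not the actual test (it also requires NVIDIA apex to be installed):

```python
import pytest
import torch
from pytorch_lightning import Trainer
from tests.helpers.boring_model import BoringModel


@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="requires 2 GPUs")
@pytest.mark.parametrize("amp_level", ["O2"])
def test_amp_apex_ddp_fit_sketch(amp_level, tmpdir):
    # minimal fit with apex mixed precision and ddp_spawn on 2 GPUs
    model = BoringModel()
    trainer = Trainer(
        default_root_dir=tmpdir,
        fast_dev_run=True,
        precision=16,
        amp_backend="apex",
        amp_level=amp_level,
        gpus=2,
        accelerator="ddp_spawn",
    )
    trainer.fit(model)
```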
We have another apex test in tests/models/test_amp.py::test_amp_with_apex. It uses 1 GPU.
@tchaton, your fix only applies to multi-GPU due to the dispatch, am I right? Single GPU seems to be ok.
Yes.
Also carefully approving. I get that this is necessary, but I don't like simply assigning optimiser and model to every class :)
Hello @tchaton! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-04-30 16:50:17 UTC
What does this PR do?
This PR adds a dispatch hook to the accelerator, as it is needed to properly instantiate Apex with ddp_spawn.
Fixes #7271
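For background, here is a simplified illustration of the ordering constraint using the apex API directly (not the Lightning plugin code): apex expects amp.initialize to be called only after the model is on the GPU, and with ddp_spawn the model only reaches the GPU inside the spawned process, which is why a hook that runs at that later point is needed before apex can be configured.

```python
# Simplified illustration of the ordering constraint (plain apex usage, not the
# Lightning plugin code). Requires NVIDIA apex and a CUDA device.
import torch
from apex import amp

model = torch.nn.Linear(32, 32).cuda()          # 1. model must be on the device first
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# 2. only then can apex patch the model and optimizer for mixed precision
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

# 3. distributed wrapping (e.g. DistributedDataParallel) comes after this step,
#    which is why apex has to be configured from a hook that runs inside the
#    spawned process, once the model has actually been moved to the GPU.
```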
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines.
Did you have fun?
Make sure you had fun coding 🙃