
[bugfix] Apex never instantiated. #7274

Merged
merged 18 commits into master from apex_resolve_bug on Apr 30, 2021

Conversation

tchaton
Contributor

@tchaton tchaton commented Apr 29, 2021

What does this PR do?

This PR adds a dispatch hook to the accelerator, which is needed to properly instantiate Apex with ddp_spawn.

Fixes #7271
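
For context, a minimal sketch of the shape of the change is below. Only the signature and docstring of the new hook appear in the diffs quoted later in this thread, so the forwarding body shown here is an assumption for illustration, not the merged code.

# Illustrative sketch only; the dispatch body is assumed, not taken from the diff.
class Accelerator:
    def __init__(self, precision_plugin, training_type_plugin):
        self.precision_plugin = precision_plugin
        self.training_type_plugin = training_type_plugin

    def setup_optimizers(self, trainer) -> None:
        ...  # creates optimizers from the LightningModule (stubbed here)

    def pre_dispatch(self, trainer) -> None:
        # existing hook (see the quoted diff further down)
        self.setup_optimizers(trainer)
        self.precision_plugin.pre_dispatch()

    def dispatch(self, trainer) -> None:
        # new hook: runs after pre_dispatch, once the model sits on its device,
        # so a plugin such as Apex can be configured against the CUDA model
        self.training_type_plugin.dispatch(trainer)
        self.precision_plugin.dispatch(trainer)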

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@tchaton tchaton added this to the v1.3 milestone Apr 29, 2021
@tchaton tchaton self-assigned this Apr 29, 2021
@tchaton tchaton changed the title [bugfix] Apex was broken [bugfix] Apex never instantiated. Apr 29, 2021
@codecov

codecov bot commented Apr 29, 2021

Codecov Report

Merging #7274 (b48017d) into master (44fd017) will decrease coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #7274   +/-   ##
======================================
- Coverage      91%     91%   -0%     
======================================
  Files         200     200           
  Lines       12850   12854    +4     
======================================
- Hits        11730   11722    -8     
- Misses       1120    1132   +12     

@tchaton tchaton added bug Something isn't working priority: 0 High priority task labels Apr 29, 2021
@SeanNaren
Contributor

SeanNaren commented Apr 29, 2021

Just to ensure I understand at a high level: because the model is only moved to the device after connect is called in the precision plugin, you've introduced a dispatch hook that runs after pre_dispatch, which allows the precision plugin to be called after the model has been moved to the correct device? This fixes the Apex issue, since the model has to be moved to CUDA first.

EDIT: after speaking to @tchaton, he confirmed this, and also mentioned that it now works for ddp_spawn, which is pretty neat :)
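
For readers unfamiliar with Apex, the ordering constraint described above comes from apex.amp itself: the model must already be on the GPU when amp.initialize is called, and DDP wrapping should happen afterwards. A minimal standalone sketch (plain Apex, no Lightning, assuming Apex is installed and a process group is already set up):

import torch
from apex import amp  # NVIDIA Apex, assumed installed

model = torch.nn.Linear(32, 2).cuda()                     # 1. move the model to the GPU first
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")   # 2. then let Apex patch model/optimizer
model = torch.nn.parallel.DistributedDataParallel(model)              # 3. only then wrap in DDP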

@@ -107,6 +107,11 @@ def pre_dispatch(self, trainer: 'pl.Trainer') -> None:
        self.setup_optimizers(trainer)
        self.precision_plugin.pre_dispatch()

    def dispatch(self, trainer: 'pl.Trainer') -> None:
        """Hook to do something before the training/evaluation/prediction starts."""
Contributor

It might be a good idea to clarify somewhere here that this happens after accelerator setup? Otherwise this looks the same as pre_dispatch.

Contributor

@tchaton
As the order in which the hooks are executed could be confusing.

Contributor

I find the pre/dispatch/post confusing now :/

Contributor

Yes, we should think about the naming of these hooks. But more importantly, I think we can do a better job at formally defining what these hooks are supposed to do. Maybe another action item for 1.3 is to do a full pass over the plugins and improve all these docs. That would help everyone: 1. implementing plugins, 2. fixing them, 3. reviewing plugin PRs.

Contributor

n00b question: would this be easier if the precision plugin was owned by the training type plugin instead of the accelerator?

Contributor

Yes, I think that could make it easier to interleave these operations between one plugin and the other. Here in this PR we see that the precision plugin needs to configure the model before it is wrapped, and needs to overwrite the reference in the training plugin. This really breaks the contract that these plugins currently have with each other.

Contributor Author

Yes, we should definitely refactor this and move optimizers, lr_schedulers to the training_type_plugin.
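
A purely hypothetical sketch of the ownership change discussed in this thread (this is not how Lightning was structured at the time of this PR): if the training-type plugin owned the precision plugin, the device move, precision configuration, and wrapping could be interleaved in one place instead of the precision plugin overwriting references held by the accelerator.

class TrainingTypePlugin:  # hypothetical, for illustration only
    def __init__(self, precision_plugin):
        self.precision_plugin = precision_plugin
        self.model = None

    def setup(self, model, optimizers):
        model = model.cuda()                                                     # 1. device move
        model, optimizers = self.precision_plugin.configure(model, optimizers)   # 2. precision (e.g. Apex)
        self.model = self.wrap(model)                                            # 3. DDP or other wrapping
        return self.model, optimizers

    def wrap(self, model):
        return model  # a DDP plugin would return DistributedDataParallel(model)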

Contributor

@SeanNaren SeanNaren left a comment

Approved, but I would definitely like to see @justusschock's or @awaelchli's comments!

@@ -107,6 +107,11 @@ def pre_dispatch(self, trainer: 'pl.Trainer') -> None:
        self.setup_optimizers(trainer)
        self.precision_plugin.pre_dispatch()

    def dispatch(self, trainer: 'pl.Trainer') -> None:
        """Hook to do something before the training/evaluation/prediction starts."""
Contributor

I find the pre/dispatch/post confusing now :/

Comment on lines 46 to 54
def dispatch(self, trainer: "pl.Trainer") -> None:
if not self._connected:
accelerator = trainer.accelerator
model, optimizers = self.configure_apex(accelerator.lightning_module, accelerator.optimizers)
self.reinit_scheduler_properties(optimizers, accelerator.lr_schedulers)
self.model = model
self.optimizers = optimizers
self._connected = True
return super().dispatch(trainer)
Contributor

It's not specific to Apex. Any optimizer that defines state which isn't lazily instantiated needs to handle the device move?

Contributor Author

Hey @ananthsub, not sure I fully follow you :)

Member

@ananthsub I think this is handled by @awaelchli in #7277
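
To unpack the point above with a toy example (hypothetical optimizer, not code from this PR or from PyTorch): an optimizer that allocates its state eagerly at construction captures tensors on whatever device the parameters live on at that moment, so creating it before the device move leaves its state on the CPU.

import torch

class EagerMomentumSGD(torch.optim.Optimizer):
    def __init__(self, params, lr=0.01, momentum=0.9):
        super().__init__(params, dict(lr=lr, momentum=momentum))
        for group in self.param_groups:
            for p in group["params"]:
                # state allocated now, on whatever device p currently lives on
                self.state[p]["buf"] = torch.zeros_like(p)

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                buf = self.state[p]["buf"]
                buf.mul_(group["momentum"]).add_(p.grad)   # fails if buf is on CPU and grad on CUDA
                p.add_(buf, alpha=-group["lr"])

model = torch.nn.Linear(4, 1)
opt = EagerMomentumSGD(model.parameters())   # momentum buffers are created on the CPU here
model.cuda()                                 # parameters are moved in place...
# ...but opt.state[p]["buf"] stays on the CPU, so the first step() with CUDA
# gradients raises a device-mismatch error unless the optimizer handles the move.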

@tchaton
Contributor Author

tchaton commented Apr 29, 2021

Hey @ananthsub,

I will make another PR to rename them: on_predispatch, on_dispatch, etc.

Best,
T.C

Contributor

@awaelchli awaelchli left a comment

Cautious approval.
Please see my comments :))

@@ -107,6 +107,11 @@ def pre_dispatch(self, trainer: 'pl.Trainer') -> None:
        self.setup_optimizers(trainer)
        self.precision_plugin.pre_dispatch()

    def dispatch(self, trainer: 'pl.Trainer') -> None:
        """Hook to do something before the training/evaluation/prediction starts."""
Contributor

Yes, we should think about the naming of these hooks. But more importantly, I think we can do a better job at formally defining what these hooks are supposed to do. Maybe another action item for 1.3 is to do a full pass over the plugins and improve all these docs. That would help everyone: 1. implementing plugins, 2. fixing them, 3. reviewing plugin PRs.

Comment on lines +103 to +105
@pytest.mark.parametrize("amp_level", ['O2'])
def test_amp_apex_ddp_fit(amp_level, tmpdir):

Contributor

@awaelchli awaelchli Apr 29, 2021

We have another Apex test in
tests/models/test_amp.py::test_amp_with_apex; it uses 1 GPU.
@tchaton, your fix only applies to multi-GPU due to the dispatch, am I right? Single GPU seems to be OK.

Contributor Author

Yes.
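
For reference, a rough sketch of what a multi-GPU Apex test along these lines could look like. The BoringModel helper import and the exact Trainer arguments are assumptions based on the 1.3-era API; the test actually merged in this PR may differ.

import pytest
import torch
from pytorch_lightning import Trainer
from tests.helpers.boring_model import BoringModel  # assumed test helper


@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="requires 2 GPUs")
@pytest.mark.parametrize("amp_level", ["O2"])
def test_amp_apex_ddp_fit(amp_level, tmpdir):
    model = BoringModel()
    trainer = Trainer(
        default_root_dir=tmpdir,
        gpus=2,
        accelerator="ddp_spawn",
        precision=16,
        amp_backend="apex",
        amp_level=amp_level,
        fast_dev_run=True,
    )
    trainer.fit(model)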

Member

@justusschock justusschock left a comment

Also carefully approving. I get that this is necessary, but I don't like simply assigning optimiser and model to every class :)

@mergify mergify bot removed the has conflicts label Apr 30, 2021
@pep8speaks

pep8speaks commented Apr 30, 2021

Hello @tchaton! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-04-30 16:50:17 UTC

@mergify mergify bot removed the has conflicts label Apr 30, 2021
@tchaton tchaton added the admin Requires admin privileges to merge label Apr 30, 2021
@lexierule lexierule merged commit 16d6c98 into master Apr 30, 2021
@lexierule lexierule deleted the apex_resolve_bug branch April 30, 2021 17:16
Labels
admin (Requires admin privileges to merge), bug (Something isn't working), priority: 0 (High priority task)
Development

Successfully merging this pull request may close these issues.

Lightning + DDP + apex.amp fails to initialize
8 participants