Fixes access to callback_metrics in ddp_spawn #7916

edgarriba · 2021-06-10T10:08:20Z

What does this PR do?

Fixes #7671
Fixes access to callback_metrics in ddp_spawn

TODO:

tests
docs

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

pep8speaks · 2021-06-10T10:08:24Z

Hello @edgarriba! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-06-17 13:37:47 UTC

codecov · 2021-06-10T10:09:55Z

Codecov Report

Merging #7916 (6a6ca3b) into master (b71aa55) will increase coverage by 0%.
The diff coverage is 94%.

@@           Coverage Diff            @@
##           master   #7916     +/-   ##
========================================
  Coverage      92%     92%             
========================================
  Files         207     211      +4     
  Lines       13375   14557   +1182     
========================================
+ Hits        12245   13347   +1102     
- Misses       1130    1210     +80

pytorch_lightning/plugins/training_type/ddp_spawn.py

pytorch_lightning/trainer/connectors/logger_connector/logger_connector_new.py

carmocca

Should we put the callback metrics directly in the queue instead?
We don't want users to have to use a different attribute depending on the accelerator.

How would this impact performance?

edgarriba · 2021-06-10T10:45:04Z

@carmocca My initial proposal, using the Optuna framework as an example entry point, is the following:

return trainer.spawn_callback_metrics["val_acc"]

However, since @tchaton proposes making it more generic, we could follow the approach below:

return trainer.spawn_extra_parameters["callback_metrics"]["val_acc"]

Open to an API discussion.

** This is the gist with the entry-point script: https://gist.github.com/edgarriba/af6247edb32586b19e740f17735ff055

edgarriba · 2021-06-17T08:04:13Z

@carmocca @awaelchli I believe your comments have been addressed.

awaelchli

great
I think just the changelog entries are missing now.
Adding 1.4 milestone

ananthsub · 2021-06-25T05:54:40Z

pytorch_lightning/core/lightning.py

+
+    def add_to_queue(self, queue: torch.multiprocessing.SimpleQueue) -> None:
+        """Appends the :attr:`trainer.callback_metrics` dictionary to the given queue.
+
+        To avoid issues with memory sharing, we cast the data to numpy.
+
+        Args:
+            queue: the instance of the queue to append the data.
+        """
+        callback_metrics: dict = apply_to_collection(
+            self.trainer.callback_metrics, torch.Tensor, lambda x: x.cpu().numpy()
+        )  # send as numpy to avoid issues with memory sharing
+        queue.put(callback_metrics)
+
+    def get_from_queue(self, queue: torch.multiprocessing.SimpleQueue) -> None:
+        """Retrieve the :attr:`trainer.callback_metrics` dictionary from the given queue.
+
+        To preserve consistency, we cast back the data to ``torch.Tensor``.
+
+        Args:
+            queue: the instance of the queue from where to get the data.
+        """
+        # NOTE: `add_to_queue` needs to be called before
+        callback_metrics: dict = queue.get()
+        self.trainer.callback_metrics.update(
+            apply_to_collection(callback_metrics, np.ndarray, lambda x: torch.tensor(x))
+        )


Is this the only option for populating these metrics? Why is this on the user interface of the LightningModule? What happens if someone overrides this? Is it meant to be overridden?

It feels like the LightningModule is being used as a go-between for different parts of the trainer, in particular because the training type plugin technically has no reference to the trainer.

Structurally, we are repeatedly reaching through the LightningModule to access the trainer in a very roundabout way. Another example: https://github.com/PyTorchLightning/pytorch-lightning/blob/55a90af7fc0805855684e93dbad669f5bbe76eee/pytorch_lightning/plugins/training_type/sharded.py#L42-L57

It feels backwards, and it also makes efforts like #7315 harder to work through when we keep relying on the trainer like this.

awaelchli · 2021-06-28T07:39:29Z

Do add_to_queue and get_from_queue need to be public? @ananthsub suggests making them protected.

carmocca · 2021-06-29T14:10:30Z

Do add_to_queue and get_from_queue need to be public?

The point was to let users add and get from these. See #7916 (comment) and the rest of the discussions in this PR
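A minimal sketch of what overriding the two hooks to carry an extra item could look like. It does not use Lightning: `ModuleSketch`, `ModuleWithExtras`, `best_threshold`, and `round_trip` are hypothetical names, and a standard-library `queue.SimpleQueue` stands in for `torch.multiprocessing.SimpleQueue` so that the example runs in a single process. The point it illustrates is that the queue is first-in, first-out, so `get_from_queue` has to read items in the same order `add_to_queue` wrote them.

```python
import queue
from typing import Any, Dict, Optional


class ModuleSketch:
    """Hypothetical stand-in for a LightningModule; only the two hooks are modeled."""

    def __init__(self) -> None:
        # Stands in for `trainer.callback_metrics` (assumption: plain dict of floats).
        self.callback_metrics: Dict[str, Any] = {}

    def add_to_queue(self, q: "queue.SimpleQueue") -> None:
        # Default behavior in this sketch: send a copy of the metrics.
        q.put(dict(self.callback_metrics))

    def get_from_queue(self, q: "queue.SimpleQueue") -> None:
        # Must run after `add_to_queue` has put the metrics on the queue.
        self.callback_metrics.update(q.get())


class ModuleWithExtras(ModuleSketch):
    """Overrides both hooks to carry one extra value across the queue."""

    def __init__(self) -> None:
        super().__init__()
        self.best_threshold: Optional[float] = None  # hypothetical extra state

    def add_to_queue(self, q: "queue.SimpleQueue") -> None:
        super().add_to_queue(q)      # metrics go first
        q.put(self.best_threshold)   # then the extra item

    def get_from_queue(self, q: "queue.SimpleQueue") -> None:
        super().get_from_queue(q)    # read in the same order as written
        self.best_threshold = q.get()


def round_trip() -> ModuleWithExtras:
    """Write on the 'worker' side, read on the 'main' side, return the main side."""
    q: "queue.SimpleQueue" = queue.SimpleQueue()

    worker_side = ModuleWithExtras()
    worker_side.callback_metrics["val_acc"] = 0.9
    worker_side.best_threshold = 0.5
    worker_side.add_to_queue(q)

    main_side = ModuleWithExtras()
    main_side.get_from_queue(q)
    return main_side
```

Calling `round_trip()` returns an object whose `callback_metrics` and `best_threshold` match what the worker side wrote. If an override only extended one of the two hooks, the reader would either leave an item on the queue or block waiting for one that was never sent.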

daniellepintz · 2021-08-25T22:32:59Z

pytorch_lightning/plugins/training_type/tpu_spawn.py

@@ -202,6 +202,7 @@ def transfer_distrib_spawn_state_on_fit_end(self, results):
                self.mp_queue.put(best_model_path)
                self.mp_queue.put(last_path)
                self.mp_queue.put(results)
+                self.lightning_module.add_to_queue(self.mp_queue)  # adds the `callback_metrics` to the queue

    def save(self, state_dict: Dict, path: str) -> None:


@edgarriba is there a reason you call add_to_queue in tpu_spawn but don't call get_from_queue?

Ah, never mind, I think it's just because tpu_spawn doesn't override post_dispatch.
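The observation above can be sketched with plain classes. This is not the Lightning source: `SpawnPluginSketch`, `TPUSpawnPluginSketch`, and `ModuleSketch` are hypothetical stand-ins, and the sketch assumes, as the comment implies, that the TPU plugin subclasses the spawn plugin and inherits its `post_dispatch`. The subclass overrides only the producer side, so the consumer side that calls `get_from_queue` comes from the parent.

```python
import queue


class ModuleSketch:
    """Hypothetical stand-in for a LightningModule with the two queue hooks."""

    def __init__(self):
        self.callback_metrics = {}

    def add_to_queue(self, q):
        q.put(dict(self.callback_metrics))

    def get_from_queue(self, q):
        self.callback_metrics.update(q.get())


class SpawnPluginSketch:
    """Hypothetical stand-in for the spawn plugin; method names mirror the diff."""

    def __init__(self, module, mp_queue):
        self.lightning_module = module
        self.mp_queue = mp_queue
        self.results = None

    def transfer_distrib_spawn_state_on_fit_end(self, results):
        # Worker side: push the results, then let the module add its metrics.
        self.mp_queue.put(results)
        self.lightning_module.add_to_queue(self.mp_queue)

    def post_dispatch(self):
        # Main side: read everything back in the order it was written.
        self.results = self.mp_queue.get()
        self.lightning_module.get_from_queue(self.mp_queue)


class TPUSpawnPluginSketch(SpawnPluginSketch):
    """Overrides only the producer; `post_dispatch` is inherited unchanged."""

    def transfer_distrib_spawn_state_on_fit_end(self, results):
        # A TPU-specific variant would differ here; the sketch keeps the same steps.
        self.mp_queue.put(results)
        self.lightning_module.add_to_queue(self.mp_queue)


def run_sketch():
    q = queue.SimpleQueue()  # stand-in for the multiprocessing queue

    worker_module = ModuleSketch()
    worker_module.callback_metrics["val_acc"] = 0.9
    TPUSpawnPluginSketch(worker_module, q).transfer_distrib_spawn_state_on_fit_end("done")

    main_module = ModuleSketch()
    main_plugin = TPUSpawnPluginSketch(main_module, q)
    main_plugin.post_dispatch()  # resolved on the parent class
    return main_plugin.results, main_module.callback_metrics
```

In this sketch `TPUSpawnPluginSketch.post_dispatch` is the very same function object as `SpawnPluginSketch.post_dispatch`, which is why a diff of the subclass shows an `add_to_queue` call but no matching `get_from_queue` call.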

daniellepintz · 2021-08-25T22:50:59Z

pytorch_lightning/core/lightning.py

+        """
+        # NOTE: `add_to_queue` needs to be called before
+        callback_metrics: dict = queue.get()
+        self.trainer.callback_metrics.update(


why do we have to update the callback metrics here?

carmocca · 2021-08-26

I'll answer for Edgar:

The purpose of this PR was to provide a mechanism for users to add items to consume from callbacks in the spawn environment.

Hence why we update callback metrics here. Callbacks read metrics off that dictionary.
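A small single-process sketch of why the update matters. It is an illustration under stated assumptions, not Lightning code: `simulated_spawn_fit` and `monitored_value` are hypothetical helpers, `copy.deepcopy` stands in for the process boundary (the worker only ever mutates its own copy of the dictionary), and `queue.SimpleQueue` stands in for the multiprocessing queue.

```python
import copy
import queue
from typing import Any, Dict, Optional


def simulated_spawn_fit(main_metrics: Dict[str, Any], transfer: bool) -> None:
    """Simulate a spawned fit where the worker logs `val_acc` on its own copy."""
    q: "queue.SimpleQueue" = queue.SimpleQueue()

    # Stand-in for the process boundary: the worker gets a copy, not the original.
    worker_metrics = copy.deepcopy(main_metrics)
    worker_metrics["val_acc"] = 0.9  # logged inside the worker

    if transfer:
        q.put(worker_metrics)          # worker side: send the metrics
        main_metrics.update(q.get())   # main side: merge them into the dict


def monitored_value(callback_metrics: Dict[str, Any], monitor: str = "val_acc") -> Optional[Any]:
    """Roughly what a monitoring callback needs: look up its key in the dict."""
    return callback_metrics.get(monitor)
```

Without the transfer step, `monitored_value` on the main-side dictionary returns `None` after the simulated fit; with it, the value logged in the worker is visible, which is the behavior a callback or an external caller such as a hyperparameter search would rely on.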

daniellepintz · 2021-08-26

@carmocca thanks so much!!

add spawn_callback_metrics

0567de9

edgarriba requested review from tchaton and carmocca June 10, 2021 10:08

tchaton reviewed Jun 10, 2021

pytorch_lightning/plugins/training_type/ddp_spawn.py (outdated)

pytorch_lightning/plugins/training_type/ddp_spawn.py (outdated)

pytorch_lightning/trainer/connectors/logger_connector/logger_connector_new.py (outdated)

edgarriba added 2 commits June 10, 2021 12:34

use apply_to_collection

e5fd47c

Merge branch 'master' into edgar/feat/spawn_args

d6d6c19

carmocca reviewed Jun 10, 2021

edgarriba force-pushed the edgar/feat/spawn_args branch from a2bb4ac to d6d6c19 Compare June 10, 2021 10:39

edgarriba added 3 commits June 10, 2021 13:29

make more generic spawn_extra_parameters

6e9715c

generalise a bit more mp_queue parameters

a1b3865

Merge branch 'master' into edgar/feat/spawn_args

9b5a97c

edgarriba force-pushed the edgar/feat/spawn_args branch from 9aee3ac to 9b5a97c Compare June 10, 2021 11:42

[pre-commit.ci] auto fixes from pre-commit.com hooks

c637998

for more information, see https://pre-commit.ci

edgarriba force-pushed the edgar/feat/spawn_args branch from 88dd15e to c637998 Compare June 10, 2021 12:40

edgarriba added 2 commits June 10, 2021 17:13

implement single and multi gpu tests

97fd769

fix typing

2eb0836

edgarriba marked this pull request as ready for review June 10, 2021 15:29

edgarriba requested review from awaelchli, Borda, justusschock, kaushikb11, SeanNaren and williamFalcon as code owners June 10, 2021 15:29

mergify bot added the has conflicts label Jun 10, 2021

edgarriba added the distributed, feature, and metrics labels Jun 10, 2021

pre-commit-ci bot and others added 2 commits June 16, 2021 17:57

[pre-commit.ci] auto fixes from pre-commit.com hooks

b016585

for more information, see https://pre-commit.ci

undo typing

0cd331f

edgarriba force-pushed the edgar/feat/spawn_args branch from 1671e57 to 0cd331f Compare June 16, 2021 18:14

add small test for add/get queue

90ff74e

edgarriba force-pushed the edgar/feat/spawn_args branch from 82bcdbf to 90ff74e Compare June 16, 2021 18:43

edgarriba added 2 commits June 16, 2021 21:03

update ddp test for 2 gpus

5c06cf2

add missing docs

8ae14e6

edgarriba requested a review from edenlightning as a code owner June 17, 2021 08:00

[pre-commit.ci] auto fixes from pre-commit.com hooks

4aa8d9a

for more information, see https://pre-commit.ci

awaelchli approved these changes Jun 17, 2021

awaelchli added this to the v1.4 milestone Jun 17, 2021

edgarriba changed the title from "add spawn_callback_metrics" to "Fixes access to callback_metrics in ddp_spawn" Jun 17, 2021

add fix to changelog

6c219e1

edgarriba force-pushed the edgar/feat/spawn_args branch from 6c8e0a9 to 6c219e1 Compare June 17, 2021 11:20

[pre-commit.ci] auto fixes from pre-commit.com hooks

c441b52

for more information, see https://pre-commit.ci

justusschock approved these changes Jun 17, 2021

carmocca approved these changes Jun 17, 2021

pytorch_lightning/core/lightning.py (outdated)

Update pytorch_lightning/core/lightning.py

6a6ca3b

Co-authored-by: Carlos Mocholí <[email protected]>

carmocca merged commit b378806 into master Jun 23, 2021

carmocca deleted the edgar/feat/spawn_args branch June 23, 2021 01:19

ananthsub reviewed Jun 25, 2021

tohmae mentioned this pull request Jul 27, 2021

[WIP] Support PyTorch-lightning DDP training optuna/optuna#2824

Closed

tchaton mentioned this pull request Aug 18, 2021

Deprecate add_to_queue / get_from_queue #8940

Closed

daniellepintz reviewed Aug 25, 2021

mergify bot added the ready label Aug 25, 2021

daniellepintz reviewed Aug 25, 2021

tohmae mentioned this pull request Sep 23, 2021

Support PyTorch-lightning DDP training optuna/optuna#2849

Merged
