[bugfix] Resolve memory not logged when missing metrics #8174

tchaton · 2021-06-28T12:18:38Z

What does this PR do?

This PR adds gpus_metrics to ResultCollection and filters out the non-requested GPUs when logging.

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

codecov · 2021-06-28T12:20:08Z

Codecov Report

Merging #8174 (319ad41) into master (2a372e3) will decrease coverage by 5%.
The diff coverage is 62%.

@@           Coverage Diff           @@
##           master   #8174    +/-   ##
=======================================
- Coverage      93%     88%    -5%     
=======================================
  Files         211     211            
  Lines       13440   13450    +10     
=======================================
- Hits        12474   11837   -637     
- Misses        966    1613   +647

pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py

ethanwharris

LGTM 😃

ananthsub · 2021-06-28T20:18:45Z

pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py

+        for key, mem in self.gpus_metrics.items():
+            gpu_id = int(key.split('/')[0].split(':')[1])
+            if gpu_id in self.trainer.accelerator_connector.parallel_device_ids:
+                self.trainer.lightning_module.log(key, mem, prog_bar=False, logger=True, on_step=True, on_epoch=False)


Since we're already in the trainer, why do we have to log through the lightning module's log?
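
For readers following the diff above, here is a minimal standalone sketch of the key parsing and filtering it performs. It assumes keys in the conventional `gpu_id: {gpu_id}/memory.used (MB)` format mentioned later in this thread. The helper names `parse_gpu_id` and `filter_requested` and the sample values are hypothetical and not part of the PR; the real code reads the requested ids from `self.trainer.accelerator_connector.parallel_device_ids`.

```python
from typing import Dict, List


def parse_gpu_id(key: str) -> int:
    """Extract the GPU index from a key such as 'gpu_id: 0/memory.used (MB)'."""
    device_part = key.split('/')[0]        # 'gpu_id: 0'
    return int(device_part.split(':')[1])  # int(' 0') tolerates the leading space


def filter_requested(metrics: Dict[str, float], requested_ids: List[int]) -> Dict[str, float]:
    """Keep only the metrics whose GPU index is in requested_ids (hypothetical helper)."""
    return {k: v for k, v in metrics.items() if parse_gpu_id(k) in requested_ids}


# Made-up sample values, for illustration only.
metrics = {
    "gpu_id: 0/memory.used (MB)": 1024.0,
    "gpu_id: 1/memory.used (MB)": 2048.0,
}
print(filter_requested(metrics, requested_ids=[1]))
```

The sketch only covers the filtering step; whether the filtered values should go through the module's `log` or straight to the logger is the open question in this comment.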

carmocca · 2021-06-29T14:21:56Z

pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py

+    @property
+    def gpus_metrics(self) -> Dict[str, str]:
+        if self.trainer._device_type == DeviceType.GPU and self.log_gpu_memory:
+            mem_map = memory.get_memory_profile(self.log_gpu_memory)
+            self._gpus_metrics.update(mem_map)
+        return self._gpus_metrics
+


Why did this PR need to add _gpus_metrics? It doesn't seem related to the linked issue.

This means that GPU metrics are now duplicated in this dictionary and in the logged metrics.

Also, it only gets filled when self.log_gpu_memory is set, so it can't be used without the flag anyway.

thomasw21 · 2021-08-20

I think this broke the log_gpu_memory="min_max" option in Trainer.fit. It looks related to #9010.

Essentially, memory.get_memory_profile adds two keys, min_gpu_mem and max_gpu_mem, that are not in the conventional format (keys typically look like f"gpu_id: {gpu_id}/memory.used (MB)").

I see #9013 fixed it.
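
To make the failure mode concrete, here is a small sketch, not taken from #9013, of why the key parsing merged in this PR would choke on `min_gpu_mem` and `max_gpu_mem`. `strict_gpu_id` mirrors the expression from the diff; `tolerant_gpu_id` is a hypothetical defensive variant shown only for comparison, and it is not claimed to be how the fix was implemented.

```python
from typing import Optional


def strict_gpu_id(key: str) -> int:
    """Same expression as in the merged diff; assumes a 'gpu_id: N/...' key."""
    return int(key.split('/')[0].split(':')[1])


def tolerant_gpu_id(key: str) -> Optional[int]:
    """Hypothetical variant: return None for keys without a 'name: N' prefix."""
    device_part = key.split('/')[0]
    _, sep, index = device_part.partition(':')
    if not sep:
        return None  # e.g. 'min_gpu_mem' has no ':' separator
    try:
        return int(index)
    except ValueError:
        return None  # the prefix is present but the index is not an integer


for key in ("gpu_id: 0/memory.used (MB)", "min_gpu_mem", "max_gpu_mem"):
    try:
        print(key, "->", strict_gpu_id(key))
    except IndexError:
        # 'min_gpu_mem'.split(':') has a single element, so [1] is out of range
        print(key, "-> IndexError in strict parsing; tolerant gives", tolerant_gpu_id(key))
```

With the strict expression, the two aggregate keys raise `IndexError` because they contain no `:` to split on, which is consistent with the breakage reported for `log_gpu_memory="min_max"`.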

tchaton added 4 commits June 28, 2021 12:37

wip

4549007

wip

3fc2d94

resolve gpu memory logging issue

ead9f88

Merge branch 'bugfix/8159_log_gpu_memory_on_step' of https://github.c…

65d4af8

…om/PyTorchLightning/pytorch-lightning into bugfix/8159_log_gpu_memory_on_step

tchaton added the bug (Something isn't working) and logging (Related to the `LoggerConnector` and `log()`) labels Jun 28, 2021

tchaton added this to the v1.3.x milestone Jun 28, 2021

tchaton self-assigned this Jun 28, 2021

update changelog

c6e40e8

tchaton marked this pull request as ready for review June 28, 2021 12:20

tchaton requested review from awaelchli, Borda, carmocca, justusschock, kaushikb11, SeanNaren and williamFalcon as code owners June 28, 2021 12:20

remove change

319ad41

justusschock approved these changes Jun 28, 2021

ethanwharris approved these changes Jun 28, 2021

justusschock enabled auto-merge (squash) June 28, 2021 13:34

awaelchli approved these changes Jun 28, 2021

lexierule disabled auto-merge June 28, 2021 13:39

lexierule merged commit c4492ad into master Jun 28, 2021

lexierule deleted the bugfix/8159_log_gpu_memory_on_step branch June 28, 2021 13:39

ananthsub reviewed Jun 28, 2021

carmocca reviewed Jun 29, 2021

carmocca modified the milestones: v1.3.x, v1.4 Jun 29, 2021

mergify bot added the ready (PRs ready to be merged) label Aug 20, 2021

This was referenced Aug 20, 2021

Fix logging with log_gpu_memory='min_max' #9013

Merged

Revamp Device Stats Logging #9032

Closed

thomasw21 commented Aug 20, 2021
