
Move tracking epoch end outputs logic to the EvaluationEpochLoop #9261

Merged
merged 7 commits into master from bugfix/eval-loop-outputs on Sep 3, 2021

Conversation

@carmocca (Contributor) commented Sep 1, 2021

What does this PR do?

Fixes #8453, #8583

Does your PR introduce any breaking changes? If yes, please list them.

None

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • [n/a] Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

@carmocca carmocca added the bug Something isn't working label Sep 1, 2021
@carmocca carmocca added this to the v1.4.x milestone Sep 1, 2021
@carmocca carmocca self-assigned this Sep 1, 2021
@carmocca carmocca marked this pull request as ready for review September 1, 2021 23:20
codecov bot commented Sep 1, 2021

Codecov Report

Merging #9261 (bccdd50) into master (ddb4dc2) will increase coverage by 45%.
The diff coverage is 100%.

❗ Current head bccdd50 differs from pull request most recent head 5ced423. Consider uploading reports for the commit 5ced423 to get more accurate results

@@           Coverage Diff            @@
##           master   #9261     +/-   ##
========================================
+ Coverage      43%     88%    +45%     
========================================
  Files         178     176      -2     
  Lines       14856   14792     -64     
========================================
+ Hits         6436   13035   +6599     
+ Misses       8420    1757   -6663     

@carmocca carmocca force-pushed the bugfix/eval-loop-outputs branch from bccdd50 to b494206 Compare September 2, 2021 00:01
@mergify mergify bot added the has conflicts label Sep 2, 2021
@mergify mergify bot removed the has conflicts label Sep 2, 2021
@mergify mergify bot added the ready PRs ready to be merged label Sep 2, 2021
@ananthsub (Contributor) commented Sep 2, 2021

@carmocca very n00b question: does the same change need to be applied to the training epoch/training batch loop as well? If not, I'm curious how this regression happened only for evaluation.

@carmocca (Contributor, Author) commented Sep 3, 2021

> does the same change need to be applied to the training epoch/training batch loop as well?

No, it's already properly guarded:

https://github.com/PyTorchLightning/pytorch-lightning/blob/ddb4dc26590d187d6569560faeb99721a40f44b9/pytorch_lightning/loops/epoch/training_epoch_loop.py#L272-L274

> I'm curious how this regression happened only for evaluation

I believe it was missed during the loops refactor PRs
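The training-loop guard linked above ultimately rests on hook-override detection: batch outputs are only retained when the user's module actually overrides the epoch-end hook. A minimal, self-contained sketch of that idea (the class and helper names here are illustrative, not Lightning's actual API):

```python
class BaseModule:
    """Stand-in for LightningModule's default no-op epoch-end hook."""

    def training_epoch_end(self, outputs):
        pass  # default: do nothing, so outputs need not be kept


def is_overridden(method_name: str, instance, parent=BaseModule) -> bool:
    """Return True if instance's class provides its own method_name."""
    return getattr(type(instance), method_name) is not getattr(parent, method_name)


class MyModule(BaseModule):
    def training_epoch_end(self, outputs):
        print(f"got {len(outputs)} batch outputs")


assert is_overridden("training_epoch_end", MyModule())        # keep outputs
assert not is_overridden("training_epoch_end", BaseModule())  # safe to drop them
```

Guarding on such a check is what prevents the leak: if the hook is not overridden, there is no reason to accumulate every batch's output for the whole epoch.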

@carmocca carmocca enabled auto-merge (squash) September 3, 2021 00:10
@tchaton (Contributor) left a comment

LGTM!

@carmocca carmocca merged commit f745aa9 into master Sep 3, 2021
@@ -123,9 +124,12 @@ def advance(
         self.trainer.logger_connector.update_eval_step_metrics()

         # track epoch level outputs
-        self.outputs = self._track_output_for_epoch_end(self.outputs, output)
+        if self._should_track_batch_outputs_for_epoch_end():
+            output = recursive_detach(output, to_cpu=self.trainer.move_metrics_to_cpu)
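The recursive_detach call in the added lines walks a nested output collection and detaches every tensor leaf (optionally moving it to CPU). A torch-free sketch of that traversal pattern, with a generic leaf function standing in for detach-and-move (names hypothetical, not Lightning's implementation):

```python
def recursive_apply(data, fn):
    """Apply fn to every leaf of a nested dict/list/tuple structure.

    In the real loop, fn would be roughly
    lambda t: t.detach().cpu() applied to torch.Tensor leaves.
    """
    if isinstance(data, dict):
        # Rebuilding with a plain dict sidesteps the defaultdict
        # reconstruction pitfall raised in the review comments below.
        return {k: recursive_apply(v, fn) for k, v in data.items()}
    if isinstance(data, (list, tuple)):
        return type(data)(recursive_apply(v, fn) for v in data)
    return fn(data)


result = recursive_apply({"a": [1, 2], "b": {"c": 3}}, lambda x: x * 2)
assert result == {"a": [2, 4], "b": {"c": 6}}
```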
@jshin49 commented Nov 2, 2021

I found an error here. Not sure if it is intended, but when output is a defaultdict (isinstance(output, defaultdict) == True), this causes a reproducible error with the following stack trace:

  File "scripts/train.py", line 103, in fine_tune
    trainer.fit(model, dataloaders["train"], dataloaders["dev"])
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
    self._run(model)
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 922, in _run
    self._dispatch()
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _dispatch
    self.accelerator.start_training(self)
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1000, in run_stage
    return self._run_train()
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_train
    self._run_sanity_check(self.lightning_module)
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1122, in _run_sanity_check
    self._evaluation_loop.run()
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 126, in advance
    output = recursive_detach(output, to_cpu=self.trainer.move_metrics_to_cpu)
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/utilities/memory.py", line 44, in recursive_detach
    return apply_to_collection(in_dict, torch.Tensor, detach_and_move, to_cpu=to_cpu)
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 109, in apply_to_collection
    return elem_type(OrderedDict(out))
TypeError: first argument must be callable or None

Before this PR it did not cause an error, because the _track_output_for_epoch_end function that got removed in this PR did not call recursive_detach for defaultdict outputs.

@jshin49 commented Nov 2, 2021

Ultimately this error is caused by line 105 of the apply_to_collection function in utilities/apply_func.py:
https://github.com/PyTorchLightning/pytorch-lightning/blob/1686aab5506ed6f4fee8683ce6cca711e62b5ef0/pytorch_lightning/utilities/apply_func.py#L94-L105

For a defaultdict input, the reconstruction elem_type(OrderedDict(out)) on that line effectively evaluates to

defaultdict(OrderedDict([...]))

which raises the same error:

TypeError: first argument must be callable or None
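The failure is reproducible with the standard library alone: defaultdict's first positional argument must be a default factory (a callable) or None, never an initial mapping:

```python
from collections import OrderedDict, defaultdict

try:
    # what elem_type(OrderedDict(out)) amounts to when elem_type is defaultdict
    defaultdict(OrderedDict([("a", 1)]))
except TypeError as exc:
    print(exc)  # first argument must be callable or None

# A correct construction passes the factory first, then the mapping:
d = defaultdict(list, OrderedDict([("a", 1)]))
```

So any generic collection-rebuilding helper needs a special case for defaultdict (preserving its default_factory) rather than calling the type directly on an OrderedDict.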

I'll make a separate issue out of this so that others can find it by searching.

Issue filed at #10308.

Labels
bug Something isn't working ready PRs ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

evaluation_loop memory leak
5 participants