Move tracking epoch end outputs logic to the EvaluationEpochLoop
#9261
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master    #9261     +/-   ##
========================================
+ Coverage      43%      88%      +45%
========================================
  Files         178      176        -2
  Lines       14856    14792       -64
========================================
+ Hits         6436    13035     +6599
+ Misses       8420     1757     -6663
@carmocca very n00b question: does the same change need to be applied to the training epoch/training batch loop as well? if not, i'm curious how this regression happened only for evaluation
No, it's already properly guarded:
I believe it was missed during the loops refactor PRs.
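The training loop's guard boils down to checking whether the user actually overrode the epoch-end hook. A minimal sketch of that check (the class and function names here are illustrative stand-ins, not Lightning's real API):

```python
class BaseModule:
    """Stand-in for LightningModule; only the hook we care about."""

    def training_epoch_end(self, outputs):
        pass


class MyModule(BaseModule):
    def training_epoch_end(self, outputs):
        return sum(outputs)


def is_overridden(method_name, instance, parent=BaseModule):
    # A hook "counts" only when the subclass replaces the parent's
    # implementation with its own.
    return getattr(type(instance), method_name) is not getattr(parent, method_name)
```

With this check, a module that leaves the hook untouched is detected as not overriding it, so the loop can skip tracking outputs for it.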
LGTM!
@@ -123,9 +124,12 @@ def advance(
     self.trainer.logger_connector.update_eval_step_metrics()

     # track epoch level outputs
-    self.outputs = self._track_output_for_epoch_end(self.outputs, output)
+    if self._should_track_batch_outputs_for_epoch_end():
+        output = recursive_detach(output, to_cpu=self.trainer.move_metrics_to_cpu)
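In plain terms, the change gates output tracking behind the hook check, so batch outputs that no epoch-end hook will consume are neither detached nor stored. A rough sketch of that control flow, where a dict copy stands in for recursive_detach and the names are illustrative:

```python
def advance_sketch(output, outputs, epoch_end_overridden):
    # Only pay the detach/copy and memory cost when the epoch-end hook
    # will actually consume the batch output.
    if epoch_end_overridden:
        detached = dict(output)  # stand-in for recursive_detach(output, ...)
        outputs.append(detached)
    return outputs
```

When the hook is not overridden, the outputs list stays empty and no references to batch outputs are retained.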
I found an error here. Not sure if it is intended, but when output is of type defaultdict (i.e. isinstance(output, defaultdict) == True), this causes a reproducible error with the following stack trace:
File "scripts/train.py", line 103, in fine_tune
trainer.fit(model, dataloaders["train"], dataloaders["dev"])
File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
self._run(model)
File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 922, in _run
self._dispatch()
File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _dispatch
self.accelerator.start_training(self)
File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
self.training_type_plugin.start_training(trainer)
File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
self._results = trainer.run_stage()
File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1000, in run_stage
return self._run_train()
File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_train
self._run_sanity_check(self.lightning_module)
File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1122, in _run_sanity_check
self._evaluation_loop.run()
File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
dl_outputs = self.epoch_loop.run(
File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 126, in advance
output = recursive_detach(output, to_cpu=self.trainer.move_metrics_to_cpu)
File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/utilities/memory.py", line 44, in recursive_detach
return apply_to_collection(in_dict, torch.Tensor, detach_and_move, to_cpu=to_cpu)
File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 109, in apply_to_collection
return elem_type(OrderedDict(out))
TypeError: first argument must be callable or None
Before this PR, this didn't cause an error because the _track_output_for_epoch_end function that got removed here did not call recursive_detach for defaultdict outputs.
Ultimately this error is caused by line 105 in the apply_to_collection function in utilities/apply_func.py:
https://github.com/PyTorchLightning/pytorch-lightning/blob/1686aab5506ed6f4fee8683ce6cca711e62b5ef0/pytorch_lightning/utilities/apply_func.py#L94-L105
For a defaultdict input, that line effectively does
defaultdict(OrderedDict([...]))
which raises the same error:
TypeError: first argument must be callable or None
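The failure mode is easy to reproduce standalone, without Lightning, since defaultdict's first positional argument must be the default_factory, not the data:

```python
from collections import OrderedDict, defaultdict

# apply_to_collection rebuilds a mapping as elem_type(OrderedDict(out)).
# For defaultdict, the first positional argument must be a callable
# (the default_factory), so passing an OrderedDict of data raises.
d = defaultdict(list, {"val_loss": [0.3]})
elem_type = type(d)  # defaultdict
try:
    elem_type(OrderedDict(d))
    raised = False
    message = ""
except TypeError as exc:
    raised = True
    message = str(exc)
```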
I'll make a separate issue out of this so that others can search for the same error.
Issue left at #10308
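For illustration, one way a mapping-rebuild helper could special-case defaultdict is sketched below; this is only a sketch of the idea, not necessarily the fix that landed in #10308:

```python
from collections import OrderedDict, defaultdict


def rebuild_mapping(data, fn):
    # Apply fn to every value, then reconstruct the same mapping type.
    out = OrderedDict((k, fn(v)) for k, v in data.items())
    if isinstance(data, defaultdict):
        # defaultdict needs its default_factory passed as the first argument.
        return defaultdict(data.default_factory, out)
    return type(data)(out)
```

A plain dict round-trips through the last branch unchanged, while a defaultdict keeps its default_factory intact.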
What does this PR do?
Fixes #8453, #8583
Does your PR introduce any breaking changes? If yes, please list them.
None