
Move tracking epoch end outputs logic to the EvaluationEpochLoop #9261

Merged
merged 7 commits into master from bugfix/eval-loop-outputs on Sep 3, 2021

Conversation

@carmocca (Contributor) commented Sep 1, 2021

What does this PR do?

Fixes #8453, #8583

Does your PR introduce any breaking changes? If yes, please list them.

None

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • [n/a] Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

@carmocca carmocca added the bug Something isn't working label Sep 1, 2021
@carmocca carmocca added this to the v1.4.x milestone Sep 1, 2021
@carmocca carmocca self-assigned this Sep 1, 2021
@carmocca carmocca marked this pull request as ready for review September 1, 2021 23:20
codecov bot commented Sep 1, 2021

Codecov Report

Merging #9261 (bccdd50) into master (ddb4dc2) will increase coverage by 45%.
The diff coverage is 100%.

❗ Current head bccdd50 differs from pull request most recent head 5ced423. Consider uploading reports for the commit 5ced423 to get more accurate results

@@           Coverage Diff            @@
##           master   #9261     +/-   ##
========================================
+ Coverage      43%     88%    +45%     
========================================
  Files         178     176      -2     
  Lines       14856   14792     -64     
========================================
+ Hits         6436   13035   +6599     
+ Misses       8420    1757   -6663     

@carmocca carmocca force-pushed the bugfix/eval-loop-outputs branch from bccdd50 to b494206 Compare September 2, 2021 00:01
@mergify mergify bot added the has conflicts label Sep 2, 2021
@mergify mergify bot removed the has conflicts label Sep 2, 2021
@mergify mergify bot added the ready PRs ready to be merged label Sep 2, 2021
@ananthsub (Contributor) commented Sep 2, 2021

@carmocca very n00b question: does the same change need to be applied to the training epoch/training batch loop as well? If not, I'm curious how this regression happened only for evaluation.

@carmocca (Contributor, Author) commented Sep 3, 2021

> does the same change need to be applied to the training epoch/training batch loop as well?

No, it's already properly guarded:

https://github.com/PyTorchLightning/pytorch-lightning/blob/ddb4dc26590d187d6569560faeb99721a40f44b9/pytorch_lightning/loops/epoch/training_epoch_loop.py#L272-L274

> I'm curious how this regression happened only for evaluation

I believe it was missed during the loops refactor PRs
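The training-loop guard linked above ultimately rests on hook-override detection: batch outputs are only retained when the user's module actually overrides the epoch-end hook. A minimal, self-contained sketch of that idea (the class and helper names here are illustrative, not Lightning's actual API):

```python
class BaseModule:
    """Stand-in for LightningModule's default no-op epoch-end hook."""

    def training_epoch_end(self, outputs):
        pass  # default: do nothing, so outputs need not be kept


def is_overridden(method_name: str, instance, parent=BaseModule) -> bool:
    """Return True if instance's class provides its own method_name."""
    return getattr(type(instance), method_name) is not getattr(parent, method_name)


class MyModule(BaseModule):
    def training_epoch_end(self, outputs):
        print(f"got {len(outputs)} batch outputs")


assert is_overridden("training_epoch_end", MyModule())        # keep outputs
assert not is_overridden("training_epoch_end", BaseModule())  # safe to drop them
```

Guarding on such a check is what prevents the leak: if the hook is not overridden, there is no reason to accumulate every batch's output for the whole epoch.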

@carmocca carmocca enabled auto-merge (squash) September 3, 2021 00:10
@tchaton (Contributor) left a comment

LGTM!

@carmocca carmocca merged commit f745aa9 into master Sep 3, 2021
@@ -123,9 +124,12 @@ def advance(
         self.trainer.logger_connector.update_eval_step_metrics()

         # track epoch level outputs
-        self.outputs = self._track_output_for_epoch_end(self.outputs, output)
+        if self._should_track_batch_outputs_for_epoch_end():
+            output = recursive_detach(output, to_cpu=self.trainer.move_metrics_to_cpu)
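The recursive_detach call in the added lines walks a nested output collection and detaches every tensor leaf (optionally moving it to CPU). A torch-free sketch of that traversal pattern, with a generic leaf function standing in for detach-and-move (names hypothetical, not Lightning's implementation):

```python
def recursive_apply(data, fn):
    """Apply fn to every leaf of a nested dict/list/tuple structure.

    In the real loop, fn would be roughly
    lambda t: t.detach().cpu() applied to torch.Tensor leaves.
    """
    if isinstance(data, dict):
        # Rebuilding with a plain dict sidesteps the defaultdict
        # reconstruction pitfall raised in the review comments below.
        return {k: recursive_apply(v, fn) for k, v in data.items()}
    if isinstance(data, (list, tuple)):
        return type(data)(recursive_apply(v, fn) for v in data)
    return fn(data)


result = recursive_apply({"a": [1, 2], "b": {"c": 3}}, lambda x: x * 2)
assert result == {"a": [2, 4], "b": {"c": 6}}
```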
@jshin49 commented Nov 2, 2021

I found an error here. Not sure if it is intended, but when output is a defaultdict (isinstance(output, defaultdict) == True), this causes a reproducible error with the following stack trace:

  File "scripts/train.py", line 103, in fine_tune
    trainer.fit(model, dataloaders["train"], dataloaders["dev"])
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
    self._run(model)
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 922, in _run
    self._dispatch()
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _dispatch
    self.accelerator.start_training(self)
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1000, in run_stage
    return self._run_train()
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_train
    self._run_sanity_check(self.lightning_module)
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1122, in _run_sanity_check
    self._evaluation_loop.run()
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 126, in advance
    output = recursive_detach(output, to_cpu=self.trainer.move_metrics_to_cpu)
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/utilities/memory.py", line 44, in recursive_detach
    return apply_to_collection(in_dict, torch.Tensor, detach_and_move, to_cpu=to_cpu)
  File "/root/.local/share/virtualenvs/zero/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 109, in apply_to_collection
    return elem_type(OrderedDict(out))
TypeError: first argument must be callable or None

Before this PR it did not cause an error, because the _track_output_for_epoch_end function that got removed in this PR did not call recursive_detach for defaultdict outputs.

@jshin49 commented Nov 2, 2021

Ultimately this error is caused by line 105 of the apply_to_collection function in utilities/apply_func.py:
https://github.com/PyTorchLightning/pytorch-lightning/blob/1686aab5506ed6f4fee8683ce6cca711e62b5ef0/pytorch_lightning/utilities/apply_func.py#L94-L105

For a defaultdict input, the reconstruction elem_type(OrderedDict(out)) on that line effectively evaluates to

defaultdict(OrderedDict([...]))

which raises the same error:

TypeError: first argument must be callable or None
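The failure is reproducible with the standard library alone: defaultdict's first positional argument must be a default factory (a callable) or None, never an initial mapping:

```python
from collections import OrderedDict, defaultdict

try:
    # what elem_type(OrderedDict(out)) amounts to when elem_type is defaultdict
    defaultdict(OrderedDict([("a", 1)]))
except TypeError as exc:
    print(exc)  # first argument must be callable or None

# A correct construction passes the factory first, then the mapping:
d = defaultdict(list, OrderedDict([("a", 1)]))
```

So any generic collection-rebuilding helper needs a special case for defaultdict (preserving its default_factory) rather than calling the type directly on an OrderedDict.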

I'll make a separate issue out of this so that others can find it by searching.

Issue filed at #10308.

Labels
bug Something isn't working ready PRs ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

evaluation_loop memory leak
5 participants