training_epoch_end log output gets combined with next epoch training #2455
Comments
Hi! Thanks for your contribution, great first issue!
An update to the situation: I think I found the cause of the error. It seems that the And then for But it wouldn't increment as we progress to the next epoch This triggered this piece of code from
My suggested solution is to add
Let me know if you foresee any issues or problems with this solution. If you would like me to submit a pull request, I would be happy to do so.
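For illustration only (the exact code the comment refers to is not shown in this thread), a fix along these lines would bump the global step once when the epoch-end metrics are logged, so they land on their own step instead of colliding with the first batch of the next epoch. The function and attribute names below are assumptions, not the actual pytorch-lightning internals:

```python
# Hypothetical sketch, not the actual pytorch-lightning source.
# Assumes a trainer with a `global_step` counter and a logger exposing
# `log_metrics(metrics, step=...)` (the standard logger interface).
def log_epoch_end_metrics(trainer, epoch_end_metrics):
    # Advance the step so the epoch-end metrics are logged on their own
    # step and are not aggregated with the next epoch's first batch.
    trainer.global_step += 1
    trainer.logger.log_metrics(epoch_end_metrics, step=trainer.global_step)
```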
@ameliatqy please submit a PR. Good catch!
I don't think this is correct. Now it looks like we have a step update on the last batch and additionally in the epoch end call, meaning that after n epochs, global step > n * num_batches_per_epoch.
I made this experiment with the version before and after the change.
Expected: all plots end at step 100 = 10 epochs * 10 batches = global step. Neither the version before nor the current one works as expected.
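As a rough way to run the check described above (the exact experiment script is not part of this thread, so the snippet below is an assumed sketch; the Trainer argument names are also an assumption for the 0.8.x-era API), one can compare the final global step against epochs times batches:

```python
# Assumed sketch of the step-count check: after training 10 epochs of
# 10 batches each, the expectation stated above is global_step == 100.
from pytorch_lightning import Trainer

model = ...  # any LightningModule with a train dataloader (placeholder)

trainer = Trainer(max_epochs=10, limit_train_batches=10)
trainer.fit(model)

expected_steps = 10 * 10
print("global_step:", trainer.global_step, "expected:", expected_steps)
```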
That's a good point. I tried to look into the problem with the previous version and even encountered another error. I will continue working on a solution for this.
Okay, here is my second shot at a solution for this issue. To keep the steps consistent as addressed by @awaelchli, I decided to combine the metrics for the last batch, the training_epoch_end metrics, and the validation_epoch_end metrics. So I did two things:
And a line of code to increment the global step by one at the very end of
Which _reduce_agg_metrics will transform into:
To do this, I just edited
I guess for this solution to work, you have to make sure that your keys for the metrics of the last batch, the training_epoch_end metrics, and the validation_epoch_end metrics are all different and unique (which I believe is already the standard case?). You can see the steps are fixed. I ran 10 batches for 5 epochs. I decided to submit a draft PR (#2475) because it is easier to view the changes that way, as I made changes in multiple areas. Let me know if you see any issues with it.
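For illustration (the diff in PR #2475 itself is not reproduced here), the merge only works cleanly when the three metric dictionaries use disjoint keys; a minimal sketch of that combination, with made-up values, might look like:

```python
# Made-up metric dicts for the last training batch, training_epoch_end,
# and validation_epoch_end; their keys must be unique for the merge to work.
last_batch_metrics = {"loss": 0.48756}
train_epoch_metrics = {"train_loss": 0.3616, "epoch": 0}
val_epoch_metrics = {"val_loss": 0.4021}

# Combine everything into one dict so it can be logged under a single
# global step, keeping the step count at num_epochs * num_batches_per_epoch.
combined = {**last_batch_metrics, **train_epoch_metrics, **val_epoch_metrics}
print(combined)
```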
* Fixes #2455
* added early stop tpu test
🐛 Bug
So, I put a training_epoch_end function in my LightningModule and have it return this dictionary:
{'log': {'train_loss': tensor(0.3616, device='cuda:0'), 'epoch': 0}}
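For context, a training_epoch_end of the kind described (the exact body is not included in the issue, so this is an assumed sketch of a method on the LightningModule) might look roughly like:

```python
import torch

def training_epoch_end(self, outputs):
    # Assumed sketch: average the per-batch losses returned by training_step
    # and log them once per epoch (0.8.x-style dict return with a 'log' key).
    train_loss = torch.stack([out["loss"] for out in outputs]).mean()
    return {"log": {"train_loss": train_loss, "epoch": self.current_epoch}}
```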
I checked the run_training_epoch_end function in the PyTorch Lightning library, and it looks like it is working normally, as log_epoch_metrics shows the 'log' part of the dictionary produced by the training_epoch_end function: {'train_loss': tensor(0.3616, device='cuda:0'), 'epoch': 0}
So, they send it off to the logger. But there is a problem: it is trying to combine the dictionary above with the results from the training step of the next epoch. When I check the variable self._metrics_to_agg, I get the following result. Of course, it is impossible to combine these dictionaries as they have different keys. I guess the main problem is that the code is combining the log results of the run_training_epoch_end function with the results of the next training batch: [{'train_loss': tensor(0.3616, device='cuda:0'), 'epoch': 0}, {'loss': 0.48756, 'input': ..., 'ouput': ...}]
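To make the mismatch concrete (the helper below is a stand-in for key-wise aggregation, not the actual pytorch-lightning code, and the values are abbreviated), averaging those two dicts key by key fails because they do not share keys:

```python
import numpy as np

metrics_to_agg = [
    {"train_loss": 0.3616, "epoch": 0},  # logged by training_epoch_end
    {"loss": 0.48756},                   # first batch of the next epoch
]

# Stand-in for a key-wise aggregation step: average each key across the
# collected dicts. This raises KeyError because the keys differ.
aggregated = {
    key: np.mean([m[key] for m in metrics_to_agg])
    for key in metrics_to_agg[0]
}
```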
Any ideas on how to solve this problem? I would appreciate your help! Here is the whole error stack:
To Reproduce
Steps to reproduce the behavior: return a 'log' dictionary from training_epoch_end whose keys differ from the output of training_step ({'train_loss': tensor(0.3616, device='cuda:0'), 'epoch': 0} and {'loss': 0.48756, 'input': ..., 'ouput': ...} are different formats as they don't share the same keys).
Code sample
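The original code sample is not included in the extracted issue; the snippet below is a guessed minimal reproduction based on the dictionaries quoted above (the model, data, and hyperparameters are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ReproModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch[0]).sum()
        return {"loss": loss}

    def training_epoch_end(self, outputs):
        train_loss = torch.stack([o["loss"] for o in outputs]).mean()
        # Returning a 'log' dict whose keys differ from training_step's output
        # is what triggers the aggregation error described above.
        return {"log": {"train_loss": train_loss, "epoch": self.current_epoch}}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)


if __name__ == "__main__":
    trainer = pl.Trainer(max_epochs=2)
    trainer.fit(ReproModel())
```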
Expected behavior
I thought I would be able to run the training_epoch_end function without its output being combined with the training samples of the next epoch. I expected no error, just like when running validation_epoch_end.
Environment
- GPU:
- Tesla V100-SXM2-16GB
- available: True
- version: 10.2
- numpy: 1.18.5
- pyTorch_debug: False
- pyTorch_version: 1.5.0
- pytorch-lightning: 0.8.4
- tensorboard: 2.2.2
- tqdm: 4.46.1
- OS: Linux
- architecture:
- 64bit
-
- processor: x86_64
- python: 3.7.7
- version: #113-Ubuntu SMP Wed Jan 29 14:54:54 UTC 2020