inconsistent step counting for train and val after resuming from checkpoint #5547
Hi! Thanks for your contribution, great first issue!
Was it solved here? #5050
Also, perhaps at the same time you may need this feature #5351 to avoid dropping logs during validation :)
Hi @awaelchli, unfortunately, no and no.
Right, the first PR makes sure we start logging after the last logged step from the resumed run, but it assumed we would start logging at 0 when using a new Trainer (even when reloading a model).
@AsaphLightricks I'm curious if you were able to test this change to see if it would work in your case?
@borisdayma, I just tested it, and it solves the problem. I commented out … With this change, I was able to resume a run from a checkpoint, and the wandb logger works just fine (not dropping logs).
I cannot say 100% for sure (I will test later), but if I remember correctly, `global_step` is resumed (so it does not restart at 0) but `total_batch_idx` is reset. It is also not the same as `batch_idx` in `training_step`.
Hi, I'm just checking whether someone from the PL team knows why the step is different, or if we can just make the change proposed by Asaph.
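For reference, a minimal sketch of what the proposed change amounts to, as I read it from this thread: during training, pass `self.trainer.global_step` to the logger instead of `batch_idx`, so the step does not restart at 0 after resuming. This is not the actual `logger_connector.py` code, and the helper name below is hypothetical.

```python
# Hypothetical sketch only -- not the actual Lightning logger_connector code.
# The thread describes training-time logging passing `batch_idx` as the step
# while validation passes `trainer.global_step`; the proposal is to use
# `global_step` in both places so the step keeps increasing across resumes.

def log_train_step_metrics(trainer, metrics, batch_idx):  # hypothetical helper
    # step = batch_idx            # restarts at 0 after resuming from a checkpoint
    step = trainer.global_step    # proposed: monotonic across resumed runs
    trainer.logger.log_metrics(metrics, step=step)
```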
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
No, we cannot implement the change proposed by @AsaphLightricks.
https://github.com/PyTorchLightning/pytorch-lightning/blob/9ebbfece5e2c56bb5300cfffafb129e399492469/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py#L187-L193
I have an issue where, if I resume a run from a checkpoint, the step being passed to my WandbLogger is inconsistent.

After resuming, in training steps the step count is set to `batch_idx`, which starts from 0 even if the training was resumed. However, for validation steps the step count is set to `self.trainer.global_step`, which starts from the `global_step` recovered from the checkpoint. Why isn't the step count during training set to `self.trainer.global_step` as well?

This discrepancy is causing my WandbLogger to drop logs. During the first epoch of a resumed run, the step counts passed to the WandbLogger are 0, 1, 2, ... (because they are the `batch_idx`), but then, during validation, they suddenly jump to `self.trainer.global_step`, which is much higher because the run was resumed. Then, as the second epoch starts, the step count goes back to being `batch_idx`, which is lower than `self.trainer.global_step`. The WandbLogger then sees that the internal step is not monotonically increasing, and drops the logs.