stats logging in "on_train_epoch_end" ends up on wrong progress bar #19322
Comments
Injecting a new line:
@jojje Did you mean this approach?
Hey @jojje It could be seen as an issue or not; it depends. Regarding "why does Epoch 2 show twice": it is because you have print statements, and the TQDM bar will continue to write updates to the progress bar after your prints. If you want to avoid that, use self.print.
@awaelchli I tried the two changes you proposed. It solved the "off by one" problem, but at the cost of a performance hit. It also doesn't solve the problem of the individual epoch progress bars vanishing, which loses data in the console output. Change:
@@ -7,5 +7,4 @@ class DemoNet(pl.LightningModule):
super().__init__()
self.fc = torch.nn.Linear(784, 10)
- self.batch_losses = []
def configure_optimizers(self):
@@ -17,12 +16,9 @@ class DemoNet(pl.LightningModule):
yh = self.fc(x)
loss = torch.nn.functional.cross_entropy(yh, y)
- self.batch_losses.append(loss)
+ self.log('loss', loss, on_step=False, on_epoch=True, prog_bar=True)
return loss
def on_train_epoch_end(self):
- loss = torch.stack(self.batch_losses).mean()
- self.log('loss', loss, on_step=False, on_epoch=True, prog_bar=True)
- self.batch_losses.clear()
- print("")
+ self.print("")
ds = torchvision.datasets.MNIST(root="dataset/", train=True, transform=torchvision.transforms.ToTensor(), download=True)
Resulting output:
As you can see, the individual epoch progress bars still vanish.
The reason I didn't let Lightning calculate the stats automatically via the on_epoch flag is that it's expensive. On my test run above, training takes a 25% performance (throughput) hit when logging on each training step.

Right now I'm just in an evaluation phase, seeing whether Lightning might be something we can use going forward, but these initial training-101 ergonomics have put such notions on ice. I like the idea of bringing more structure to training, but I unfortunately can't sell the idea of a new framework without even the basics being handled correctly, which is why I opened this issue. I look forward to hearing further suggestions on how to leverage Lightning correctly, so as to pass the initial sniff test ;)

To reiterate, the composite objective is: log the epoch statistics on that epoch's progress bar, keep every epoch's bar visible in the console output, and avoid the throughput penalty of logging on every training step.
Update: a workaround that makes Lightning log as expected:
The key bit of information here is the need to subclass the TQDMProgressBar. It would be great if every user didn't have to deal with all that boilerplate for every project, and the TQDMProgressBar constructor instead took an optional argument such as leave: bool (same as tqdm) that the callback could check to decide whether to close the progress bars or not. E.g.:
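A minimal sketch of such a subclass, assuming Lightning 2.x's TQDMProgressBar with its init_train_tqdm() hook and settable train_progress_bar property; the leave flag is the proposed (not yet existing) constructor argument, implemented by hand here:

```python
from lightning.pytorch.callbacks import TQDMProgressBar

class PersistentTQDMProgressBar(TQDMProgressBar):
    """Keep each epoch's training bar in the console instead of reusing one bar."""

    def __init__(self, *args, leave: bool = True, **kwargs):
        super().__init__(*args, **kwargs)
        self._leave = leave  # hypothetical flag mirroring tqdm's own `leave`

    def on_train_epoch_start(self, trainer, pl_module):
        if self._leave and trainer.current_epoch > 0:
            # Finish the previous epoch's bar without erasing it,
            # then start a fresh bar for the new epoch.
            self.train_progress_bar.leave = True
            self.train_progress_bar.close()
            self.train_progress_bar = self.init_train_tqdm()
        super().on_train_epoch_start(trainer, pl_module)
```

It would then be passed to the trainer like any other callback, e.g. Trainer(callbacks=[PersistentTQDMProgressBar()]).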
A PR for discussion and review has been submitted to address this issue. Reviewer note: there was a failed test, but it seems entirely unrelated. In fact, the change was made such that there is zero change in behavior by default, and explicitly setting a new flag (which no existing tests could possibly be aware of) is required to enable the new behavior, so I don't see how this change could possibly be related to the failure of core/test_metric_result_integration.py::test_result_reduce_ddp
Bug description
When logging statistics at the end of an epoch from within on_train_epoch_end, the statistics end up on the wrong progress bar.

Since there doesn't seem to be a configuration to tell Lightning or the TQDMProgressBar to retain the bar for each epoch, I've been forced to inject a new line after each epoch ends, in order not to lose any of the valuable statistics in the console output.

The following is the output from a 3-epoch run:
If there is a proper way to retain the progress bar for each epoch that is different from what I'm doing, then please let me know and this ticket can then be closed. If not, hopefully a fix can be found.
What version are you seeing the problem on?
v2.1
How to reproduce the bug
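A sketch of the reproduction, assembled from the diff in the comments above; the optimizer body, batch handling, and the Trainer/DataLoader calls are filled-in assumptions rather than the original script:

```python
import lightning.pytorch as pl
import torch
import torchvision

class DemoNet(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(784, 10)
        self.batch_losses = []

    def configure_optimizers(self):
        # assumption: a plain Adam optimizer
        return torch.optim.Adam(self.parameters())

    def training_step(self, batch, batch_idx):
        x, y = batch
        x = x.view(x.size(0), -1)  # flatten 28x28 MNIST images to 784 features
        yh = self.fc(x)
        loss = torch.nn.functional.cross_entropy(yh, y)
        self.batch_losses.append(loss)
        return loss

    def on_train_epoch_end(self):
        # log the mean epoch loss; this is the value that lands on the wrong bar
        loss = torch.stack(self.batch_losses).mean()
        self.log('loss', loss, on_step=False, on_epoch=True, prog_bar=True)
        self.batch_losses.clear()
        print("")  # inject a new line so the finished bar isn't overwritten

ds = torchvision.datasets.MNIST(root="dataset/", train=True,
                                transform=torchvision.transforms.ToTensor(),
                                download=True)
loader = torch.utils.data.DataLoader(ds, batch_size=64)  # assumption
trainer = pl.Trainer(max_epochs=3)                        # assumption
trainer.fit(DemoNet(), loader)
```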
Error messages and logs
N/A
Environment
Current environment
Lightning:
System:
CUDA:
How you installed Lightning (conda, pip, source): pip
Running environment of LightningApp (e.g. local, cloud): local
More info
No response
cc @carmocca