New user of Lightning. First downloaded on Oct 15, and updated today, Nov 15, to 0.5.3.2. Working on Ubuntu 18.04.3 LTS, PyTorch 1.3, Python 3.6.8, no virtual environment.
I call trainer.test() in on_epoch_end() at intervals during training; this speeds up comparisons with other model architectures.
This worked perfectly in the prior version. The test sequence ran as expected, calling test_step() and test_end() per the spec. Summary reporting to TensorBoard, model status, etc. all behaved as expected. After the call to test_end(), training continued at the next epoch, and the behavior repeated without problems each time trainer.test() was called during training.
The new version does not behave this way.
#1 If early stopping is left at its default, the training loop exits after the first call to trainer.test() completes. The exit appears normal, as if the call to trainer.test() had satisfied an early-stopping condition.
#2 If early stopping is turned off by setting early_stop_callback=None, the first call to trainer.test() executes as expected, and training continues as expected. However, trainer.test() is now called after EVERY epoch. These extra calls do not originate in my code.
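For reference, a minimal sketch of the two Trainer configurations involved (argument names follow the 0.5.x API as I understand it; max_nb_epochs=100 and model are placeholders for my actual settings):

    from pytorch_lightning import Trainer

    # Case #1: early stopping left at its default --
    # training exits after the first in-training trainer.test() completes.
    trainer = Trainer(max_nb_epochs=100)

    # Case #2: early stopping disabled --
    # training continues, but trainer.test() now runs after EVERY epoch.
    trainer = Trainer(max_nb_epochs=100, early_stop_callback=None)

    trainer.fit(model)  # model is the LightningModule whose hooks appear below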
Here is the code making the call:
def on_epoch_end(self):
    # graph errors: avg training and validation loss
    # (assumes `from statistics import mean` at module level)
    if self.epochReport:
        self.logger.experiment.add_scalars(
            'Stats_epoch/loss',
            {'avg_trn_loss': mean([x.item() for x in self.trn_loss]),
             'avg_val_loss': mean([x.item() for x in self.val_loss])},
            self.global_step)
        self.logger.experiment.add_scalars(
            'Stats_epoch/acc',
            {'avg_trn_acc': mean(self.trn_acc),
             'avg_val_acc': mean(self.val_acc)},
            self.global_step)
    # run the test pass every test_prd epochs and on the final epoch
    if ((self.current_epoch + 1) % self.test_prd == 0) or ((self.current_epoch + 1) == self.max_nb_epochs):
        msg("on_epoch_end")  # msg() is my own debug helper
        self.trainer.test()
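To check where the unexpected per-epoch calls come from, one option is to print the call stack at the top of test_step() (shown below); a minimal sketch using only the standard library:

    import traceback

    def test_step(self, batch, batch_nb):
        # On the first batch of each test pass, show who triggered it;
        # frames outside my module point at the trainer internals.
        if batch_nb == 0:
            traceback.print_stack(limit=10)
        ...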
Here is the test-related code:
def test_step(self, batch, batch_nb):
    imgs, labels = batch
    out = self.forward(imgs)
    loss = self.loss(out, labels)
    # stats: accuracy per category, accumulated for test_end
    out_idx = torch.argmax(out, 1)
    c = (out_idx == labels)
    for i in range(len(c)):
        self.cls_cor[labels[i]] += c[i].item()
        self.cls_tot[labels[i]] += 1
    # overall accuracy for this batch (computed but not returned;
    # test_end derives overall accuracy from the per-category counts)
    acc = c.sum().item() / labels.shape[0]
    return {'test_loss': loss}
def test_end(self, outputs):
    avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()
    # graph test loss, accuracy
    if self.tstReport:
        # get overall accuracy, add text; get accuracy per category, add text; log
        tst_acc = self.cls_cor.sum() / self.cls_tot.sum()
        text = f'Overall accuracy of the network on {int(self.cls_tot[0]):d} test images: {100.0 * tst_acc:4.1f}% \n'
        for i in range(len(self.cls_cor)):
            text += f'Accuracy of {self.labels[i]} : {100.0 * (self.cls_cor[i] / self.cls_tot[i]):4.1f}% \n'
        self.logger.experiment.add_text('Test: accuracy', text, self.global_step)
        self.logger.experiment.add_scalars('Stats_epoch/tst_loss', {'avg tst_loss': avg_loss}, self.global_step)
        self.logger.experiment.add_scalars('Stats_epoch/tst_acc', {'avg tst_acc': tst_acc * 100.0}, self.global_step)
    # clear per-category counters, whether reporting or not
    for i in range(len(self.cls_cor)):
        self.cls_cor[i] = self.cls_tot[i] = 0
    return {'avg_test_loss': avg_loss}
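For problem #1, one way to isolate the default early-stopping behavior is to pass an explicit EarlyStopping callback instead of relying on the default; a minimal sketch, assuming the 0.5.x callbacks API (the import path and argument names may differ across versions):

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import EarlyStopping

    # An explicit callback makes the stopping criterion visible and tunable;
    # verbose=True should log when (and why) it decides to stop.
    early_stop = EarlyStopping(monitor='val_loss', patience=3, verbose=True)
    trainer = Trainer(max_nb_epochs=100, early_stop_callback=early_stop)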
Any input appreciated.
seth
I am still having this issue. I am not sure whether you want me to try it on a newer version; I did reinstall on the off chance that the repo had been updated. Let me know if you want me to do something else.