calling trainer.test() between epochs has side effects in 0.5.3.2 #517

Closed
sneiman opened this issue Nov 15, 2019 · 3 comments · Fixed by #1017
Labels
bug Something isn't working

Comments

sneiman (Contributor) commented Nov 15, 2019

New user of Lightning. First downloaded on Oct 15, and updated today, Nov 15, to 0.5.3.2. Running on Ubuntu 18.04.3 LTS, PyTorch 1.3, Python 3.6.8. No virtual environment.

I call trainer.test() in on_epoch_end() at intervals during training; this speeds up comparisons to other model architectures.

This worked perfectly in the prior version. The test sequence ran as expected, calling test_step() and test_end() per the spec. Summary reporting to TensorBoard, model status, etc. were all as expected. After the call to test_end(), training continued at the next epoch. This behavior repeated with no problems each time trainer.test() was called throughout training.

This no longer works as expected in the new version; two problems appear:

#1 If early stopping is left at its default, the training loop exits after the first call to trainer.test() completes. The exit appears normal, as if the call to trainer.test() had triggered an early stopping condition.

#2 If early stopping is turned off by setting 'early_stop_callback=None', the first call to trainer.test() executes as expected, and training continues as expected. However, trainer.test() is then called after EVERY epoch. These extra calls do not originate in my code.
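
For reference, the Trainer is set up roughly like this (a minimal sketch, not my exact code; Net and the numbers are placeholders, and the argument names are the 0.5.x ones):

    from pytorch_lightning import Trainer

    model   = Net()                         # my LightningModule (hooks shown below)
    trainer = Trainer(
        max_nb_epochs=100,                  # 0.5.x-era argument name
        early_stop_callback=None,           # case #2: early stopping disabled
    )
    trainer.fit(model)                      # trainer.test() is then called from on_epoch_end()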

Here is the code making the call:

    def on_epoch_end(self):
        #graph errors: avg training and validation loss
        if self.epochReport:
            self.logger.experiment.add_scalars('Stats_epoch/loss', {'avg_trn_loss': mean([x.item() for x in self.trn_loss]), 'avg_val_loss': mean([x.item() for x in self.val_loss])}, self.global_step)
            self.logger.experiment.add_scalars('Stats_epoch/acc',  {'avg_trn_acc':  mean([x for x in self.trn_acc]),         'avg_val_acc':  mean([x for x in self.val_acc])},         self.global_step)

        if (((self.current_epoch+1)%self.test_prd)==0) or ((self.current_epoch+1)==self.max_nb_epochs):
            msg("on_epoch_end")      # for debugging
            self.trainer.test()
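
For context, that snippet relies on a few helpers and attributes defined elsewhere in my module; roughly this (a sketch of the assumptions, not the exact code):

    from statistics import mean             # mean() used in on_epoch_end above

    def msg(text):
        print(f'[debug] {text}')            # simple debug print helper

    # set in the LightningModule's __init__ (values are placeholders):
    #   self.epochReport   = True           # log per-epoch stats to TensorBoard
    #   self.test_prd      = 5              # run trainer.test() every test_prd epochs
    #   self.max_nb_epochs = 100
    #   self.trn_loss, self.val_loss = [], []   # per-batch losses collected during the epoch
    #   self.trn_acc,  self.val_acc  = [], []   # per-batch accuracies collected during the epoch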

Here is the test-related code:


    def test_step(self, batch, batch_nb):
        imgs, labels        = batch
        out                 = self.forward(imgs)
        loss                = self.loss(out, labels)

        # stats: calc accuracy, save loss, acc
        # accuracy by category
        out_idx             = torch.argmax(out, 1)
        c                   = (out_idx==labels)
        for i in range(len(c)):
            self.cls_cor[labels[i]] += c[i].item()
            self.cls_tot[labels[i]] += 1

        # acc overall (computed for reference; not currently returned)
        acc                 = c.sum().item()/labels.shape[0]
        return {'test_loss': loss}


    def test_end(self, outputs):

        avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()

        # graph test loss, accuracy
        if self.tstReport:
            # get overall accuracy, add text; get accuracy per category, add text; log, reset
            tst_acc   = self.cls_cor.sum() / self.cls_tot.sum()
            text      = f'Overall accuracy of the network on {int(self.cls_tot[0]):d} test images: {100.0 * tst_acc:4.1f}%  \n'
            for i in range(len(self.cls_cor)):
                text += f'Accuracy of {self.labels[i]} : {100.0 * (self.cls_cor[i] / self.cls_tot[i]):4.1f}%  \n'

            self.logger.experiment.add_text(   'Test: accuracy',        text,                                 self.global_step)
            self.logger.experiment.add_scalars('Stats_epoch/tst_loss', {'avg tst_loss': avg_loss}, self.global_step)
            self.logger.experiment.add_scalars('Stats_epoch/tst_acc',  {'avg tst_acc' : tst_acc*100.0},      self.global_step)

        # clear per-class counters, whether reporting or not
        for i in range(len(self.cls_cor)):
            self.cls_cor[i] = self.cls_tot[i] = 0

        return {'avg_test_loss': avg_loss}
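
The per-class counters and flags used in the test code come from __init__; roughly this (again a sketch, with a placeholder class count and label names):

    import torch
    import pytorch_lightning as pl

    class Net(pl.LightningModule):              # sketch; the real module also defines the hooks above
        def __init__(self, n_classes=10):       # n_classes is a placeholder
            super().__init__()
            self.tstReport = True                                       # log test summaries to TensorBoard
            self.cls_cor   = torch.zeros(n_classes)                     # correct predictions per class
            self.cls_tot   = torch.zeros(n_classes)                     # samples seen per class
            self.labels    = [f'class_{i}' for i in range(n_classes)]   # display names per class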

Any input appreciated.

seth

sneiman added the bug (Something isn't working) label on Nov 15, 2019
williamFalcon (Contributor) commented

@sneiman still having this issue?

sneiman (Contributor, Author) commented Dec 4, 2019 via email

sneiman (Contributor, Author) commented Dec 5, 2019

I am still having this issue. I am not sure whether you want me to try it on a new version; I did reinstall on the off chance that the repo had been updated. Let me know if you want me to do something else.

s
