calling trainer.test() between epochs has side effects in 0.5.3.2 #517

Closed
sneiman opened this issue Nov 15, 2019 · 3 comments · Fixed by #1017
Labels
bug Something isn't working

Comments

sneiman (Contributor) commented Nov 15, 2019

New user of Lightning. First downloaded on Oct 15, and updated today, Nov 15, to 0.5.3.2. Running on Ubuntu 18.04.3 LTS, PyTorch 1.3, Python 3.6.8. No virtual environment.

I call trainer.test() in on_epoch_end() at intervals during training; this speeds up comparisons to other model architectures.

This worked perfectly in the prior version. The test sequence ran as expected, calling test_step() and test_end() per the spec. Summary reporting to TensorBoard, model status, etc. were all as expected. After the call to test_end(), training continued at the next epoch. This behavior repeated with no problems each time trainer.test() was called throughout training.

This no longer works as expected in the new version; two problems appear:

#1 If early stopping is left at its default, the training loop exits after the first call to trainer.test() completes. The exit appears normal, as if the call to trainer.test() had triggered an early stopping condition.

#2 If early stopping is turned off by setting 'early_stop_callback=None', the first call to trainer.test() executes as expected, and training continues as expected. However, trainer.test() is then called after EVERY epoch. These extra calls do not originate in my code.
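
For reference, the Trainer is set up roughly like this (a minimal sketch, not my exact code; Net and the numbers are placeholders, and the argument names are the 0.5.x ones):

    from pytorch_lightning import Trainer

    model   = Net()                         # my LightningModule (hooks shown below)
    trainer = Trainer(
        max_nb_epochs=100,                  # 0.5.x-era argument name
        early_stop_callback=None,           # case #2: early stopping disabled
    )
    trainer.fit(model)                      # trainer.test() is then called from on_epoch_end()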

Here is the code making the call:

    def on_epoch_end(self):
        #graph errors: avg training and validation loss
        if self.epochReport:
            self.logger.experiment.add_scalars('Stats_epoch/loss', {'avg_trn_loss': mean([x.item() for x in self.trn_loss]), 'avg_val_loss': mean([x.item() for x in self.val_loss])}, self.global_step)
            self.logger.experiment.add_scalars('Stats_epoch/acc',  {'avg_trn_acc':  mean([x for x in self.trn_acc]),         'avg_val_acc':  mean([x for x in self.val_acc])},         self.global_step)

        if (((self.current_epoch+1)%self.test_prd)==0) or ((self.current_epoch+1)==self.max_nb_epochs):
            msg("on_epoch_end")      # for debugging
            self.trainer.test()
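
For context, that snippet relies on a few helpers and attributes defined elsewhere in my module; roughly this (a sketch of the assumptions, not the exact code):

    from statistics import mean             # mean() used in on_epoch_end above

    def msg(text):
        print(f'[debug] {text}')            # simple debug print helper

    # set in the LightningModule's __init__ (values are placeholders):
    #   self.epochReport   = True           # log per-epoch stats to TensorBoard
    #   self.test_prd      = 5              # run trainer.test() every test_prd epochs
    #   self.max_nb_epochs = 100
    #   self.trn_loss, self.val_loss = [], []   # per-batch losses collected during the epoch
    #   self.trn_acc,  self.val_acc  = [], []   # per-batch accuracies collected during the epoch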

Here is the test-related code:


    def test_step(self, batch, batch_nb):
        imgs, labels        = batch
        out                 = self.forward(imgs)
        loss                = self.loss(out, labels)

        # stats: calc accuracy, save loss, acc
        # accuracy by category
        out_idx             = torch.argmax(out, 1)
        c                   = (out_idx==labels)
        for i in range(len(c)):
            self.cls_cor[labels[i]] += c[i].item()
            self.cls_tot[labels[i]] += 1

        # acc overall (computed for reference; not currently returned)
        acc                 = c.sum().item()/labels.shape[0]
        return {'test_loss': loss}


    def test_end(self, outputs):

        avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()

        # graph test loss, accuracy
        if self.tstReport:
            # get overall accuracy, add text; get accuracy per category, add text; log, reset
            tst_acc   = self.cls_cor.sum() / self.cls_tot.sum()
            text      = f'Overall accuracy of the network on {int(self.cls_tot[0]):d} test images: {100.0 * tst_acc:4.1f}%  \n'
            for i in range(len(self.cls_cor)):
                text += f'Accuracy of {self.labels[i]} : {100.0 * (self.cls_cor[i] / self.cls_tot[i]):4.1f}%  \n'

            self.logger.experiment.add_text(   'Test: accuracy',        text,                                 self.global_step)
            self.logger.experiment.add_scalars('Stats_epoch/tst_loss', {'avg tst_loss': avg_loss}, self.global_step)
            self.logger.experiment.add_scalars('Stats_epoch/tst_acc',  {'avg tst_acc' : tst_acc*100.0},      self.global_step)

        # clear per-class counters, whether reporting or not
        for i in range(len(self.cls_cor)):
            self.cls_cor[i] = self.cls_tot[i] = 0

        return {'avg_test_loss': avg_loss}
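
The per-class counters and flags used in the test code come from __init__; roughly this (again a sketch, with a placeholder class count and label names):

    import torch
    import pytorch_lightning as pl

    class Net(pl.LightningModule):              # sketch; the real module also defines the hooks above
        def __init__(self, n_classes=10):       # n_classes is a placeholder
            super().__init__()
            self.tstReport = True                                       # log test summaries to TensorBoard
            self.cls_cor   = torch.zeros(n_classes)                     # correct predictions per class
            self.cls_tot   = torch.zeros(n_classes)                     # samples seen per class
            self.labels    = [f'class_{i}' for i in range(n_classes)]   # display names per class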

Any input appreciated.

seth

sneiman added the bug (Something isn't working) label on Nov 15, 2019
williamFalcon (Contributor) commented

@sneiman still having this issue?

sneiman (Contributor, Author) commented Dec 4, 2019 via email

sneiman (Contributor, Author) commented Dec 5, 2019

I am still having this issue. I am not sure whether you want me to try it on a new version; I did reinstall on the off chance that the repo had been updated. Let me know if you want me to do something else.

s
