Trainer "optimizers" attribute is None when saving checkpoint and callbacks list is not empty #2936

Closed · Fixed by #3892
import-antigravity opened this issue Aug 12, 2020 · 10 comments
Labels: bug, checkpointing, help wanted, waiting on author
@import-antigravity commented Aug 12, 2020

🐛 Bug

I'm training a GAN and running a few custom callbacks as well. When the trainer attempts to save a checkpoint at the end of the first epoch, it crashes. Here's the strange part: the exact same code in a Jupyter notebook doesn't produce the error.

To Reproduce

Steps to reproduce the behavior:

The bug does not occur when the callbacks list passed to the trainer is empty. None of the callbacks I'm using has anything to do with saving checkpoints; they all just log information about the model. Enabling any one of them causes the error, while running the exact same code in Jupyter produces no crash (a self-contained sketch follows).
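For reference, a self-contained sketch of the pattern that triggers it (hypothetical minimal module and no-op callback, not the actual GAN code from this report; it may or may not reproduce on every version):

import torch
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import Callback

class NoopCallback(Callback):
    """Stand-in for the custom logging callbacks: does nothing."""
    pass

class TinyModule(LightningModule):
    """Hypothetical minimal module, not the GAN from this report."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {'loss': torch.nn.functional.mse_loss(self.layer(x), y)}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        data = TensorDataset(torch.randn(8, 4), torch.randn(8, 1))
        return DataLoader(data, batch_size=4)

# With callbacks=[] the run completes; with any callback present,
# the end-of-epoch checkpoint save crashes as in the trace below.
trainer = Trainer(max_epochs=1, callbacks=[NoopCallback()])
trainer.fit(TinyModule())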

Stack trace:

Traceback (most recent call last):
  File "mnist-dense-gan-convergence.py", line 55, in <module>
    main(args)
  File "mnist-dense-gan-convergence.py", line 45, in main
    trainer.fit(gan)
  File "/Users/robbie/.conda/envs/ganresearch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1044, in fit
    results = self.run_pretrain_routine(model)
  File "/Users/robbie/.conda/envs/ganresearch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1213, in run_pretrain_routine
    self.train()
  File "/Users/robbie/.conda/envs/ganresearch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 370, in train
    self.run_training_epoch()
  File "/Users/robbie/.conda/envs/ganresearch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 502, in run_training_epoch
    self.check_checkpoint_callback(should_check_val)
  File "/Users/robbie/.conda/envs/ganresearch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 513, in check_checkpoint_callback
    [c.on_validation_end(self, self.get_model()) for c in checkpoint_callbacks]
  File "/Users/robbie/.conda/envs/ganresearch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 513, in <listcomp>
    [c.on_validation_end(self, self.get_model()) for c in checkpoint_callbacks]
  File "/Users/robbie/.conda/envs/ganresearch/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py", line 12, in wrapped_fn
    return fn(*args, **kwargs)
  File "/Users/robbie/.conda/envs/ganresearch/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 309, in on_validation_end
    self._do_check_save(filepath, current, epoch)
  File "/Users/robbie/.conda/envs/ganresearch/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 346, in _do_check_save
    self._save_model(filepath)
  File "/Users/robbie/.conda/envs/ganresearch/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 168, in _save_model
    self.save_function(filepath, self.save_weights_only)
  File "/Users/robbie/.conda/envs/ganresearch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py", line 268, in save_checkpoint
    checkpoint = self.dump_checkpoint(weights_only)
  File "/Users/robbie/.conda/envs/ganresearch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py", line 350, in dump_checkpoint
    for i, optimizer in enumerate(self.optimizers):
TypeError: 'NoneType' object is not iterable
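The failing line iterates self.optimizers directly, so a trainer whose optimizers were never populated (or were reset to None) crashes on save. As a sketch of the guard pattern (treat a missing or None attribute as "no optimizers"), here is a hypothetical helper, not the actual library code:

def safe_optimizer_states(trainer):
    """Collect optimizer state dicts, tolerating trainer.optimizers
    being None; iteration over [] is then a harmless no-op."""
    optimizers = getattr(trainer, 'optimizers', None) or []
    return [opt.state_dict() for opt in optimizers]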

Code sample

Here is the relevant part of my setup code:

inception_callback = GANInceptionScorer(classifier, logits=True, sample_size=1000, input_shape=(-1, 1, 28, 28))

log_dir = os.path.abspath('../logs/mnist-dense-gan-convergence')

params = ParameterMatrixCallback()

callbacks = [
    GANProgressBar(),
    GANTensorboardImageView(),
    params,
    inception_callback
]

trainer_args = {
    'max_epochs': 100,
    'default_root_dir': log_dir,
    'callbacks': callbacks,
    'progress_bar_refresh_rate': 0
}

print(log_dir)
try:
    trainer = Trainer(gpus=1, **trainer_args)
except MisconfigurationException:
    trainer = Trainer(**trainer_args)

trainer.fit(gan)

and the same code in Jupyter:

inception_callback = GANInceptionScorer(classifier, logits=True, sample_size=1000, input_shape=(-1, 1, 28, 28))

log_dir = os.path.abspath('../logs/mnist-gan-dense')

params = ParameterMatrixCallback()

trainer_args = {
    'max_epochs': 200, 
    'callbacks': [GANProgressBar(), GANTensorboardImageView(n=4), params, inception_callback],
    'progress_bar_refresh_rate': 0, 
    'default_root_dir': log_dir
}

t = Trainer(**trainer_args)

Expected behavior

Checkpoint saving should succeed at the end of the epoch; passing unrelated logging callbacks to the Trainer should not affect it.

Environment

  • PyTorch Version (e.g., 1.0): 1.3.1
  • OS (e.g., Linux): macOS
  • How you installed PyTorch (conda, pip, source): conda
  • Python version: 3.7
  • Any other relevant information: pytorch-lightning 0.8.5

import-antigravity added the bug and help wanted labels Aug 12, 2020

@import-antigravity (Author)

Additional info: here are the relevant methods in my GAN class:

from abc import ABC, abstractmethod

from torch import optim
from torch.optim import Optimizer
from pytorch_lightning import LightningModule

class GAN(LightningModule, ABC):
    ...

    @abstractmethod
    def g_optimizer(self) -> Optimizer:
        pass

    @abstractmethod
    def d_optimizer(self) -> Optimizer:
        pass

    def configure_optimizers(self):
        return self.g_optimizer(), self.d_optimizer()

class MnistGanDense(GAN):
    ...

    def g_optimizer(self) -> Optimizer:
        return optim.RMSprop(self.G.parameters(), self.hparams['learning_rate'])

    def d_optimizer(self) -> Optimizer:
        return optim.RMSprop(self.D.parameters(), self.hparams['learning_rate'])
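
For what it's worth, configure_optimizers may return a single optimizer, a list/tuple of optimizers, or a two-tuple of (optimizers, schedulers), so the tuple return above is a supported form. An equivalent, more explicit variant (sketch only):

    def configure_optimizers(self):
        # Equivalent list form: one optimizer each for the generator
        # and the discriminator, no schedulers.
        return [self.g_optimizer(), self.d_optimizer()]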

@williamFalcon (Contributor)

Could you try 0.9.0rc12?

@import-antigravity (Author)

Is there a way to do that with conda?

@justusschock (Member)

Inside your conda environment you can also install it with pip.

@williamFalcon (Contributor)

Inside conda you can always install with pip:

pip install pytorch-lightning==0.9.0rc13

If this is still an issue, happy to reopen

@deekshadangwal commented Sep 1, 2020

This is still a problem for me. I updated to 0.9.1rc1 and still get this error. Here is my trace:

Traceback (most recent call last):
  File "train_unet.py", line 270, in <module>
    trainer.save_checkpoint(args.save_checkpoint_path)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_io.py", line 275, in save_checkpoint
    checkpoint = self.dump_checkpoint(weights_only)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_io.py", line 360, in dump_checkpoint
    for i, optimizer in enumerate(self.optimizers):
TypeError: 'NoneType' object is not iterable
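Note that this trace comes from a direct trainer.save_checkpoint(...) call in the training script rather than from a checkpoint callback; if the trainer's optimizers were never set up by that point, the same loop fails. A hedged user-side guard (trainer and args.save_checkpoint_path are from the trace above; model is assumed to be the script's LightningModule):

import torch

# Only ask the trainer to dump optimizer state if it actually has
# optimizers; otherwise fall back to saving raw weights.
if getattr(trainer, 'optimizers', None):
    trainer.save_checkpoint(args.save_checkpoint_path)
else:
    torch.save(model.state_dict(), args.save_checkpoint_path)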

@import-antigravity (Author)

@williamFalcon could you open this again? I'm still getting the error as well

edenlightning reopened this Sep 8, 2020
edenlightning added this to the 0.9.x milestone Sep 8, 2020
edenlightning added the checkpointing and priority: 0 (high priority) labels Sep 16, 2020
@awaelchli (Contributor)

@rohitgr7 didn't we recently make optimizers init to an empty list instead of None? I think this should solve the problem. Could you check?
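
The change being referred to is, roughly, initializing the attribute to an empty list so that iterating it becomes a no-op instead of a crash. A hypothetical reduction (not the exact diff):

class TrainerSketch:
    """Hypothetical reduction of the referenced change."""
    def __init__(self):
        # Empty lists make `for opt in self.optimizers` a safe no-op
        # (previously these attributes were initialized to None).
        self.optimizers = []
        self.lr_schedulers = []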

@rohitgr7 (Contributor)

@awaelchli yes, it's an empty list now. But the LightningModule defined above does define its optimizers, so I'm not sure yet what the issue is there.

@import-antigravity mind checking this on master?

@Borda (Member) commented Oct 2, 2020

@deekshadangwal mind sharing a full code sample so we can reproduce your issue?

Borda added the waiting on author label and removed the priority: 0 label Oct 2, 2020
edenlightning modified the milestone: 0.9.x → 1.0 Oct 4, 2020
williamFalcon self-assigned this Oct 6, 2020
williamFalcon added commits that referenced this issue Oct 6, 2020