loss=None and no logs when automatic_optimization=False #4204

Closed
denadai2 opened this issue Oct 17, 2020 · 13 comments · Fixed by #4476
Labels: bug (Something isn't working), docs (Documentation related), logger (Related to the Loggers)

@denadai2

🐛 Bug

I think there is a bug when automatic_optimization=False: the loss is None (https://github.com/PyTorchLightning/pytorch-lightning/blob/72f19768c828b734d8565ffef7b78fb9a57ba847/pytorch_lightning/trainer/training_loop.py#L336), which means that none of the checkpoint callbacks can work. There is no way to set the loss.

In addition, in the documentation (https://pytorch-lightning.readthedocs.io/en/latest/optimizers.html#manual-optimization) the training_step does not return anything. However, if it does not return anything, none of the logs work because of: https://github.com/PyTorchLightning/pytorch-lightning/blob/72f19768c828b734d8565ffef7b78fb9a57ba847/pytorch_lightning/trainer/training_loop.py#L681.

Expected behavior

There should be a way to set the loss, and the behaviour when nothing is returned in training_step should be clear.

Environment

* CUDA:
        - GPU:
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.19.1
        - pyTorch_debug:     False
        - pyTorch_version:   1.6.0
        - pytorch-lightning: 1.0.2
        - tqdm:              4.48.2
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.6.9
        - version:           #26-Ubuntu SMP Mon Jun 24 09:32:08 UTC 2019
@denadai2 denadai2 added bug Something isn't working help wanted Open to be worked on labels Oct 17, 2020
@github-actions
Contributor

Hi! Thanks for your contribution! Great first issue!

@edenlightning edenlightning added this to the 1.0.3 milestone Oct 19, 2020
@edenlightning edenlightning added priority: 0 High priority task checkpointing Related to checkpointing labels Oct 20, 2020
@edenlightning
Contributor

Also #4295

@SeanNaren SeanNaren added the docs Documentation related label Nov 1, 2020
@SeanNaren
Contributor

Thanks @denadai2! I'll modify the doc example to report the loss value, for anyone who cares about logging their loss values (which is most cases!).

@denadai2
Author

denadai2 commented Nov 2, 2020

@SeanNaren actually, it's not only about the docs. If the loss is nan, PyTorch Lightning also skips writing ALL the logged variables because of: https://github.com/PyTorchLightning/pytorch-lightning/blob/72f19768c828b734d8565ffef7b78fb9a57ba847/pytorch_lightning/trainer/training_loop.py#L681

@SeanNaren
Contributor

SeanNaren commented Nov 2, 2020

Thanks @denadai2. Just to confirm, the logic is as follows:

* You've overridden training_step and set automatic_optimization to False.
* Your training_step now logs metrics using self.log, but never returns a loss (you may have multiple losses, for example).

You would like metrics other than the loss to be logged during training, but because of this line:

https://github.com/PyTorchLightning/pytorch-lightning/blob/f40d08679d31ef6e705f1e0e5a66473c817325e1/pytorch_lightning/trainer/training_loop.py#L721

we never reach this part of the code:
https://github.com/PyTorchLightning/pytorch-lightning/blob/f40d08679d31ef6e705f1e0e5a66473c817325e1/pytorch_lightning/trainer/training_loop.py#L725-L730

I think custom metrics logged in callbacks are the only thing that will not be logged for now (a major refactor is coming which should fix this: #4439). In the meantime, you'll need to log using self.log within the LightningModule functions. Let me know if this solves your issue!
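
The control flow described in this comment can be modelled with a small, self-contained sketch. This is a toy stand-in and not Lightning's source: `run_step`, `step_without_return`, `step_with_return`, and the `pending` and `logged` containers are all hypothetical names, and the early return only imitates the kind of guard the linked line is said to apply.

```python
from typing import Callable, Dict, List, Optional

# Toy model of the behaviour discussed in this thread. NOT Lightning code:
# every name below is hypothetical and exists only for illustration.
Metrics = Dict[str, float]


def run_step(step_fn: Callable[[Metrics], Optional[float]],
             logged: List[Metrics]) -> Optional[float]:
    """Run one step and flush its metrics unless the step returned None."""
    pending: Metrics = {}      # metrics collected while the step runs
    loss = step_fn(pending)    # the step fills `pending` and may return a loss
    if loss is None:
        return None            # early exit: `pending` is never flushed
    logged.append(pending)     # only reached when a loss was returned
    return loss


def step_without_return(pending: Metrics) -> None:
    pending["train_ci"] = 0.9  # metric is recorded, but nothing is returned


def step_with_return(pending: Metrics) -> float:
    pending["train_ci"] = 0.9
    return 3.5                 # returning the loss lets the metrics through


logged: List[Metrics] = []
run_step(step_without_return, logged)  # metrics are dropped
run_step(step_with_return, logged)     # metrics are kept
```

In this model, only the step that returns a value gets its metrics flushed, which matches the symptom reported above: logging silently stops when training_step returns nothing.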

@SeanNaren SeanNaren reopened this Nov 2, 2020
@edenlightning edenlightning added logger Related to the Loggers and removed checkpointing Related to checkpointing labels Nov 3, 2020
@edenlightning
Contributor

Currently blocked on #4495

@edenlightning edenlightning added the bug Something isn't working label Nov 3, 2020
@asalimih

asalimih commented Nov 4, 2020

Hi,
I'm not sure whether my problem is related to this bug, but it seems so.
I wanted to call backward and step in the middle of training_step, so I set automatic_optimization to False. I'm logging in training_step and validation_step as follows:

def training_step(self, batch, batch_idx):
    ...
    # .item() turns the one-element loss tensor into a Python float for logging
    self.log('train_loss', loss.view(1,).item(), prog_bar=True)
    self.log('train_ci', train_cindex, prog_bar=True)
    return loss

def validation_step(self, batch, batch_idx):
    ...
    self.log('val_loss', eval_loss.view(1,).item(), prog_bar=True)
    self.log('val_ci', eval_cindex, prog_bar=True)
    return eval_loss

It works quite well until it reaches a specific epoch and throws this error (the batch size and the dataset size are the same):

Epoch 49:  50%|██████████████████                  | 1/2 [00:00<00:00,  2.34it/s, loss=nan, v_num=54, train_loss=3.51, train_ci=0.938, val_loss=3.58, val_ci=0.689]
Traceback (most recent call last):
  File "trainer_main.py", line 45, in <module>
    trainer.fit(model, tcga_dm)
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
    results = self.accelerator_backend.train()
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 54, in train
    results = self.train_or_test()
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 68, in train_or_test
    results = self.trainer.train()
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 485, in train
    self.train_loop.run_training_epoch()
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 565, in run_training_epoch
    self.trainer.logger_connector.log_train_step_metrics(batch_output)
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector.py", line 536, in log_train_step_metrics
    self.log_metrics(metrics, grad_norm_dic)
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector.py", line 79, in log_metrics
    metrics.update(grad_norm_dic)
TypeError: 'NoneType' object is not iterable
Exception ignored in: <function tqdm.__del__ at 0x7f34fe50eca0>
Traceback (most recent call last):
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/tqdm/std.py", line 1122, in __del__
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/tqdm/std.py", line 1335, in close
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/tqdm/std.py", line 1514, in display
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/tqdm/std.py", line 1125, in __repr__
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/tqdm/std.py", line 1475, in format_dict
TypeError: cannot unpack non-iterable NoneType object

I checked that loss and eval_loss are not None. It seems the error happens right after the return loss.
I also think documentation is needed that specifies how to log and return values in training_step and validation_step. This was a bit confusing for me.
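
The `TypeError` at the bottom of the first traceback is what Python raises when `dict.update` receives `None`, which suggests that `grad_norm_dic` was `None` at that point. Below is a minimal reproduction in plain Python, followed by a defensive variant. `merge_metrics` is a hypothetical helper written for illustration; it is not the fix that was applied in Lightning.

```python
from typing import Dict, Optional


def merge_metrics(metrics: Dict[str, float],
                  grad_norm_dic: Optional[Dict[str, float]]) -> Dict[str, float]:
    """Merge gradient-norm entries into metrics, treating None as empty."""
    merged = dict(metrics)              # copy so the caller's dict is untouched
    merged.update(grad_norm_dic or {})  # None falls back to an empty dict
    return merged


# Reproduce the failure seen in the traceback: dict.update(None) raises.
try:
    {"train_loss": 3.51}.update(None)
except TypeError as err:
    message = str(err)  # "'NoneType' object is not iterable"

# The guarded version tolerates a missing gradient-norm dictionary.
safe = merge_metrics({"train_loss": 3.51}, None)
```

Guarding against `None` at the merge site is only one option; making sure the value is always a dictionary before it reaches the logger would work as well.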

@asalimih

asalimih commented Nov 4, 2020

After some debugging, I solved the error by changing training_step to this:

def training_step(self, batch, batch_idx):
    ...
    self.log('loss', loss.view(1,).item(), prog_bar=True, logger=True)
    self.log('train_ci', train_cindex, prog_bar=True, logger=True)

However, even though I set logger=True, loss and train_ci aren't logged in TensorBoard, and I can't see them in the progress bar either.
@SeanNaren #4439 didn't solve this problem.

@denadai2
Author

denadai2 commented Nov 7, 2020

@asalimih try returning the loss at the end of training_step. This temporarily solves the bug I pointed out.

@asalimih

asalimih commented Nov 7, 2020

> @asalimih try returning the loss at the end of training_step. This temporarily solves the bug I pointed out.

@denadai2 the first error I described here occurred with the return loss line in place. After I removed that line, it no longer threw an error, but now the metrics aren't logged.

@SeanNaren
Contributor

We're currently in a deep dive into automatic_optimization=False behaviour after a lot of different bugs have appeared in different edge case situations. Please have a look at #4485

The logging changes will hopefully make logs a little clearer, but in terms of actual functionality for automatic_optimization we're in the process of debugging.

@asalimih could you reproduce the bug with this? https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/bug_report_model.py

@denadai2 just finishing the final logging refactor here: #4552

Once we figure out some of the functionality issues with automatic_optimization, I'll circle back here.

@edenlightning
Contributor

@tchaton

@tchaton tchaton self-assigned this Nov 10, 2020
@edenlightning edenlightning modified the milestones: 1.0.x, 1.0.7 Nov 10, 2020
@Borda Borda modified the milestones: 1.0.7, 1.0.x Nov 11, 2020
@edenlightning edenlightning modified the milestones: 1.0.x, 1.0.7 Nov 13, 2020
@tchaton tchaton closed this as completed Nov 13, 2020
@Borda Borda modified the milestones: 1.0.7, 1.0.x Nov 13, 2020

6 participants