Encountered NaN under the ConditionalDecoderVISEM setting #1

Open
canerozer opened this issue Jan 3, 2021 · 3 comments
@canerozer

Hello,

Thanks for open-sourcing this beautiful project. I am currently trying to replicate the results in Table 1 of the paper; however, while training the Conditional model, I got an error stating that a NaN loss value was encountered, after training the model for 222 epochs. Just to mention, I performed a clean installation of the necessary libraries and used the Morpho-MNIST data creation script that you provided. Could there be anything wrong with the calculation of the ELBO term for p(intensity), since that is the only metric that went to NaN?

Steps to reproduce the behavior:

python -m deepscm.experiments.morphomnist.trainer -e SVIExperiment -m ConditionalDecoderVISEM --data_dir data/morphomnist/ --default_root_dir checkpoints/ --decoder_type fixed_var --gpus 0

Full error message:

Epoch 222: 6%|▋ | 15/251 [00:01<00:28, 8.31it/s, loss=952243.375, v_num=2]/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pyro/infer/tracegraph_elbo.py:261: UserWarning: Encountered NaN: loss
warn_if_nan(loss, "loss")
Traceback (most recent call last):
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ilkay/Documents/caner/deepscm/deepscm/experiments/morphomnist/trainer.py", line 62, in
trainer.fit(experiment)
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 859, in fit
self.single_gpu_train(model)
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 503, in single_gpu_train
self.run_pretrain_routine(model)
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1015, in run_pretrain_routine
self.train()
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 347, in train
self.run_training_epoch()
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 419, in run_training_epoch
_outputs = self.run_training_batch(batch, batch_idx)
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 597, in run_training_batch
loss, batch_output = optimizer_closure()
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 561, in optimizer_closure
output_dict = self.training_forward(split_batch, batch_idx, opt_idx, self.hiddens)
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 727, in training_forward
output = self.model.training_step(*args)
File "/home/ilkay/Documents/caner/deepscm/deepscm/experiments/morphomnist/sem_vi/base_sem_experiment.py", line 385, in training_step
raise ValueError('loss went to nan with metrics:\n{}'.format(metrics))
ValueError: loss went to nan with metrics:
{'log p(x)': tensor(-3502.9570, device='cuda:0', grad_fn=), 'log p(intensity)': tensor(nan, device='cuda:0', grad_fn=), 'log p(thickness)': tensor(-0.9457, device='cuda:0', grad_fn=), 'p(z)': tensor(-22.2051, device='cuda:0', grad_fn=), 'q(z)': tensor(54.3670, device='cuda:0', grad_fn=), 'log p(z) - log q(z)': tensor(-76.5721, device='cuda:0', grad_fn=)}
Exception ignored in: <function tqdm.__del__ at 0x7ffb7e9d2320>
Traceback (most recent call last):
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1135, in del
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1282, in close
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1467, in display
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1138, in repr
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1425, in format_dict
TypeError: cannot unpack non-iterable NoneType object

Environment

PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Quadro RTX 6000
Nvidia driver version: 440.100
cuDNN version: /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] pytorch-lightning==0.7.6
[pip3] torch==1.7.1
[pip3] torchvision==0.6.0a0+35d732a
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] mkl 2020.1 217
[conda] mkl-service 2.3.0 py37he904b0f_0
[conda] mkl_fft 1.1.0 py37h23d657b_0
[conda] mkl_random 1.1.1 py37h0573a6f_0
[conda] numpy 1.19.4 pypi_0 pypi
[conda] pytorch-lightning 0.7.6 pypi_0 pypi
[conda] torch 1.7.1 pypi_0 pypi
[conda] torchvision 0.6.1 py37_cu102 pytorch

@pawni
Collaborator

pawni commented Jan 4, 2021

Thanks for your interest in our paper and the code. I've just gotten around to starting to update the dependencies, as it seems you're using a newer PyTorch version than the one we were using.

I'll get back to you once I've updated everything and have been able to run the experiments with the new versions.

In the meantime, are you using the fixed pyro version? You can also experiment with different (lower) learning rates via --pgm_lr 1e-3 and --lr 5e-5, or get more detailed error messages by using the --validate flag.
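For reference, that would look something like the command above with the lower learning rates appended (the exact values may still need tuning for your setup):

python -m deepscm.experiments.morphomnist.trainer -e SVIExperiment -m ConditionalDecoderVISEM --data_dir data/morphomnist/ --default_root_dir checkpoints/ --decoder_type fixed_var --gpus 0 --pgm_lr 1e-3 --lr 5e-5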

However, it seems that only 'log p(intensity)': tensor(nan, device='cuda:0', grad_fn=) went to NaN. That's something we observed occasionally as well, and lowering the pgm_lr usually helped :)

@canerozer
Author

canerozer commented Jan 10, 2021

Hello,

Thanks for your suggestion; the model trained successfully after reducing the PGM learning rate. Just to make sure I understand correctly: the pgm_lr hyperparameter affects only the spline layer of the thickness flow and the affine transformation layer of the intensity flow, right? I will take a look at that paper as soon as possible.

Meanwhile, I have tried to train the normalizing flow models (all three settings), but they still tend to go to NaN after only a couple of epochs with a learning rate of 10^-4. I am now trying a learning rate of 10^-5, but I don't know whether that will solve the problem.

By the way, I was using the pyro version you suggested (1.3.1+4b2752f8), but I will update the repository right after that training attempt completes.

Edit: Training still fails for the normalizing flow experiments.

@pawni
Collaborator

pawni commented Jan 11, 2021

Oh wait - so you're training a flow-only model?

It's only included in the code here for completeness, but we also ran into NaNs when training with flows only, which is why we settled on the VI solution.

As for pgm_lr - yes, it only acts on the flows for the covariates, not on the image components.
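As a rough sketch of what that separation amounts to (the module names below are illustrative placeholders, not the actual attributes used in deepscm): the covariate flows get an optimizer driven by --pgm_lr, while the image components get one driven by --lr.

import torch
from torch import nn

# Hypothetical stand-ins for the covariate flows and the image network;
# the real deepscm modules have different names and architectures.
thickness_spline_flow = nn.Linear(1, 1)   # placeholder for the thickness spline transform
intensity_affine_flow = nn.Linear(2, 1)   # placeholder for the intensity affine transform
image_decoder = nn.Linear(16, 28 * 28)    # placeholder for the image decoder

# --pgm_lr only acts on the covariate flows ...
pgm_optimizer = torch.optim.Adam(
    list(thickness_spline_flow.parameters()) + list(intensity_affine_flow.parameters()),
    lr=1e-3,  # --pgm_lr
)

# ... while --lr only acts on the image components.
img_optimizer = torch.optim.Adam(image_decoder.parameters(), lr=5e-5)  # --lr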
