Encountered NaN under the ConditionalDecoderVISEM setting #1
Comments
Thanks for your interest in our paper and the code. I've just gotten around to updating the dependencies, as it seems you're using at least a newer PyTorch version than the one we were using. I'll get back to you once I've updated everything and have been able to run the experiments with the new versions. In the meantime, are you using the pinned Pyro version? You can also experiment with different (lower) learning rates via: […] However, it seems that only […]
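Since the snippet after "via:" did not survive in this thread, here is a minimal, self-contained Pyro sketch (toy model and guide, not the repository's SEM) showing where a lower Adam learning rate would be set:

import pyro
import pyro.distributions as dist
import torch
from pyro.infer import SVI, TraceGraph_ELBO
from pyro.optim import Adam

def model(data):
    # Toy generative model: one latent location, observed Gaussian data.
    loc = pyro.sample("loc", dist.Normal(0., 1.))
    with pyro.plate("data", data.shape[0]):
        pyro.sample("obs", dist.Normal(loc, 1.), obs=data)

def guide(data):
    # Toy variational guide with a single learnable parameter.
    loc_q = pyro.param("loc_q", torch.tensor(0.))
    pyro.sample("loc", dist.Normal(loc_q, 1.))

# Lower learning rate (e.g. 1e-5 instead of 1e-4); values are illustrative only.
optimizer = Adam({"lr": 1e-5})
svi = SVI(model, guide, optimizer, loss=TraceGraph_ELBO())
loss = svi.step(torch.randn(32))  # one gradient update at the lower rate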
Hello, thanks for your suggestion. I see that the model trains successfully after reducing the PGM learning rate. Just to understand it clearly, the […] Meanwhile, I have tried to train the Normalizing Flow models (all 3 settings), but I noticed that they still tend to go to NaN after a couple of epochs with a learning rate of 10^-4. I am now trying a learning rate of 10^-5, but I don't know whether that will solve the problem. BTW, I was using the Pyro version you suggested (1.3.1+4b2752f8), but I will update the repository right after completing that training attempt. Edit: still failing for the normalizing flow experiments.
Oh wait, so you're training a flow-only model? It's only included in the code here for completeness; we also ran into NaNs when running with flows only, which is why we settled on the VI solution. As for […]
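A general mitigation worth trying when flow-only training blows up to NaN (not a fix confirmed in this thread) is gradient clipping, which PyTorch Lightning exposes on the Trainer. A minimal sketch, where `experiment` stands in for the Lightning module being trained:

from pytorch_lightning import Trainer

# Clip the global gradient norm to 1.0 before each optimizer step;
# the clip value and the single-GPU setting are illustrative only.
trainer = Trainer(gpus=1, gradient_clip_val=1.0)
trainer.fit(experiment)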
Hello,
Thanks for open-sourcing this beautiful project. I am currently trying to replicate the results in Table 1 of the paper. However, while training the Conditional model, I ran into the error message below, which states that a NaN loss value was encountered after training the model for 222 epochs. Just to mention, I performed a clean installation of the necessary libraries and used the Morpho-MNIST data-creation script you provided. Could there be something wrong with the calculation of the ELBO term for p(intensity), since that is the only metric that has gone to NaN?
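One way to narrow down where the NaN first appears is to inspect per-site log-probabilities with Pyro's poutine trace utilities. A hedged sketch follows; `model` and its arguments are placeholders for the experiment's generative model and its inputs, not the repository's API:

import torch
import pyro.poutine as poutine

def find_nan_sites(model, *model_args, **model_kwargs):
    # Record an execution trace of the model and evaluate each site's log-prob.
    trace = poutine.trace(model).get_trace(*model_args, **model_kwargs)
    trace.compute_log_prob()
    nan_sites = []
    for name, site in trace.nodes.items():
        if site["type"] == "sample" and torch.isnan(site["log_prob"]).any():
            nan_sites.append(name)  # e.g. this could flag an 'intensity' site
    return nan_sites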
Steps to reproduce the behavior:
python -m deepscm.experiments.morphomnist.trainer -e SVIExperiment -m ConditionalDecoderVISEM --data_dir data/morphomnist/ --default_root_dir checkpoints/ --decoder_type fixed_var --gpus 0
Full error message:
Epoch 222: 6%|▋ | 15/251 [00:01<00:28, 8.31it/s, loss=952243.375, v_num=2]
/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pyro/infer/tracegraph_elbo.py:261: UserWarning: Encountered NaN: loss
warn_if_nan(loss, "loss")
Traceback (most recent call last):
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ilkay/Documents/caner/deepscm/deepscm/experiments/morphomnist/trainer.py", line 62, in
trainer.fit(experiment)
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 859, in fit
self.single_gpu_train(model)
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 503, in single_gpu_train
self.run_pretrain_routine(model)
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1015, in run_pretrain_routine
self.train()
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 347, in train
self.run_training_epoch()
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 419, in run_training_epoch
_outputs = self.run_training_batch(batch, batch_idx)
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 597, in run_training_batch
loss, batch_output = optimizer_closure()
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 561, in optimizer_closure
output_dict = self.training_forward(split_batch, batch_idx, opt_idx, self.hiddens)
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 727, in training_forward
output = self.model.training_step(*args)
File "/home/ilkay/Documents/caner/deepscm/deepscm/experiments/morphomnist/sem_vi/base_sem_experiment.py", line 385, in training_step
raise ValueError('loss went to nan with metrics:\n{}'.format(metrics))
ValueError: loss went to nan with metrics:
{'log p(x)': tensor(-3502.9570, device='cuda:0', grad_fn=), 'log p(intensity)': tensor(nan, device='cuda:0', grad_fn=), 'log p(thickness)': tensor(-0.9457, device='cuda:0', grad_fn=), 'p(z)': tensor(-22.2051, device='cuda:0', grad_fn=), 'q(z)': tensor(54.3670, device='cuda:0', grad_fn=), 'log p(z) - log q(z)': tensor(-76.5721, device='cuda:0', grad_fn=)}
Exception ignored in: <function tqdm.__del__ at 0x7ffb7e9d2320>
Traceback (most recent call last):
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1135, in del
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1282, in close
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1467, in display
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1138, in repr
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1425, in format_dict
TypeError: cannot unpack non-iterable NoneType object
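For context, the ValueError above comes from an explicit NaN guard inside training_step. A simplified, hypothetical sketch of such a guard (not the repository's actual code; compute_loss is a made-up helper) might look like:

import torch
import pytorch_lightning as pl

class GuardedModule(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        # compute_loss is a hypothetical helper returning (loss, dict of metrics).
        loss, metrics = self.compute_loss(batch)
        if torch.isnan(loss):
            # Fail fast and report the metric values so the offending term is visible.
            raise ValueError('loss went to nan with metrics:\n{}'.format(metrics))
        return {'loss': loss, 'log': metrics}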
Environment
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Quadro RTX 6000
Nvidia driver version: 440.100
cuDNN version: /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] pytorch-lightning==0.7.6
[pip3] torch==1.7.1
[pip3] torchvision==0.6.0a0+35d732a
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] mkl 2020.1 217
[conda] mkl-service 2.3.0 py37he904b0f_0
[conda] mkl_fft 1.1.0 py37h23d657b_0
[conda] mkl_random 1.1.1 py37h0573a6f_0
[conda] numpy 1.19.4 pypi_0 pypi
[conda] pytorch-lightning 0.7.6 pypi_0 pypi
[conda] torch 1.7.1 pypi_0 pypi
[conda] torchvision 0.6.1 py37_cu102 pytorch