Async A3C Network Outputs NaN #50
Comments
Thanks for all the detail - I'm pretty sure this is a numerical instability issue that can occur with a softmax output. @lake4790k I'm guessing the fix on line 88 can also be added after line 53?
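For context, a minimal sketch of the sampling step and the epsilon guard being discussed (the network, variable names, and `TINY_EPSILON` value are illustrative, not the repository's exact code):

```lua
-- Hypothetical sketch of the sampling step and the epsilon guard discussed
-- above; this is not the exact code in A3CAgent.lua.
require 'torch'
require 'nn'

local TINY_EPSILON = 1e-20

-- Stand-in policy head: any network ending in a softmax over actions.
local numActions = 4
local policyNet = nn.Sequential()
  :add(nn.Linear(10, numActions))
  :add(nn.SoftMax())

local state = torch.rand(10)

-- Forward pass through the policy network (softmax output).
local probability = policyNet:forward(state)

-- Guard against numerical instability: a softmax can underflow to all zeros,
-- which makes torch.multinomial fail with "sum of probabilities <= 0".
probability:add(TINY_EPSILON)

-- Sample an action index from the (now strictly positive) distribution.
local action = torch.multinomial(probability, 1):squeeze()
print(action)
```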
Adding the small epsilon after line 53 does not solve the issue. This looks like it might be a race condition -- setting the number of threads to 1 eliminates the problem entirely (though of course that defeats the whole purpose of A3C). The problem only occurs when two or more threads are being used.
This sounds like the network params are getting infected with NaNs, as has happened before. There were two causes of this previously, but both have been fixed (the proper sharedRmsProp logic and the tiny epsilon). It is strange that I had A3C running for a long time with the current code and didn't get this, while you hit it quickly. It could be something different with the environment. Maybe check whether OpenBLAS threading is interfering: if you run only one A3C thread, only one CPU core should be busy. Btw, do you have OpenBLAS properly installed and working with Torch? I still don't have my proper machines set up to run experiments and don't want to torture my laptop for too long with this (after a short run I don't see it), but I will finally put together my 4790k and try to reproduce this!
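One way to check this from within Torch (a sketch, assuming OpenBLAS was built with OpenMP support, which is what `torch.setnumthreads` controls):

```lua
-- Sketch: inspect and pin the number of OpenMP/BLAS threads Torch uses.
-- With a single A3C worker, only one CPU core should be busy; if several
-- cores are loaded, BLAS threading may be interfering with the workers.
require 'torch'

print('Torch/OpenMP threads: ' .. torch.getnumthreads())

-- Pin BLAS/OpenMP to one thread (assumption: each A3C worker is meant to
-- run on its own single thread).
torch.setnumthreads(1)
print('Torch/OpenMP threads after pinning: ' .. torch.getnumthreads())
```

If OpenBLAS was built without OpenMP, its threading is controlled by the `OPENBLAS_NUM_THREADS` environment variable instead.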
That did the trick -- either my version of OpenBLAS was out of date or Torch wasn't recognizing it fully. A fresh install of the current version of OpenBLAS solved the issue. Thanks for the insights!
Fresh Torch7 install here on Linux Mint 17 (not using CUDA). I can run all of the demo examples (demo, demo-grid, demo-async, and demo-async-a3c) without issue. Regular DQN and async-nstep also run without issue on Montezuma's Revenge. However, when running async-a3c, I get the error

`bad argument #2 to '?' (invalid multinomial distribution (sum of probabilities <= 0) at <torchPath>/lib/TH/generic/THTensorRandom.c:120)`

shortly after training begins. The problem occurs at A3CAgent.lua, line 54 -- my own print statements have confirmed that the outputs of the network (`probability`, obtained on the previous line) are all NaN. Adding NaN checks in Model.lua showed that NaNs are being found in the nn.SpatialConvolution 64x64 layer after only a few iterations of training. The problem occurs intermittently (you may need to run it several times before getting the error). Neither an update nor a complete reinstall of Torch solved the issue. I have verified that the inputs to the network (passed into A3CAgent.lua, line 54 as `state`) are between 0 and 1, and it does not appear that any of the training gradients in `A3CAgent:accumulateGradients()` are producing Inf or NaN. The issue also occurs when running on a Red Hat cluster GPU. Any thoughts?
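For anyone debugging something similar, the NaN/Inf checks described above can be approximated with a small helper along these lines (hypothetical code, not the checks actually used; the network shape is illustrative, and it relies on NaN being the only value not equal to itself):

```lua
-- Hypothetical NaN/Inf checks (not the exact checks used above).
-- NaN is the only value for which x ~= x, so t:ne(t) marks NaN entries.
require 'torch'
require 'nn'

local function hasNaN(t)
  return t:ne(t):sum() > 0
end

local function hasInf(t)
  return t:eq(math.huge):sum() > 0 or t:eq(-math.huge):sum() > 0
end

-- Example: scan every layer's last output after a forward pass.
-- (Illustrative DQN-like convolutional stack; assumes `net` is an nn
-- container such as nn.Sequential, whose modules expose `output`.)
local net = nn.Sequential()
  :add(nn.SpatialConvolution(4, 32, 8, 8, 4, 4))
  :add(nn.ReLU())

net:forward(torch.rand(4, 84, 84))

for i, m in ipairs(net.modules) do
  if torch.isTensor(m.output) and (hasNaN(m.output) or hasInf(m.output)) then
    print('NaN/Inf detected in layer ' .. i .. ': ' .. tostring(m))
  end
end
```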