Async A3C Network Outputs NaN #50
Comments
Thanks for all the detail - I'm pretty sure this is a numerical instability issue that can occur with a softmax output. @lake4790k I'm guessing the fix on line 88 can also be added after line 53?
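For context, a minimal sketch of the sampling step and the epsilon guard being discussed (the network, variable names, and `TINY_EPSILON` value are illustrative, not the repository's exact code):

```lua
-- Hypothetical sketch of the sampling step and the epsilon guard discussed
-- above; this is not the exact code in A3CAgent.lua.
require 'torch'
require 'nn'

local TINY_EPSILON = 1e-20

-- Stand-in policy head: any network ending in a softmax over actions.
local numActions = 4
local policyNet = nn.Sequential()
  :add(nn.Linear(10, numActions))
  :add(nn.SoftMax())

local state = torch.rand(10)

-- Forward pass through the policy network (softmax output).
local probability = policyNet:forward(state)

-- Guard against numerical instability: a softmax can underflow to all zeros,
-- which makes torch.multinomial fail with "sum of probabilities <= 0".
probability:add(TINY_EPSILON)

-- Sample an action index from the (now strictly positive) distribution.
local action = torch.multinomial(probability, 1):squeeze()
print(action)
```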
Adding the small epsilon after line 53 does not solve the issue. This looks like it might be a race condition -- setting the number of threads to 1 eliminates the problem entirely (though of course that defeats the whole purpose of A3C). The problem only occurs when two or more threads are being used.
This sounds like the network params are getting infected with NaNs, as has happened before. There were two causes of this previously, but both have been fixed (the proper sharedRmsProp logic and the tiny epsilon). It is strange that I had A3C running for a long time with the current code and didn't get this, while you hit it quickly. It could be something different with the environment. Maybe check whether OpenBLAS threading is interfering: if you run only one A3C thread, only one CPU core should be busy. Btw, do you have OpenBLAS properly installed and working with Torch? I still don't have my proper machines set up to run experiments and don't want to torture my laptop for too long with this (after a short run I don't see it), but I will finally put together my 4790k and try to reproduce this!
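One way to check this from within Torch (a sketch, assuming OpenBLAS was built with OpenMP support, which is what `torch.setnumthreads` controls):

```lua
-- Sketch: inspect and pin the number of OpenMP/BLAS threads Torch uses.
-- With a single A3C worker, only one CPU core should be busy; if several
-- cores are loaded, BLAS threading may be interfering with the workers.
require 'torch'

print('Torch/OpenMP threads: ' .. torch.getnumthreads())

-- Pin BLAS/OpenMP to one thread (assumption: each A3C worker is meant to
-- run on its own single thread).
torch.setnumthreads(1)
print('Torch/OpenMP threads after pinning: ' .. torch.getnumthreads())
```

If OpenBLAS was built without OpenMP, its threading is controlled by the `OPENBLAS_NUM_THREADS` environment variable instead.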
That did the trick -- either my version of OpenBLAS was out of date or Torch wasn't recognizing it fully. A fresh install of the current version of OpenBLAS solved the issue. Thanks for the insights!
Fresh Torch7 install here on Linux Mint 17 (not using CUDA). I can run all of the demo examples (demo, demo-grid, demo-async, and demo-async-a3c) without issue. Regular DQN and async-nstep also run without issue on Montezuma's Revenge. However, when running async-a3c, I get the error

`bad argument #2 to '?' (invalid multinomial distribution (sum of probabilities <= 0) at <torchPath>/lib/TH/generic/THTensorRandom.c:120)`

shortly after training begins. The problem occurs at A3CAgent.lua, line 54 -- my own print statements have confirmed that the outputs of the network (`probability`, obtained on the previous line) are all NaN. Adding NaN checks in Model.lua showed that NaNs are being found in the nn.SpatialConvolution 64x64 layer after only a few iterations of training. The problem occurs intermittently (you may need to run it several times before getting the error). Neither an update nor a complete reinstall of Torch solved the issue. I have verified that the inputs to the network (passed into A3CAgent.lua, line 54 as `state`) are between 0 and 1, and it does not appear that any of the training gradients in `A3CAgent:accumulateGradients()` are producing Inf or NaN. The issue also occurs when running on a Red Hat cluster GPU. Any thoughts?
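For anyone debugging something similar, the NaN/Inf checks described above can be approximated with a small helper along these lines (hypothetical code, not the checks actually used; the network shape is illustrative, and it relies on NaN being the only value not equal to itself):

```lua
-- Hypothetical NaN/Inf checks (not the exact checks used above).
-- NaN is the only value for which x ~= x, so t:ne(t) marks NaN entries.
require 'torch'
require 'nn'

local function hasNaN(t)
  return t:ne(t):sum() > 0
end

local function hasInf(t)
  return t:eq(math.huge):sum() > 0 or t:eq(-math.huge):sum() > 0
end

-- Example: scan every layer's last output after a forward pass.
-- (Illustrative DQN-like convolutional stack; assumes `net` is an nn
-- container such as nn.Sequential, whose modules expose `output`.)
local net = nn.Sequential()
  :add(nn.SpatialConvolution(4, 32, 8, 8, 4, 4))
  :add(nn.ReLU())

net:forward(torch.rand(4, 84, 84))

for i, m in ipairs(net.modules) do
  if torch.isTensor(m.output) and (hasNaN(m.output) or hasInf(m.output)) then
    print('NaN/Inf detected in layer ' .. i .. ': ' .. tostring(m))
  end
end
```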