Race Conditions on Pascal GPUs? #4858

amithr1 · 2017-02-02T12:00:44Z

Hi All,

I think I may be running into race conditions with Pascal GPUs. I hit this while running simple CIFAR tests: the test hangs once in a while. When I enable NaiveEngine, it always passes (in the runs tested so far). With any of the threaded engines, however, it hangs. Are there any races in the threaded implementations? This only happens with Pascal GPUs.
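For reference, a minimal sketch of how the engine can be switched for a run, assuming MXNet's `MXNET_ENGINE_TYPE` environment variable is used to select it. The helper names `engine_env` and `run_with_engine` are hypothetical and only for illustration; the variable has to be set before MXNet starts, so the sketch launches the training script in a child process.

```python
import os
import subprocess
import sys

# Engine names accepted by MXNET_ENGINE_TYPE (assumed from the MXNet env var docs).
ENGINE_TYPES = ("NaiveEngine", "ThreadedEngine", "ThreadedEnginePerDevice")


def engine_env(engine_type, base_env=None):
    """Return a copy of the environment with MXNET_ENGINE_TYPE set.

    Hypothetical helper: it only builds the environment, it does not import MXNet.
    """
    if engine_type not in ENGINE_TYPES:
        raise ValueError("unknown engine type: %r" % (engine_type,))
    env = dict(os.environ if base_env is None else base_env)
    env["MXNET_ENGINE_TYPE"] = engine_type
    return env


def run_with_engine(script_args, engine_type):
    """Run a Python script in a child process with the chosen engine.

    Returns the child's exit code. A hang shows up as this call never returning,
    so a timeout could be added via subprocess.run(..., timeout=...).
    """
    cmd = [sys.executable] + list(script_args)
    return subprocess.run(cmd, env=engine_env(engine_type)).returncode


# Example (not executed here): compare the serial and threaded engines.
# run_with_engine(["train_cifar10.py"], "NaiveEngine")
# run_with_engine(["train_cifar10.py"], "ThreadedEnginePerDevice")
```

Running the same script under both settings is one way to check whether the hang is tied to the threaded engines.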

piiswrong · 2017-02-02T18:06:11Z

Which example exactly?

amithr1 · 2017-02-02T18:16:08Z

I tried train_cifar.py and also train_imagenet.py. It doesn't matter which one; both behave identically. Is there any flag I can enable to help figure out what may be going on?

piiswrong · 2017-02-02T18:34:11Z

@ap-hynninen Have you seen anything like this?

amithr1 · 2017-02-02T19:06:46Z

One more thing: it happens with the latest CUDA drivers. The old ones were fine. Not sure if the new drivers are exposing any bugs.

ap-hynninen · 2017-02-02T19:11:02Z

I haven't seen this. I run on GTX 1080, Titan X Pascal, and P100 regularly. @amithr1 Did you compile with the Pascal sm 60 flags in config.mk? Which GPUs have you tested? Could you give me your hardware and software details?

amithr1 · 2017-02-02T20:47:33Z

Yes, I changed the sm 60 flags. The new driver is 375.xx.

weiliu89 · 2017-02-09T06:47:15Z

@piiswrong I think I am facing a similar issue. The log looks like the following when I train inception-v3 on ImageNet.

INFO:root:Epoch[0] Batch [1400] Speed: 344.78 samples/sec Train-accuracy=0.007031
INFO:root:Epoch[0] Batch [1420] Speed: 348.57 samples/sec Train-accuracy=0.008594
INFO:root:Epoch[0] Batch [1440] Speed: 86.58 samples/sec Train-accuracy=0.006250
INFO:root:Epoch[0] Batch [1460] Speed: 33.57 samples/sec Train-accuracy=0.008984
INFO:root:Epoch[0] Batch [1480] Speed: 33.34 samples/sec Train-accuracy=0.009375
INFO:root:Epoch[0] Batch [1500] Speed: 33.14 samples/sec Train-accuracy=0.008984
INFO:root:Epoch[0] Batch [1520] Speed: 34.37 samples/sec Train-accuracy=0.005859
INFO:root:Epoch[0] Batch [1540] Speed: 32.44 samples/sec Train-accuracy=0.005469
INFO:root:Epoch[0] Batch [1560] Speed: 34.14 samples/sec Train-accuracy=0.007031
INFO:root:Epoch[0] Batch [1580] Speed: 32.78 samples/sec Train-accuracy=0.006250
INFO:root:Epoch[0] Batch [1600] Speed: 33.75 samples/sec Train-accuracy=0.008203
INFO:root:Epoch[0] Batch [1620] Speed: 34.04 samples/sec Train-accuracy=0.006641
INFO:root:Epoch[0] Batch [1640] Speed: 34.34 samples/sec Train-accuracy=0.008594
INFO:root:Epoch[0] Batch [1660] Speed: 33.63 samples/sec Train-accuracy=0.006641
INFO:root:Epoch[0] Batch [1680] Speed: 33.62 samples/sec Train-accuracy=0.006250
INFO:root:Epoch[0] Batch [1700] Speed: 35.33 samples/sec Train-accuracy=0.007422
INFO:root:Epoch[0] Batch [1720] Speed: 33.46 samples/sec Train-accuracy=0.009766
INFO:root:Epoch[0] Batch [1740] Speed: 34.80 samples/sec Train-accuracy=0.010937
INFO:root:Epoch[0] Batch [1760] Speed: 120.96 samples/sec Train-accuracy=0.012109
INFO:root:Epoch[0] Batch [1780] Speed: 366.19 samples/sec Train-accuracy=0.009375
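
A small sketch for picking the slow stretch out of a log like the one above. The regular expression assumes the line format shown here, and the 50% of peak threshold is an arbitrary assumption, not something MXNet defines; the function names are hypothetical.

```python
import re

# Matches lines such as:
# INFO:root:Epoch[0] Batch [1440] Speed: 86.58 samples/sec Train-accuracy=0.006250
LINE_RE = re.compile(
    r"Epoch\[(\d+)\]\s+Batch\s+\[(\d+)\]\s+Speed:\s+([\d.]+)\s+samples/sec"
)


def parse_speeds(log_text):
    """Return a list of (epoch, batch, samples_per_sec) tuples found in the log."""
    return [
        (int(m.group(1)), int(m.group(2)), float(m.group(3)))
        for m in LINE_RE.finditer(log_text)
    ]


def slow_batches(log_text, ratio=0.5):
    """Return the records whose speed is below `ratio` times the peak speed.

    The peak is used as the baseline rather than the median because, in a log
    like the one above, most of the lines fall inside the slow stretch.
    """
    records = parse_speeds(log_text)
    if not records:
        return []
    peak = max(speed for _, _, speed in records)
    return [rec for rec in records if rec[2] < ratio * peak]
```

Applied to the log above, this flags the batches from 1440 through 1760, which makes it easier to line the slowdown up with CPU or disk activity over the same period.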

My CPU is an i7 5930. Is that enough to feed data to 4 Titan X Pascal GPUs?

@amithr1 How do you enable NaiveEngine?

ap-hynninen · 2017-02-09T14:01:48Z

@ptrendx

szha · 2017-09-28T21:54:45Z

This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks!

pakmarkthub · 2018-03-15T20:13:29Z

Have you already found a workaround or a solution?

I encountered a similar problem when running train_cifar10.py on four P100 GPUs. I found that compiling with DEBUG=1 also produces a similar problem; it is not limited to using NaiveEngine. I have created a ticket here: #10123.

sandeep-krishnamurthy · 2018-05-23T23:10:02Z

Closing this issue in favor of #10123.

szha closed this as completed Sep 28, 2017

marcoabreu reopened this Mar 16, 2018

sandeep-krishnamurthy closed this as completed May 23, 2018
