This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Race Conditions on Pascal GPUs? #4858

Closed
amithr1 opened this issue Feb 2, 2017 · 11 comments

Comments

@amithr1

amithr1 commented Feb 2, 2017

Hi All,

I think I may be running into a race condition on Pascal GPUs. I hit it while running simple CIFAR tests: the test hangs once in a while. When I enable NaiveEngine, it always passes (in the runs tested so far), but with any of the threaded engines it hangs. Are there any known races in the threaded engine implementations? This only happens on Pascal GPUs.
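
For anyone trying to reproduce this, a minimal sketch of comparing engines is shown below. It relies on MXNet's `MXNET_ENGINE_TYPE` environment variable, which selects the engine at startup. The helper names `python_cmd` and `run_with_engine` are made up for illustration, and treating a timeout as a hang is an assumption: a slow but healthy run would be flagged too.

```python
import os
import subprocess
import sys

# Engine names accepted by MXNet's MXNET_ENGINE_TYPE environment variable.
ENGINES = ("NaiveEngine", "ThreadedEngine", "ThreadedEnginePerDevice")


def python_cmd(script, *args):
    """Build a command that runs `script` with the current Python interpreter."""
    return [sys.executable, script] + list(args)


def run_with_engine(cmd, engine, timeout_s):
    """Run cmd with MXNET_ENGINE_TYPE=engine; return 'ok', 'failed', or 'hang'."""
    if engine not in ENGINES:
        raise ValueError("unknown engine: %s" % engine)
    # Copy the current environment and override only the engine selection.
    env = dict(os.environ, MXNET_ENGINE_TYPE=engine)
    try:
        proc = subprocess.run(cmd, env=env, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The child did not exit in time (assumption: this indicates a hang).
        return "hang"
    return "ok" if proc.returncode == 0 else "failed"
```

For example, `run_with_engine(python_cmd("train_cifar.py"), "ThreadedEnginePerDevice", timeout_s=3600)` would return `"hang"` if the script has not exited after an hour; the script arguments and the time limit here are placeholders.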

@piiswrong
Contributor

Which example exactly?

@amithr1
Author

amithr1 commented Feb 2, 2017

I tried train_cifar.py and also train_imagenet.py. It doesn't matter which one; both behave identically. Is there any flag I can enable to help figure out what is going on?

@piiswrong
Contributor

@ap-hynninen Have you seen anything like this?

@amithr1
Author

amithr1 commented Feb 2, 2017

One more thing: it happens with the latest CUDA drivers; the old ones were fine. I'm not sure whether the new drivers are exposing a bug.

@ap-hynninen
Contributor

I haven't seen this. I regularly run on GTX 1080, Titan X Pascal, and P100. @amithr1 Did you compile with the Pascal sm 60 flags in config.mk? Which GPUs have you tested? Could you give me your hardware and software details?
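
As background for the flags question, here is a small sketch of the nvcc `-gencode` flags that match the GPUs named above. The P100 is compute capability 6.0, while the GTX 1080 and Titan X Pascal are 6.1. The helper `gencode_flags` is hypothetical, and how such flags are wired into config.mk is not shown in this thread.

```python
# Compute capability (without the dot) of the Pascal GPUs named in this thread.
PASCAL_COMPUTE_CAPABILITY = {
    "P100": "60",
    "GTX 1080": "61",
    "Titan X Pascal": "61",
}


def gencode_flags(gpus):
    """Return nvcc -gencode flags covering the given GPU names, without duplicates."""
    archs = sorted({PASCAL_COMPUTE_CAPABILITY[g] for g in gpus})
    return ["-gencode arch=compute_{0},code=sm_{0}".format(a) for a in archs]
```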

@amithr1
Author

amithr1 commented Feb 2, 2017

Yes, I changed the sm 60 flags. The new driver is 375.xx.

@weiliu89

weiliu89 commented Feb 9, 2017

@piiswrong I think I am facing a similar issue. The log looks like the following when I train inception-v3 on ImageNet.

INFO:root:Epoch[0] Batch [1400] Speed: 344.78 samples/sec Train-accuracy=0.007031
INFO:root:Epoch[0] Batch [1420] Speed: 348.57 samples/sec Train-accuracy=0.008594
INFO:root:Epoch[0] Batch [1440] Speed: 86.58 samples/sec Train-accuracy=0.006250
INFO:root:Epoch[0] Batch [1460] Speed: 33.57 samples/sec Train-accuracy=0.008984
INFO:root:Epoch[0] Batch [1480] Speed: 33.34 samples/sec Train-accuracy=0.009375
INFO:root:Epoch[0] Batch [1500] Speed: 33.14 samples/sec Train-accuracy=0.008984
INFO:root:Epoch[0] Batch [1520] Speed: 34.37 samples/sec Train-accuracy=0.005859
INFO:root:Epoch[0] Batch [1540] Speed: 32.44 samples/sec Train-accuracy=0.005469
INFO:root:Epoch[0] Batch [1560] Speed: 34.14 samples/sec Train-accuracy=0.007031
INFO:root:Epoch[0] Batch [1580] Speed: 32.78 samples/sec Train-accuracy=0.006250
INFO:root:Epoch[0] Batch [1600] Speed: 33.75 samples/sec Train-accuracy=0.008203
INFO:root:Epoch[0] Batch [1620] Speed: 34.04 samples/sec Train-accuracy=0.006641
INFO:root:Epoch[0] Batch [1640] Speed: 34.34 samples/sec Train-accuracy=0.008594
INFO:root:Epoch[0] Batch [1660] Speed: 33.63 samples/sec Train-accuracy=0.006641
INFO:root:Epoch[0] Batch [1680] Speed: 33.62 samples/sec Train-accuracy=0.006250
INFO:root:Epoch[0] Batch [1700] Speed: 35.33 samples/sec Train-accuracy=0.007422
INFO:root:Epoch[0] Batch [1720] Speed: 33.46 samples/sec Train-accuracy=0.009766
INFO:root:Epoch[0] Batch [1740] Speed: 34.80 samples/sec Train-accuracy=0.010937
INFO:root:Epoch[0] Batch [1760] Speed: 120.96 samples/sec Train-accuracy=0.012109
INFO:root:Epoch[0] Batch [1780] Speed: 366.19 samples/sec Train-accuracy=0.009375

My CPU is an i7 5930. Is it enough to feed data to 4 Titan X Pascal GPUs?
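
One way to pin down where the slowdown in the log above starts and ends is to parse the speed lines and flag batches that fall well below the fastest observed rate. This is only a sketch: the function names and the 50% threshold are arbitrary choices for illustration, and it locates the dip without saying whether the CPU, the disk, or the GPUs caused it.

```python
import re

# Matches lines such as:
# INFO:root:Epoch[0] Batch [1440] Speed: 86.58 samples/sec Train-accuracy=0.006250
LINE_RE = re.compile(r"Epoch\[(\d+)\] Batch \[(\d+)\]\s+Speed: ([\d.]+) samples/sec")


def parse_speeds(lines):
    """Return (epoch, batch, speed) tuples for every matching log line."""
    records = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            records.append((int(m.group(1)), int(m.group(2)), float(m.group(3))))
    return records


def slow_batches(records, fraction=0.5):
    """Return records whose speed is below `fraction` of the fastest speed seen."""
    if not records:
        return []
    threshold = fraction * max(speed for _, _, speed in records)
    return [r for r in records if r[2] < threshold]
```

Feeding it the log lines from a file, for example `slow_batches(parse_speeds(open("train.log")))` with a hypothetical `train.log`, lists the batches inside the slow window.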

@amithr1 How do you enable NaiveEngine?

@ap-hynninen
Contributor

@ptrendx

@szha
Member

szha commented Sep 28, 2017

This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks!

@szha szha closed this as completed Sep 28, 2017
@pakmarkthub

Have you already found a workaround or a solution?

I encountered a similar problem when running train_cifar10.py on four P100 GPUs. I found that compiling with DEBUG=1 also causes a similar problem; it is not limited to using NaiveEngine. I have opened a ticket here: #10123.

@marcoabreu marcoabreu reopened this Mar 16, 2018
@sandeep-krishnamurthy
Contributor

Closing this issue in favor of #10123.
