
cudaErrorInvalidResourceHandle error when running some RNN models #16572

Closed
haojin2 opened this issue Oct 21, 2019 · 8 comments

Comments

@haojin2
Contributor

haojin2 commented Oct 21, 2019

Both this notebook and this notebook in the Dive into Deep Learning textbook fail with cudaErrorInvalidResourceHandle on line 1505 of src/operator/rnn-inl.h.

@haojin2
Contributor Author

haojin2 commented Oct 21, 2019

I'm using CUDA 10.1 + cuDNN 7.6.4 on 4 V100 GPUs (p3.8xlarge instances). Reverting #16391 brings both notebooks back to normal.

@haojin2
Contributor Author

haojin2 commented Oct 21, 2019

@DickJC123 @ptrendx Could you help with this issue?

@DickJC123
Contributor

I will take a look at this. Thanks for pointing this out.

@DickJC123
Contributor

Glancing over the code now that I know of this issue, I see a problem that may be related. If the code has been compiled for CUDA/cuDNN use, and an RNNOp is then instantiated on a system with no GPU, the code will fail at the call to cudaEventCreateWithFlags() in the constructor. Creating the event lazily on first use (on a GPU) would solve this. I suspect the same approach will also fix the issue with the notebooks that you point out, although I have not verified this.
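
As a rough sketch of that lazy-initialization idea (illustrative only, not the actual MXNet code; the class and member names below are made up):

```cpp
#include <cuda_runtime.h>

// Hypothetical RNNOp-like class: the CUDA event is no longer created in the
// constructor, so instantiating the op on a machine with no GPU never
// touches the CUDA runtime.
class RNNOpSketch {
 public:
  RNNOpSketch() = default;  // no cudaEventCreateWithFlags() here anymore

  void ForwardGPU(cudaStream_t stream) {
    if (!event_created_) {
      // Created lazily, on the device that is current at first GPU use.
      cudaEventCreateWithFlags(&sync_event_, cudaEventDisableTiming);
      event_created_ = true;
    }
    // ... enqueue kernels, then record the event on this GPU's stream ...
    cudaEventRecord(sync_event_, stream);
  }

  ~RNNOpSketch() {
    if (event_created_) cudaEventDestroy(sync_event_);
  }

 private:
  cudaEvent_t sync_event_{};
  bool event_created_{false};
};
```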

Are you blocked by this? It might take me a day or two to make a proper PR with the fix properly verified.

@haojin2
Contributor Author

haojin2 commented Oct 22, 2019

@DickJC123 Actually, the 1.6.0 release could be blocked by this. Please let me know if you need any help with verification/reproduction/fixing so that we can streamline your fix. Thanks!

@haojin2
Contributor Author

haojin2 commented Oct 22, 2019

@DickJC123 BTW I mentioned earlier that I was running with 4 V100 GPUs.

@DickJC123
Contributor

@ptrendx has postulated that the problem involves having the main python thread create the RNNOp (and its held CUDA event) with either no context or a GPU-0 context, and then having the event recorded on a stream of a different GPU. I will be pushing a fix momentarily that delays creating the CUDA event until first use, which should correct this scenario.
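
As a minimal standalone illustration of that postulated failure mode (outside MXNet, and only a sketch of the scenario, not the actual fix):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int n = 0;
  cudaGetDeviceCount(&n);
  if (n < 2) { std::printf("needs at least 2 GPUs\n"); return 0; }

  // Event created while GPU 0 is current (e.g. the main thread's default context).
  cudaEvent_t ev;
  cudaSetDevice(0);
  cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);

  // Stream created on a different GPU (e.g. a worker running on device 1).
  cudaStream_t s1;
  cudaSetDevice(1);
  cudaStreamCreate(&s1);

  // Recording a device-0 event on a device-1 stream is expected to fail,
  // typically with cudaErrorInvalidResourceHandle.
  cudaError_t err = cudaEventRecord(ev, s1);
  std::printf("cudaEventRecord: %s\n", cudaGetErrorString(err));

  cudaStreamDestroy(s1);
  cudaSetDevice(0);
  cudaEventDestroy(ev);
  return 0;
}
```

Deferring the event creation to first use means the event is created under the same device context as the stream it is later recorded on, avoiding the mismatch above.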

I have verified the PR reinstates proper behavior when run on a system with no GPU. I appreciate your offer to see if the PR cures this issue you raised with the notebooks. Thanks!

@haojin2
Contributor Author

haojin2 commented Oct 23, 2019

@DickJC123 Fix merged, now closing this issue.

@haojin2 haojin2 closed this as completed Oct 23, 2019