cudaErrorInvalidResourceHandle error when running some RNN models #16572
Comments
I'm using CUDA 10.1 + cuDNN 7.6.4 on 4 V100 GPUs (p3.8xlarge instances). Reverting #16391 got both notebooks back to normal.
@DickJC123 @ptrendx Could you guys help with this issue here?
I will take a look at this. Thanks for pointing this out.
Glancing over the code now that I know of this issue, I see a problem that may be related. If the code has been compiled for CUDA/cuDNN use, and an RNNOp is then instantiated on a system with no GPU, the code will fail at the call to cudaEventCreateWithFlags() in the constructor. Making this call lazily on first use (on a GPU) would solve this. I suspect the approach will also fix the issues with the notebooks that you point out, although I have not verified this. Are you blocked by this? It might take me a day or two to make a proper PR with the fix properly verified.
@DickJC123 Actually the 1.6.0 release could be blocked by this. Please do lemme know if you need any help on verification/reproduction/fixing so that we could streamline your fix. Thanks!
@DickJC123 BTW I mentioned earlier that I was running with 4 V100 GPUs.
@ptrendx has postulated that the problem involves having the main Python thread create the RNNOp (and the CUDA event it holds) with either no context or a GPU-0 context, and then having the event recorded on a stream of a different GPU. I will be pushing a fix momentarily that delays creating the CUDA event until first use, which should correct this scenario. I have verified that the PR reinstates proper behavior when run on a system with no GPU. I appreciate your offer to see if the PR cures the issue you raised with the notebooks. Thanks!
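For readers following along, the two failure modes and the lazy-creation fix described above can be sketched as a small host-only C++ model. This is an illustration under stated assumptions, not the code from the PR: everything in the `fake_cuda` namespace, as well as `EagerOp` and `LazyOp`, is a hypothetical stand-in invented here so the example runs without a GPU. In real code the corresponding calls would be `cudaEventCreateWithFlags()` and `cudaEventRecord()`, and the model encodes the CUDA rule that recording an event on a stream fails when the two belong to different devices.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins for the CUDA runtime (NOT real CUDA or MXNet APIs),
// so that the sketch compiles and runs on a machine without a GPU.
namespace fake_cuda {
constexpr int kNoDevice = -1;
int current_device = kNoDevice;  // device bound to the calling thread

struct Event  { int device; };   // an event remembers the device it was created on
struct Stream { int device; };

// Stand-in for event creation: fails when no device is available.
Event CreateEvent() {
  if (current_device == kNoDevice) throw std::runtime_error("no CUDA device");
  return Event{current_device};
}

// Stand-in for recording an event: succeeds only when the event and the
// stream belong to the same device.
bool RecordEvent(const Event& e, const Stream& s) { return e.device == s.device; }
}  // namespace fake_cuda

// Eager variant: the event is created in the constructor, on whatever device
// (if any) is current for the constructing thread.
class EagerOp {
 public:
  EagerOp() : event_(fake_cuda::CreateEvent()) {}
  bool Forward(const fake_cuda::Stream& s) {
    return fake_cuda::RecordEvent(event_, s);
  }
 private:
  fake_cuda::Event event_;
};

// Lazy variant: construction never touches the device; the event is created
// on first use, when the worker's device is current.
class LazyOp {
 public:
  LazyOp() = default;
  bool Forward(const fake_cuda::Stream& s) {
    if (!event_created_) {
      event_ = fake_cuda::CreateEvent();
      event_created_ = true;
    }
    return fake_cuda::RecordEvent(event_, s);
  }
 private:
  bool event_created_ = false;
  fake_cuda::Event event_{fake_cuda::kNoDevice};
};
```

In this model, constructing an `EagerOp` with no current device throws, and an `EagerOp` built while device 0 is current fails to record on a stream of device 2. A `LazyOp` can be constructed with no device at all and records successfully once `Forward` is first called with device 2 current.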
@DickJC123 Fix merged, now closing this issue.
Both this notebook and this notebook in the Dive into Deep Learning textbook fail with cudaErrorInvalidResourceHandle on line 1505 of src/operator/rnn-inl.h.