
cudaErrorInvalidResourceHandle error when running some RNN models #16572

Closed
haojin2 opened this issue Oct 21, 2019 · 8 comments

Comments

@haojin2
Contributor

haojin2 commented Oct 21, 2019

Both this notebook and this notebook in the Dive into Deep Learning textbook fail with cudaErrorInvalidResourceHandle on line 1505 of src/operator/rnn-inl.h.

@haojin2
Contributor Author

haojin2 commented Oct 21, 2019

I'm using CUDA 10.1 + cuDNN 7.6.4 on 4 V100 GPUs (p3.8xlarge instances). Reverting #16391 brings both notebooks back to normal.

@haojin2
Contributor Author

haojin2 commented Oct 21, 2019

@DickJC123 @ptrendx Could you help with this issue?

@DickJC123
Contributor

I will take a look at this. Thanks for pointing this out.

@DickJC123
Contributor

Glancing over the code now that I know of this issue, I see a problem that may be related. If the code has been compiled for CUDA/cuDNN use, and an RNNOp is then instantiated on a system with no GPU, the code will fail at the call to cudaEventCreateWithFlags() in the constructor. Creating the event lazily on first use (on a GPU) would solve this. I suspect the same approach will also fix the issue with the notebooks that you point out, although I have not verified this.
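
As a rough sketch of that lazy-initialization idea (illustrative only, not the actual MXNet code; the class and member names below are made up):

```cpp
#include <cuda_runtime.h>

// Hypothetical RNNOp-like class: the CUDA event is no longer created in the
// constructor, so instantiating the op on a machine with no GPU never
// touches the CUDA runtime.
class RNNOpSketch {
 public:
  RNNOpSketch() = default;  // no cudaEventCreateWithFlags() here anymore

  void ForwardGPU(cudaStream_t stream) {
    if (!event_created_) {
      // Created lazily, on the device that is current at first GPU use.
      cudaEventCreateWithFlags(&sync_event_, cudaEventDisableTiming);
      event_created_ = true;
    }
    // ... enqueue kernels, then record the event on this GPU's stream ...
    cudaEventRecord(sync_event_, stream);
  }

  ~RNNOpSketch() {
    if (event_created_) cudaEventDestroy(sync_event_);
  }

 private:
  cudaEvent_t sync_event_{};
  bool event_created_{false};
};
```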

Are you blocked by this? It might take me a day or two to make a proper PR with the fix properly verified.

@haojin2
Contributor Author

haojin2 commented Oct 22, 2019

@DickJC123 Actually, the 1.6.0 release could be blocked by this. Please let me know if you need any help with verification/reproduction/fixing so that we can streamline your fix. Thanks!

@haojin2
Contributor Author

haojin2 commented Oct 22, 2019

@DickJC123 BTW I mentioned earlier that I was running with 4 V100 GPUs.

@DickJC123
Contributor

@ptrendx has postulated that the problem involves having the main python thread create the RNNOp (and its held CUDA event) with either no context or a GPU-0 context, and then having the event recorded on a stream of a different GPU. I will be pushing a fix momentarily that delays creating the CUDA event until first use, which should correct this scenario.
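
As a minimal standalone illustration of that postulated failure mode (outside MXNet, and only a sketch of the scenario, not the actual fix):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int n = 0;
  cudaGetDeviceCount(&n);
  if (n < 2) { std::printf("needs at least 2 GPUs\n"); return 0; }

  // Event created while GPU 0 is current (e.g. the main thread's default context).
  cudaEvent_t ev;
  cudaSetDevice(0);
  cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);

  // Stream created on a different GPU (e.g. a worker running on device 1).
  cudaStream_t s1;
  cudaSetDevice(1);
  cudaStreamCreate(&s1);

  // Recording a device-0 event on a device-1 stream is expected to fail,
  // typically with cudaErrorInvalidResourceHandle.
  cudaError_t err = cudaEventRecord(ev, s1);
  std::printf("cudaEventRecord: %s\n", cudaGetErrorString(err));

  cudaStreamDestroy(s1);
  cudaSetDevice(0);
  cudaEventDestroy(ev);
  return 0;
}
```

Deferring the event creation to first use means the event is created under the same device context as the stream it is later recorded on, avoiding the mismatch above.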

I have verified the PR reinstates proper behavior when run on a system with no GPU. I appreciate your offer to see if the PR cures this issue you raised with the notebooks. Thanks!

@haojin2
Contributor Author

haojin2 commented Oct 23, 2019

@DickJC123 Fix merged, now closing this issue.

@haojin2 haojin2 closed this as completed Oct 23, 2019