This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
Description
This PR addresses a problem noticed while reviewing issue #16572 and should fix that issue as well, as will be verified by @haojin2 (thanks!).
Recent PR #16391 introduced a cudaEvent to solve a race condition in the cuDNN implementation of RNNOp under some conditions. If the MXNet framework was compiled with CUDA/cuDNN support, this cudaEvent would be created in all scenarios, including for non-GPU RNNOps and on systems with no GPU present. However, the cudaEventCreateWithFlags() call cannot be made on a system with no GPU.
This PR makes the cudaEventCreateWithFlags() call lazy: the event is created only when it is first used (and therefore necessarily on a system with a GPU). Further, the thread that creates the event will have its GPU context set properly for any later calls to cudaEventRecord(). In a multi-GPU setting, the main Python thread likely had the context set improperly for later use of the event on an arbitrary GPU, which would explain the reported issue.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments