Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

RNNOp to call cudaEventCreate lazily #16584

Merged
merged 2 commits into from
Oct 23, 2019

Conversation

DickJC123
Copy link
Contributor

Description

This PR addresses a problem noticed when reviewing issue #16572, and should fix the issue as well, as will be verified by @haojin2 (thanks!).

Recent PR #16391 introduced a cudaEvent to solve a race condition in the cuDNN implementation of RNNOp under some conditions. If the MXNet framework was compiled to use CUDA/CUDNN, this cudaEvent would be 'created' under all scenarios, including non-GPU RNNOps and on systems with no GPU present. The cudaEventCreateWithFlags() call however cannot be made on systems with no GPU present.

This PR makes the cudaEventCreateWithFlags() lazily, only when the event is first used (so necessarily then on a system with a GPU). Further, the thread that creates the event will have the GPU context set properly for any later calls to cudaEventRecord(). It is likely the case that on a multi-GPU setting, the main Python thread would have the context set improperly for later use of the event on an arbitrary GPU (so the issue mentioned).

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • [X ] Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • [ X] To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

Copy link
Contributor

@haojin2 haojin2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking the PR temporarily due to the pending verification, will raise the block ASAP when verification is done, @DickJC123 Thanks for your quick response to this issue!

@haojin2 haojin2 self-requested a review October 23, 2019 04:35
Copy link
Contributor

@haojin2 haojin2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix verified, now merging the PR

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants