
cuDNN non-persistent bidirectional RNN dgrad sync fix #16391

Merged: 6 commits into apache:master on Oct 10, 2019

Conversation

DickJC123 (Contributor)

Description

Background: A non-deterministic failure of test_operator_gpu.py:test_lstm_bidirectional, observed just once in our own CI, was traced to cuDNN's RNN dgrad implementation. As part of RNN dgrad, cuDNN launches many kernels into auxiliary streams that are different from the primary, user-settable stream of the cuDNN handle. The final kernel cuDNN launches into one of these aux streams was not being synchronized (via events) back to the handle's stream. When MXNet's RNN::Backward() returns, the GPU worker can launch a gradient-summation kernel into its main stream, and that kernel's execution could overlap or even precede the final cuDNN RNN dgrad kernel. Because MXNet calls cuDNN's wgrad immediately after dgrad, this data-race failure is exceedingly rare; in fact, we discovered the problem by code inspection, not by reproducing the original CI failure.
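To make the hazard concrete, here is a minimal CUDA sketch of the failing pattern (illustrative only: the kernels and streams are stand-ins, not cuDNN's internal code):

#include <cuda_runtime.h>

// Stand-ins for cuDNN's final dgrad kernel and MXNet's gradient summation.
__global__ void dgrad_tail(float* grad) { grad[threadIdx.x] *= 2.0f; }
__global__ void grad_sum(float* grad) { grad[threadIdx.x] += 1.0f; }

void racy_backward(float* grad, cudaStream_t main_stream, cudaStream_t aux_stream) {
  // The library launches its last kernel into an auxiliary stream ...
  dgrad_tail<<<1, 256, 0, aux_stream>>>(grad);
  // ... but no event joins aux_stream back to main_stream, so this summation
  // may overlap with, or even run before, dgrad_tail.
  grad_sum<<<1, 256, 0, main_stream>>>(grad);
}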

The first commit of this PR demonstrates the failure. The approach is to have MXNet skip the RNN wgrad operation when grad_req = {'data': 'add', 'parameters': 'null'}, and to expand the test_lstm_bidirectional test to invoke this case. Skipping wgrad aggravates the data race: MXNet can launch the gradient-summation kernel sooner after the RNN dgrad kernels, with no intervening wgrad GPU activity.

Once the failure is solidly demonstrated, a follow-up commit supplies the fix. The fix uses CUDA events directed at the legacy default stream to ensure all dgrad GPU activity is complete before wgrad or other kernels begin. Unlike a cudaDeviceSynchronize(), the fix does not block the CPU. nvprof-based timing analysis of the fix shows no measurable difference for the single-RNN case analyzed.
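Reusing the kernels from the sketch above, the fix pattern looks roughly like this (a sketch, assuming the streams involved are ordinary blocking streams, i.e. not created with cudaStreamNonBlocking, and that sync_event was created beforehand):

void synced_backward(float* grad, cudaStream_t main_stream, cudaStream_t aux_stream,
                     cudaEvent_t sync_event) {
  dgrad_tail<<<1, 256, 0, aux_stream>>>(grad);
  // The legacy default stream implicitly synchronizes with every blocking
  // stream: this record cannot start until the aux_stream kernel finishes, and
  // work submitted to any blocking stream afterwards cannot start until the
  // record completes. The CPU is never stalled.
  cudaEventRecord(sync_event, cudaStreamLegacy);
  grad_sum<<<1, 256, 0, main_stream>>>(grad);  // now ordered after dgrad_tail
}

Because of the legacy stream's implicit synchronization, the record alone creates the needed ordering; no explicit cudaStreamWaitEvent is required on the consuming stream.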

The fix is needed for all versions of cuDNN supported by MXNet (so up to the current v7.6.4) and is needed only for non-persistent bidirectional RNNs.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
    • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
    • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
    • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
    • For user-facing API changes, the API doc string has been updated.
    • For new C++ functions in header files, their functionality and arguments are documented.
    • For new examples, a README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Expanded test_lstm_bidirectional (with optional wgrad skipping) to demonstrate the dgrad data race
  • cuDNN RNN dgrad sync fix via a CUDA event recorded on the legacy default stream


@DickJC123 DickJC123 requested a review from ptrendx on October 8, 2019 21:00
src/operator/rnn-inl.h: review thread (outdated, resolved)
@ptrendx (Member) left a comment:
LGTM

@DickJC123 (Contributor, Author):

I've rebased the commits of the PR onto latest master in an attempt to avoid pylint failures, as suggested by @reminisce. If this succeeds, does it suggest a bug in the way the CI creates the repo under test?

@ptrendx ptrendx merged commit a2018ba into apache:master Oct 10, 2019
aaronmarkham pushed a commit to aaronmarkham/incubator-mxnet that referenced this pull request on Oct 16, 2019:
* Alter test_lstm_bidirectional to demo fast-fail with optional wgrad.

* Fix cuDNN RNN dgrad sync.

* Simplify gpu activity sync sequence.

* Remove repeated running of now-passing test.

* Trigger CI
if (CUDNN_VERSION <= 7604 && dgrad_sync_needed_) {
  // Without blocking the CPU, create a synchronization point of all current GPU activity. No
  // need to call cudaStreamWaitEvent: cudaEventRecord on the legacy default stream suffices.
  CUDA_CALL(cudaEventRecord(dgrad_sync_event_, cudaStreamLegacy));
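For context, the event recorded above must be a live handle created on the current device. A hypothetical setup (illustrative; not necessarily MXNet's exact code) could look like:

cudaEvent_t dgrad_sync_event_;
// Timing disabled makes each record cheaper. Recording an event that was never
// created, was already destroyed, or was created on a different device fails
// with cudaErrorInvalidResourceHandle, as reported in the comment below.
CUDA_CALL(cudaEventCreateWithFlags(&dgrad_sync_event_, cudaEventDisableTiming));
// ... the event is recorded during RNN::Backward(), as shown above ...
CUDA_CALL(cudaEventDestroy(dgrad_sync_event_));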
@haojin2 (Contributor) commented on Oct 20, 2019:
Hi @DickJC123, I'm encountering a cudaErrorInvalidResourceHandle error here when trying to run two notebooks in the Dive into Deep Learning textbook. Could you help with a fix for that?
