cuDNN RNN dtype_with_fallback_ calc needs update #16670

DickJC123 · 2019-10-29T17:57:58Z

Description

In src/operator/rnn-inl.h, the calculation of the mathPrec passed to cudnnSetRNNDescriptor_v6() needs review (the variable is dtype_with_fallback_). The current logic is:

#if __CUDA_ARCH__ < 530 && CUDNN_VERSION >= 7500
      if (dtype_ == CUDNN_DATA_HALF) {
        dtype_with_fallback_ = CUDNN_DATA_FLOAT;
      } else {
        dtype_with_fallback_ = dtype_;
      }
#else
        dtype_with_fallback_ = dtype_;
#endif

The use of __CUDA_ARCH__ in determining host-side code behavior is not appropriate (the variable would be undefined in this case and interpreted as 0 in all cases). Thus the intent of the code (to ensure pseudo-fp16 handling of RNNs for Maxwell GPUs even after the API change of cuDNN 7.5) is being applied to all architectures.

My recommendation would be to adopt pseudo-fp16 for all architectures anyway. A possible fix:

    cudnnDataType_t dtype_with_fallback_ =
        (CUDNN_VERSION >= 7500 && dtype_ == CUDNN_DATA_HALF) ? CUDNN_DATA_FLOAT
                                                             : dtype_;

Technically, since cuDNN prior to v7.5 ignores the use mathPrec in cudnnSetRNNDescriptor_v6(), the CUDNN_VERSION >= 7500 term could be dropped with no impact.

While we're fixing this, I think we should consider turning on Tensor Cores by default, or at least giving the user a mechanism to do so.

To Reproduce

To verify my claim about __CUDA_ARCH__, I inserted the following code in rnn-inl.h just before the definition of dtype_with_fallback_:

#ifdef __CUDA_ARCH__
      LOG(INFO) << "************* __CUDA_ARCH__ = " << __CUDA_ARCH__;
#else
      LOG(INFO) << "************* __CUDA_ARCH__ is not defined";
#endif

$ nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_rnn
...
[10:53:00] src/operator/./rnn-inl.h:1388: ************* __CUDA_ARCH__ is not defined
[10:53:00] src/operator/./rnn-inl.h:1388: ************* __CUDA_ARCH__ is not defined

@stu1130 @ptrendx

The text was updated successfully, but these errors were encountered:

stu1130 · 2019-10-29T21:01:59Z

Thanks for the suggestion @DickJC123
I created a quick fix #16671

stu1130 · 2019-11-04T19:50:26Z

@DickJC123 do you think we can close the issue?
My though is that there are two things in the issue, first one was addressed but the the second one which is enabling the Tensor Cores by default is not tackled yet so I kept it open. But feel free to close it and maybe create enabling Tensor Cores issue if you want.

DickJC123 added the Bug label Oct 29, 2019

stu1130 mentioned this issue Oct 29, 2019

Fix cuDNN RNN dtype_with_fallback_ bug #16671

Merged

7 tasks

DickJC123 closed this as completed Nov 4, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cuDNN RNN dtype_with_fallback_ calc needs update #16670

cuDNN RNN dtype_with_fallback_ calc needs update #16670

DickJC123 commented Oct 29, 2019 •

edited

Loading

stu1130 commented Oct 29, 2019

stu1130 commented Nov 4, 2019

cuDNN RNN dtype_with_fallback_ calc needs update #16670

cuDNN RNN dtype_with_fallback_ calc needs update #16670

Comments

DickJC123 commented Oct 29, 2019 • edited Loading

Description

To Reproduce

stu1130 commented Oct 29, 2019

stu1130 commented Nov 4, 2019

DickJC123 commented Oct 29, 2019 •

edited

Loading