MKL-DNN LBR-GRU Inference Integration (FP32 LBR-GRU) #15741
Conversation
What's the reason to open a new PR instead of the previous one?
@pengzhao-intel I incorrectly used
Thanks for the explanation.
Are all comments in the original thread resolved?
LGTM and will merge tomorrow if there are no other comments.
Yes, all comments are resolved. @TaoLv Could you check this PR for the LBR-GRU integration again? Specifically, the type of the input params of and using a reference or pointer to access a
#pragma omp parallel for num_threads(omp_threads)
for (int i = 0; i < I * H; i++) {
for (int i = 0; i < input_size * hidden_size; i++) {
Better to move the expression input_size * hidden_size ahead of the for loop.
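A minimal sketch of that suggestion, assuming a simple element-wise loop over the weight buffer; the function and variable names here are hypothetical, not the PR's actual code:

```cpp
#include <omp.h>

// Hypothetical helper illustrating the suggestion: evaluate the loop bound
// once before the parallel loop instead of recomputing it in the condition.
void copy_weights(const float* src, float* dst,
                  const int input_size, const int hidden_size,
                  const int omp_threads) {
  const int iter_size = input_size * hidden_size;  // hoisted expression
  #pragma omp parallel for num_threads(omp_threads)
  for (int i = 0; i < iter_size; i++) {
    dst[i] = src[i];
  }
}
```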
const int single_cell_size = N * H;
const int single_b_size = ngates * H;
int w_size = (I + H) * H * ngates;
const int cell_size = batch_size * hidden_size;
Change all these sizes from int to size_t?
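A minimal sketch of what that could look like, assuming the same derived sizes as in the diff above; the helper and parameter names are hypothetical. Using size_t keeps large products such as (I + H) * H * ngates from overflowing a 32-bit int:

```cpp
#include <cstddef>

// Hypothetical sketch of the reviewer's suggestion: declare the derived
// sizes as size_t so large shapes cannot overflow a signed 32-bit int.
void derive_sizes(const size_t batch_size, const size_t input_size,
                  const size_t hidden_size, const size_t ngates,
                  size_t* cell_size, size_t* single_b_size, size_t* w_size) {
  *cell_size     = batch_size * hidden_size;
  *single_b_size = ngates * hidden_size;
  *w_size        = (input_size + hidden_size) * hidden_size * ngates;
}
```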
mkldnn_mems->hcx_memory[layer_index], mkldnn_mems->wx_memory[layer_index],
mkldnn_mems->wh_memory[layer_index], mkldnn_mems->bias_memory[layer_index],
mkldnn_mems->y_memory[layer_index],
mkldnn_mems->hcy_memory[layer_index], null_memory_);
nit: indent.
if (mode == rnn_enum::kGru) {
const int mx_single_b_sz = ngates * hidden_size;
for (int l = 0; l < num_layer; l++) {
#pragma omp parallel for num_threads(omp_threads)
We could use collapse(2) for these two for loops instead of parallelizing only the inner one. But note that the Microsoft Visual C++ compiler might not support collapse, so it could be guarded with a macro.
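A minimal sketch of that idea, assuming (as the reviewer notes) that MSVC's OpenMP implementation rejects the collapse clause; the function and buffer are hypothetical stand-ins for the layer/element loops in the diff:

```cpp
#include <vector>

// Hypothetical sketch: collapse the layer/element loops on compilers that
// support it, and fall back to parallelizing only the outer loop on MSVC.
void scale_bias(std::vector<float>* bias, const int num_layer,
                const int single_b_size, const int omp_threads,
                const float factor) {
#if defined(_MSC_VER)
  #pragma omp parallel for num_threads(omp_threads)
#else
  #pragma omp parallel for collapse(2) num_threads(omp_threads)
#endif
  for (int l = 0; l < num_layer; l++) {
    for (int i = 0; i < single_b_size; i++) {
      (*bias)[l * single_b_size + i] *= factor;
    }
  }
}
```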
Thanks for noting that.
if (mode == rnn_enum::kLstm) {
for (int l = 0; l < L; l++) {
for (int l = 0; l < num_layer; l++) {
offset1 = l * single_cell_size;
Can you also make offset1 and offset2 more readable?
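One possible reading of this suggestion, sketched with hypothetical names; single_w_size and the formula for the second offset are assumptions, not the PR's actual code:

```cpp
#include <cstddef>

// Hypothetical sketch only: give the per-layer offsets descriptive names
// instead of the generic offset1/offset2.
void compute_layer_offsets(const int num_layer, const size_t single_cell_size,
                           const size_t single_w_size) {
  for (int l = 0; l < num_layer; l++) {
    const size_t cell_state_offset = l * single_cell_size;  // was offset1
    const size_t weights_offset    = l * single_w_size;     // was offset2 (assumed)
    // ... index the state / weight buffers with these named offsets ...
    (void)cell_state_offset;
    (void)weights_offset;
  }
}
```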
@pengzhao-intel This PR might also contain flaky unit tests with GPU context. For instance:
test_operator_gpu.test_rnnrelu_sym: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-15741/4/pipeline
test_operator_gpu.test_rnnrelu_sym: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-15621/2/pipeline
test_operator_gpu.test_rnntanh_bidirectional: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-15621/2/pipeline
test_operator_gpu.test_rnntanh_bidirectional: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-15621/8/pipeline
test_operator_gpu.test_rnnrelu_sym: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-15741/5/pipeline
Could you analyze why the flaky test happens? Is it a numerical difference or an algorithm-level difference?
I tried to reproduce the failures on our internal GPU platforms, but all worked well. And it should be noted that the source code was compiled with cuda-9.0 and cudnn-9.0-linux-x64-v7.1.2, which are older than the oldest versions tested by CI (cuda-10.x and cudnn-xxx-v7.6).
@ciyongch please take a review again :)
Thanks for the heads-up about the changes surrounding GPU code. Given the reported flakiness, I'd like to have tomorrow (Friday) to investigate.
@DickJC123 Thanks for your patience. And FYI, it seems that the possible flaky tests were exposed by the edited UTs for the RNN variants. I have tried to modify the code following the instructions from #14476 (review). Specifically,
And all the spaces above are allocated only once using
(cherry picked from commit 1cf63e1)
Cherry picked from commit 1cf63e1 according to #15847 (comment)
@TaoLv please take a review again; I plan to merge after CI passes.
If it still needs a lot of effort to pass CI, we can drop it and wait for our 1.0 upgrade.
@pengzhao-intel Sure. There is a lot of refactoring work on both the MKL-DNN RNN and the naive RNN. At present, the MKL-DNN related work is under review. Perhaps we can just drop this PR and start a new one from the current commit on master.
What does Linear-Before-Reset mean?
See the different definitions of c(t) in GRU and LBR-GRU: https://intel.github.io/mkl-dnn/dev_guide_rnn.html#Linear-before-reset-GRU
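For quick reference, a simplified sketch of the difference, following the MKL-DNN RNN guide's formulation (notation abridged; see the link above for the exact equations):

```latex
% Vanilla GRU candidate state: the reset gate is applied to h_{t-1}
% before the recurrent linear transformation.
c_t = \tanh\left( W_c x_t + U_c \left( r_t \odot h_{t-1} \right) + b_c \right)

% Linear-Before-Reset GRU candidate state: the linear part (and its extra
% bias b_{u'}) is computed first, then gated by r_t.
c_t = \tanh\left( W_c x_t + r_t \odot \left( U_c h_{t-1} + b_{u'} \right) + b_c \right)
```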
Closing this PR since we will migrate it with MKL-DNN 1.0.
Description
Reopen #15621 here. We integrated the MKL-DNN Linear-Before-Reset GRU into MXNet. Currently, it supports FP32 inference. Please take a review on this PR. @ciyongch @TaoLv @pengzhao-intel
Performance
We tested the performance of FusedRNN with mode='gru' using the same dimensions as in PR #14713, i.e. seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800. We also compared the performance of this PR with that of the previously integrated LSTM, vRNN tanh, and vRNN ReLU on the master branch. It seems that there is a distinct regression with mode='lstm'.