[WIP][MXNET-107] Fused LSTM implementation for CPU #10104
Conversation
@szha @Jerryzcn @eric-haibin-lin @pengzhao-intel Could you help review this PR? We need cooperation to refactor the cudnn registration.
src/operator/rnn-inl.h
Outdated
size_t size = 0;
switch (mode) {
case rnn_enum::kRnnRelu:
break;
Need error message for unimplemented modes.
src/operator/rnn-inl.h
Outdated
size = (seq_length + 1) * batch_size * hidden_size * 4 + batch_size * hidden_size;
break;
case rnn_enum::kGru:
break;
Add a default statement for code robustness.
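Both review comments point at the same pattern: a switch over RNN modes should fail loudly for unimplemented cases instead of falling through silently. A minimal Python sketch of the idea (the function name and the "lstm" mode string are illustrative, not the actual C++ API; the LSTM workspace formula is the one quoted in the diff above):

```python
def get_rnn_workspace_size(mode, seq_length, batch_size, hidden_size):
    """Return the temp-workspace element count for a fused RNN op.

    Only LSTM is implemented here; every other mode must fail with a
    clear error rather than silently returning 0.
    """
    if mode == "lstm":
        # formula from the diff: 4 gate buffers per step, plus one
        # extra step, plus one hidden-state-sized buffer
        return (seq_length + 1) * batch_size * hidden_size * 4 \
            + batch_size * hidden_size
    # default branch: unimplemented modes (rnn_relu, rnn_tanh, gru)
    raise NotImplementedError("RNN mode '%s' is not implemented for CPU" % mode)
```

The same shape applies in the C++ switch: a `default:` (or per-case `LOG(FATAL)`) for every mode that has no CPU kernel yet.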
src/operator/rnn-inl.h
Outdated
w_ptr, y_ptr, hy_ptr, cy_ptr);
break;
case rnn_enum::kGru:
break;
Also need an error message for unimplemented modes and a default statement for the switch-case.
src/operator/rnn.cc
Outdated
@@ -19,40 +19,214 @@
/*!
* Copyright (c) 2015 by Contributors
* \file rnn.cc
remove this change
src/operator/rnn.cc
Outdated
* \brief
* \author Sebastian Bodenstein
* \author Sebastian Bodenstein, Shu Zhang([email protected])
remove whitespace
src/operator/rnn.cc
Outdated
}
}
static inline int NumVisibleOutputs(const NodeAttrs& attrs) {
const RNNParam& params = nnvm::get<RNNParam>(attrs.parsed);
fix indents in this function.
src/operator/rnn.cc
Outdated
MXNET_REGISTER_OP_PROPERTY(RNN, RNNProp)
.describe("Applies a recurrent layer to input.")
inline static bool RNNStorageType(const nnvm::NodeAttrs& attrs,
const int dev_mask,
fix indent. Align with the first parameter.
src/operator/rnn.cc
Outdated
inline static bool BackwardRNNStorageType(const nnvm::NodeAttrs& attrs,
const int dev_mask,
DispatchMode* dispatch_mode,
fix indent. Align with the first parameter.
Seems …
wh = mx.random.uniform(-1, 1, (4 * H, H), ctx=xpu, dtype=type1)
bx = mx.nd.zeros((4 * H,), ctx=xpu, dtype=type1)
bh = mx.nd.zeros((4 * H,), ctx=xpu, dtype=type1)
x1.attach_grad()
why do you need to manually attach grad??
attach_grad is used to create gradient buffers for these NDArrays. Do you mean this can be implemented in other ways, or do you have any suggestion about this piece of code?
In case we use a stateful op, what's your opinion, @eric-haibin-lin?
@@ -1540,6 +1548,7 @@ def check_rnn_layer_w_rand_inputs(layer):
for g, c in zip(gs, cs):
assert_almost_equal(g.asnumpy(), c.asnumpy(), rtol=1e-2, atol=1e-6)

@unittest.skip("test fails intermittently. temporarily disabled till it gets fixed.")
Why is it failing?
If USE_CUDNN=1, I think this test will run into the cudnn implementation, which has been disabled temporarily. We will re-enable this test case after we add cudnn back. In fact, the GPU build is currently failing; we are working on it.
src/operator/rnn.cc
Outdated
};

NNVM_REGISTER_OP(RNN)
.describe(R"code(Applies a recurrent layer to input
Please provide more detailed descriptions
okay, will do.
src/operator/rnn-inl.h
Outdated
DType* reserve_space_ptr = out_data[out_expected - 1].dptr<DType>();

// allocate temp space
size_t workspace_size = GetRNNWorkspaceSize(param_.seq_length_, param_.batch_size_,
nit: const
src/operator/rnn-inl.h
Outdated
Tensor<cpu, 1, DType> workspace = ctx.requested[rnn_enum::kTempSpace]
.get_space_typed<cpu, 1, DType>(Shape1(workspace_size), s);

int direction = param_.bidirectional ? 2 : 1;
nit: const
I used … to replace collapse.
@Jerryzcn Good suggestion. I will take a try.
BTW, is there any existing JIRA issue for the RNN implementation? Do I need to create a JIRA issue for this PR? @eric-haibin-lin @Jerryzcn @szha
pls create one
Should we create a separate branch for CPU RNN? Once all the changes are checked in, we merge the RNN branch into master. This way master won't break people's code.
@Jerryzcn good idea. @eric-haibin-lin please open a branch
Happened to be around on GitHub. I created the branch cpu_fused_rnn and updated the PR base.
Thanks @szha, will keep working on this.
@sherry-zhang Good job!
@marcoabreu I am working on branch cpu_fused_rnn, but CI fails in the sanity check. I suspect the CI environment has been adjusted for the master branch, so this branch cannot work properly. Could you help take a look? Thanks. The pylint check passes on my local server but fails in the sanity check:
Looks like this is not going to make it into 1.2
@piiswrong the merged RNN feature supports inference-only LSTM that is compatible with the cudnn implementation. The Gluon LSTM layer now supports inference-only forwarding with this feature, and the rest of the use cases are still on old code paths, thanks to @Jerryzcn. The merged PR does what it sets out to do better than what previously exists, so it's more than half baked.
cc @zhiheng-huang as his team will likely be impacted by the decision of reverting Jerry's PR.
For now, please fork the master branch in your own repository and let collaborators make PRs towards your repository. At the same time, create a PR from your fork towards the master branch to have constant feedback every time a commit to your fork is made.
@szha @piiswrong May I have your opinions? I don't have permission to create/delete branches and redirect this PR to the master branch. I can rebase the code onto the master branch if needed.
I have changed the base branch as requested. We currently have an internal discussion about whether we support feature branches in the official repository; until then, it would be better to work towards master to ensure your PR is always receiving the latest updates. I have also retriggered CI.
Thanks, @marcoabreu! Really understand your concern. Will keep working on this PR.
Thanks a lot! Please excuse the inconvenience; in case of further problems, feel free to ping me again.
I feel it is difficult to change the existing Gluon LSTM layer from normal …
@TaoLv I'll look into this later. I think you can do it similar to RNNCell.
out = exe.forward(is_train=False)
out[0].wait_to_read()
assert False  # should not reach here
except mx.base.MXNetError as err:
Excellent approach! This will ensure we don't forget to re-enable the test when we introduce dropout. Great job.
Yes. Also to ensure the failure happens at the proper position and the correct error message is presented. This follows @reminisce's idea in PR 10844.
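The pattern under discussion (run the op, expect a specific failure, and treat a silent success as a test failure) can be sketched generically in Python. The helper name and error text below are hypothetical; the real test catches `mx.base.MXNetError` from the actual forward call:

```python
def check_unsupported_dropout(run_forward):
    """Assert that a forward call with dropout fails with a clear error
    message, rather than silently succeeding (dropout is unsupported in
    the CPU fused LSTM)."""
    try:
        run_forward()
    except RuntimeError as err:
        # the failure must mention the unsupported feature
        assert "dropout" in str(err).lower()
        return True
    # should not reach here: silent success means the check was skipped
    raise AssertionError("forward unexpectedly succeeded")
```

Keeping the `assert False`-style sentinel after the forward call is what guarantees the test is revisited when dropout support lands.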
@piiswrong I added a test for dropout. I think this lstm operator is good to merge. Dropout support and hybrid rnn layer are WIP and will be submitted in another PR. I will also rebase #10311 accordingly.
src/operator/rnn_impl.hpp
Outdated
@@ -0,0 +1,454 @@
/*
we don't use hpp. Please rename to .h
fixed
src/operator/rnn_impl.hpp
Outdated
for (int i = 0; i < T; ++i) {
int t = bid ? T - 1 - i : i;
linalg_gemm(i ? h : hx, wh, yh_flat, alpha, beta, false, true);
#pragma omp parallel for
use https://github.com/apache/incubator-mxnet/blob/master/src/operator/mxnet_op.h#L435 to get recommended number of threads for openmp
fixed
src/operator/rnn_impl.hpp
Outdated
}
}
}
memcpy(y_ptr, rs + y_offset, T * N * H * D * sizeof(DType));
Why is the copy needed?
One copy is for forward output and the other copy is for the reuse in backward computation.
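To illustrate why the reserve space is kept: the forward pass stashes the gate activations so backward can reuse them instead of recomputing. A numpy sketch of one LSTM step under this scheme (shapes and names are illustrative, not the C++ implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, wx, wh, bx, bh, reserve):
    """One fused-LSTM step. x: (N, I), h/c: (N, H), wx: (4H, I),
    wh: (4H, H), bx/bh: (4H,). Gate activations are appended to
    `reserve` so backward can reuse them without recomputation."""
    gates = x @ wx.T + h @ wh.T + bx + bh      # (N, 4H), i|f|g|o layout
    H = h.shape[1]
    i = sigmoid(gates[:, 0 * H:1 * H])
    f = sigmoid(gates[:, 1 * H:2 * H])
    g = np.tanh(gates[:, 2 * H:3 * H])
    o = sigmoid(gates[:, 3 * H:4 * H])
    c_next = f * c + i * g
    h_next = o * np.tanh(c_next)
    reserve.append((i, f, g, o, c_next))       # kept for backward
    return h_next, c_next
```

In the C++ operator the reserve buffer plays the role of this list: one region of `rs` holds the per-step gates, and the output copy exists so the outputs survive while the internal buffer is reused.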
src/operator/rnn_impl.hpp
Outdated
for (int i = 0; i < T; ++i) {
int t = bid ? T - 1 - i : i;
linalg_gemm(i ? h : hx, wh, yh_flat, alpha, beta, false, true);
#pragma omp parallel for
same
fixed
src/operator/rnn_impl.hpp
Outdated
const Tensor<cpu, 2, DType>& dcnext = i ? dc : dcx;
const Tensor<cpu, 2, DType>& hnext = i ? htmp : hx;
const Tensor<cpu, 2, DType>& cnext = i ? c[i - 1] : cx;
#pragma omp parallel for
same
fixed
src/operator/rnn_impl.hpp
Outdated
const int row = T * N;
const int col = H * 4;
for (int i = 0; i < row; ++i) {
#pragma omp parallel for
same
The omp usage may not be efficient here. The operations in this loop are very simple, while col is usually less than a few thousand.
You are right. I will remove this omp temporarily and look for a better optimization for this piece of code.
src/operator/rnn_impl.hpp
Outdated
const DType beta1 = 1.0;
const int cell_size = N * H;
if (dhy_ptr != NULL) {
memcpy(dh.dptr_, dhy_ptr, cell_size * sizeof(DType));
Why are the copies needed?
data = mx.sym.Variable('data')

Y1, _ = cell1.unroll(T, data, layout='NTC', merge_outputs=True)
mod1 = mx.mod.Module(Y1, label_names=None, context=mx.cpu())
use default_context() here and remove the corresponding tests in test_operator_gpu. These tests will automatically be run again in test_operator_gpu with default_context() = gpu()
Fixed. Also, I changed the name of test_lstm to test_lstm_sym since it would conflict with the one in /unittest/test_gluon_rnn.py after being imported into test_operator_gpu.py.
@eric-haibin-lin @piiswrong @szha @Jerryzcn the comments are addressed. Please help take another look.
DType ft = ifgo[i][j][k][1];
DType gt = ifgo[i][j][k][2];
DType ot = ifgo[i][j][k][3];
dh[j][k] += dy[t][j][k + offset];
dh and dc are never read before they are overwritten. Why do you need the copy at line 341?
}
}
}
memcpy(y_ptr, rs + y_offset, T * N * H * D * sizeof(DType));
why not write to y_ptr directly for the last layer?
}
Tensor<cpu, 2, DType> dyh(difgo[t].dptr_, Shape2(N, H * 4));
linalg_gemm(dyh, wh, dhnext, alpha, beta0, false, false);
linalg_gemm(dyh, hnext, dwh, alpha, beta1, true, false);
dwh is overwritten. why do you need to set it to 0 with memset at 328?
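For context on the memset question: whether a zero-fill is needed depends on the gemm's beta. A numpy sketch of the linalg_gemm-style update convention (the helper is hypothetical, illustrating only the beta semantics):

```python
import numpy as np

def gemm_acc(a, b, out, beta):
    """linalg_gemm-style update: out = a^T @ b + beta * out.

    With beta = 0 the old contents of `out` are ignored, so no prior
    memset is needed; with beta = 1 gradients accumulate across time
    steps, so the buffer must start zeroed."""
    out[...] = a.T @ b + beta * out
    return out
```

In the snippet above, `dhnext` is written with beta0 (overwrite) while `dwh` is written with beta1 (accumulate over the time loop), which is the crux of the review exchange.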
I can merge this first, but I think the memset and memcpy statements are superfluous. We should get rid of them later.
* register RNN fused-API with nnvm, finish single-layer && undirection LSTM forward function
* fix coding style and lint complains
* add single-layer && undirectional LSTM backward function
* make interface universal for other RNN mode
* share intermediate result between forward and backward in a trick way
* add comments for important parameters
* modify testcase
* Fix coding style and error message
* fix openmp collapse error
* fix const
* remove rnn.cu and skip related testcases temporarily for building on GPU
* support multi-layer and bidirectional for lstm inference
* remove some testcaseS in test_gluon_rnn.py to build on GPU
* remove testcase between fp32 and fp64 temporarily
* retrigger ci
* fix some logs
* use a better way to share memory
* fix cudnn registration
* fix invariant calculations and enable some gpu testcases
* add thread local cache for cudnn rnn op
* add thread local cache for rnn op
* fix bugs
* remove some testcases to check segmentfault
* remove cudnn registeration to check segmentfault
* support multi-layer for LSTM Training
* modify lstm testcase
* add bidirectional support for lstm
* fix gluon and coding style
* fix bugs
* remove nnvm registration
* enable gpu testcases
* add detailed descriptions
* add dropout check
* fix workspace size
* dropout is not supported, add unit test for it
* fix review comments
Description
In this PR, a fused LSTM operator for CPU is implemented. More support for other RNN variants is WIP and will be submitted in other PRs.
Feature changes
New features
* Fused LSTM implementation for the sym.RNN operator
* Refactor code and register the sym.RNN operator with NNVM interfaces
Unit-test changes
* Add test_lstm and test_lstm_bidirectional in tests/python/unittests/test_operator.py
Performance
We have tested the performance of sym.RNN and rnn.LSTMCell on a local Skylake-8180 with 2 sockets and 56 cores, using MKL as the BLAS library in this performance test. The test input size is taken from the DS2 default parameters (seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800).
Single-layer measurement: [table comparing this PR (#10104) with LSTMCell; figures not recoverable from this page]
Multi-layer measurement (num_layers = 5): [table comparing this PR (#10104) with LSTMCell, LSTMCell(cuda), and sym.RNN(cudnn); figures not recoverable from this page]
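For readers who want a rough number on their own machine, a crude numpy stand-in for the timing loop is sketched below. It times the chained GEMMs an LSTM layer performs, not the actual sym.RNN operator; the function name and the simplified per-step math are illustrative only:

```python
import time
import numpy as np

def bench_lstm_like(T, N, I, H, n_iter=3):
    """Time T chained GEMMs of the shapes one LSTM layer performs
    (DS2 sizes would be T=300, N=20, I=800, H=800) and return a
    rough samples-per-second figure."""
    x = np.random.rand(T, N, I).astype(np.float32)
    wx = np.random.rand(4 * H, I).astype(np.float32)
    wh = np.random.rand(4 * H, H).astype(np.float32)
    h = np.zeros((N, H), dtype=np.float32)
    best = float("inf")
    for _ in range(n_iter):
        t0 = time.perf_counter()
        for t in range(T):
            gates = x[t] @ wx.T + h @ wh.T   # dominant cost per step
            h = np.tanh(gates[:, :H])        # placeholder nonlinearity
        best = min(best, time.perf_counter() - t0)
    return N / max(best, 1e-12)              # samples per second
```

This only approximates the GEMM-bound portion; the real operator also fuses the elementwise gate math and state updates.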
Opens
* Fix cudnn registration in this PR
Checklist
Essentials
* Passed code style checking (make lint)