
[Large Tensor] Fixed RNN op #17632

Merged: 7 commits into apache:master on Mar 4, 2020

Conversation

connorgoggins (Contributor) commented Feb 20, 2020

Description

The RNN op previously broke on large tensor data (dimension >= 2^32). With the following input:

run_performance_test(nd.RNN, run_backward=True, inputs=[{'data': (2**28,4,4), 'parameters': nd.random_normal(shape=(7,)), 'state': nd.random_normal(shape=(1, 4, 1)), 'mode': 'rnn_relu', 'state_size': 1, 'num_layers': 1}], warmup=1, runs=1)

the following error was thrown:

MXNetError: Check failed: dim_size >= -1 (-2147483640 vs. -1) : shape dim size must be >= -1, while received -2147483640

To root-cause this issue, I ran the command above in a Python script under GDB and found that the underlying problem was in several of the function definitions in rnn-inl.h. Several data variables (input_size, batch_size, and seq_length) used the int dtype when they should have used index_t to properly handle long int dimensions. I switched these variables to index_t in the relevant function headers and, after rebuilding, the same command produced the correct output:

INFO:root:Begin Benchmark - RNN
INFO:root:Complete Benchmark - RNN
[{'RNN': [{'inputs': {'data': (268435456, 4, 4), 'parameters': '<NDArray 7 @cpu(0)>', 'state': '<NDArray 1x4x1 @cpu(0)>', 'mode': 'rnn_relu', 'state_size': 1, 'num_layers': 1}, 'max_storage_mem_alloc_cpu/0': 27917288.0, 'avg_time_forward_RNN': 1244353.25, 'avg_time_backward_RNN': 1345001.375}]}]

However, this only covered running the RNN op in relu mode. The op also supports three other modes: rnn_tanh, lstm, and gru. To ensure that the op works with large tensor data in each of these modes as well, I made extensive modifications to rnn_impl.h, changing the data types of many function parameters and local variables in the forward and backward functions for these modes.
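For illustration only, here is a minimal sketch of the pattern behind these changes, with assumed types rather than the actual MXNet code (index_t is taken here to be a signed 64-bit integer, and ElementCount is a hypothetical stand-in for the real signatures in rnn-inl.h and rnn_impl.h):

#include <cstddef>
#include <cstdint>

using index_t = std::int64_t;  // assumption: stands in for MXNet's signed index_t

// Dimensions that can exceed INT_MAX (seq_length, batch_size, input_size) are
// passed as index_t instead of int, so element-count arithmetic no longer
// overflows 32-bit integers.
inline std::size_t ElementCount(index_t seq_length,   // was: int seq_length
                                index_t batch_size,   // was: int batch_size
                                index_t input_size) { // was: int input_size
  return static_cast<std::size_t>(seq_length) * batch_size * input_size;
}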

To ensure completeness and to prevent future breaking changes, I also added a nightly test for the RNN op with large tensor data in tests/nightly/test_large_array.py.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • M src/operator/rnn-inl.h
  • M src/operator/rnn_impl.h
  • M tests/nightly/test_large_array.py

Comments

Tested on r5dn.24xl (Ubuntu 16.04) and p2.16xl (Ubuntu 16.04) with:

  1. Individual op run (for RNN op under each of the four modes)
  2. Full OpPerf run

Results

The key difference between the CPU and GPU tests was the instance type (r5dn.24xl for CPU, p2.16xl for GPU). All relevant build flags remained the same, and both were tested using the CPU context.

Single operator test - RNN ReLU op (GPU)
Single operator test - RNN ReLU op (CPU)
Single operator test - RNN tanh op (CPU)
Single operator test - RNN LSTM op (CPU)
Single operator test - RNN GRU op (CPU)

Full OpPerf test (CPU)

@apeforest @access2rohit @ChaiBapchya

connorgoggins (Contributor, Author): @mxnet-label-bot add [pr-awaiting-review]

@lanking520 lanking520 added the pr-awaiting-review PR is waiting for code review label Feb 20, 2020
@@ -123,7 +123,7 @@ struct RNNParam : public dmlc::Parameter<RNNParam> {
};

inline int GetRnnParamSize(int num_layer,
int input_size,
index_t input_size,
access2rohit (Contributor) commented Feb 24, 2020

connorgoggins (Contributor, Author): Good point, fixing.

Contributor: size_t? Make sure the API signature doesn't change. If that's the case, then keep it index_t.

inline size_t GetRNNWorkspaceSize(int seq_length,
int batch_size,
inline size_t GetRNNWorkspaceSize(index_t seq_length,
index_t batch_size,
Contributor: Can batch_size be negative? @apeforest, what do you think?

const int input_size,
const index_t seq_length,
const index_t batch_size,
const index_t input_size,
Contributor: Did you check that seq_length, batch_size, and input_size are index_t in LstmForwardTraining, GruForwardTraining, and VanillaRNNForwardTraining? If so, can you let me know here; otherwise you may need to update them too.

connorgoggins (Contributor, Author): Excellent point, updating now.

const int input_size,
const index_t seq_length,
const index_t batch_size,
const index_t input_size,
Contributor: Did you check that seq_length, batch_size, and input_size are index_t in LstmBackwardTraining, GruBackwardTraining, and VanillaRNNBackwardTraining? If so, can you let me know here; otherwise you may need to update them too.

connorgoggins (Contributor, Author): Excellent point, updating now.

out = nd.RNN(data=data, parameters=parameters, state=state, mode=mode,
             state_size=state_size, num_layers=num_layers)

assert out.shape[0] == 268435456
Contributor: Please use constants for constant values. It's good practice, and you may need to reuse them in future tests too.

@connorgoggins connorgoggins force-pushed the fix_rnn_large_tensor branch 3 times, most recently from b58a51a to 556dbff Compare February 26, 2020 19:26
@@ -140,14 +140,14 @@ inline int GetRnnParamSize(int num_layer,
size *= 3;
break;
}
int size1 = (input_size + state_size + 2) * size; // first layer size
int size2 = (state_size * direction + state_size + 2) * size; // other layers size
index_t size1 = (input_size + state_size + 2) * size; // first layer size
Contributor: Let's prefer size_t for sizes. Or do you think these values can be negative too?

connorgoggins (Contributor, Author): Agreed - testing changes now.

int seq_length,
int batch_size,
index_t seq_length,
index_t batch_size,
Contributor: size_t? If it's not a breaking change.

Comment on lines +281 to +284
const index_t seq_length,
const index_t batch_size,
const index_t input_size,
Contributor: size_t? If it's not a breaking change.

Comment on lines +323 to +326
const index_t seq_length,
const index_t batch_size,
const index_t input_size,
Contributor: size_t? If it's not a breaking change.

Comment on lines +364 to +368
const index_t seq_length,
const index_t batch_size,
const index_t input_size,
Contributor: size_t? If it's not a breaking change.

Contributor: I think we can keep the signed index_t, since all the functions being called use signed types and the omp loop requires a signed index as well.

Contributor: Agree with keeping the signed index_t 💯

connorgoggins (Contributor, Author): I agree @apeforest! I believe the omp loop’s required signed index was the root cause of the segfault when I made the size_t changes.

@@ -127,9 +127,9 @@ void LstmForwardTraining(DType* ws,
bool state_outputs,
const int L,
const int D,
Contributor: What are D and L? Can D*L be > 5B?

connorgoggins (Contributor, Author): L is num_layers, and D is direction. I believe we agreed that we would not support > 2**32 layers, so L should be fine as an int. D can have two possible values, 0 or 1, to indicate whether to run the op with bidirectional recurrent layers. Consequently, D*L can never be > 5B.

Contributor: Fair to keep D and L as int then.

@@ -146,15 +146,15 @@ void LstmForwardTraining(DType* ws,
Tensor<cpu, 3, DType> hx(hx_ptr, Shape3(total_layers, N, H));
Tensor<cpu, 3, DType> cx(cx_ptr, Shape3(total_layers, N, H));
const int b_size = 2 * H * 4;
Contributor: size_t?

connorgoggins (Contributor, Author): As discussed offline: H represents the LSTM state size, and we are not supporting LSTM states with dimension >= 2**32, so b_size should remain an int.

Contributor: If I understand it correctly, this is also the reason that hidden_size remains an int type, right? If so, b_size here, representing the total size of the i2h/h2h biases of the four gates, still has some overflow risk.

connorgoggins (Contributor, Author): @zixuanweeei thanks for your feedback. I’m happy to bump b_size up to index_t here if there are overflow concerns.
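As a rough illustration of that overflow concern (a sketch with assumed types, not the rnn_impl.h source): even when H itself fits comfortably in an int, a product such as 2 * H * 4 is evaluated in int arithmetic and can overflow, whereas widening to index_t before multiplying keeps the bias size exact.

#include <cstdint>

using index_t = std::int64_t;  // assumption: stands in for MXNet's signed index_t

// Hypothetical b_size computation: 2 bias vectors (i2h and h2h) * 4 gates * hidden size H.
// Widening the first factor forces the whole product into 64-bit arithmetic.
inline index_t LstmBiasSize(int H) {
  return static_cast<index_t>(2) * H * 4;
}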

ChaiBapchya (Contributor) left a comment: Apart from the changes pointed out by Rohit, the rest LGTM.

TaoLv (Member) commented Feb 29, 2020: @zixuanweeei Could you please take a look at the changes? It seems we need to coordinate with the changes in #17702.

zixuanweeei (Contributor) commented Mar 1, 2020:

> @zixuanweeei Could you please take a look at the changes? It seems we need to coordinate with the changes in #17702.

Let's wait for @connorgoggins to update the type to size_t for some variables representing sizes, and for his feedback. Overall it looks good; we can get this merged first. Since hidden_size/state_size still remain int, I think there is not much to do for projection_size if there is no further concern.

connorgoggins (Contributor, Author):
@zixuanweeei thanks for your feedback! After testing the size_t changes on a separate branch, I found that they resulted in a segmentation fault (even on low-dimensional input). When I ran the op on the same input with my index_t changes (as they exist in the current state of this PR), the op passed without errors. My index_t changes also allow the op to run successfully on large tensor (dimension >= 2^32) input.

With these considerations in mind, we are discussing the best way to move forward.

@@ -123,7 +123,7 @@ struct RNNParam : public dmlc::Parameter<RNNParam> {
};

inline int GetRnnParamSize(int num_layer,
Contributor: Should this return index_t? It may overflow with a large input size. Also, the UT only covers cases with a large sequence length (the first dimension of the RNN input data). Would you mind running some tests with a large input size to see whether this function still works?

zixuanweeei (Contributor): @connorgoggins Just curious about the reason for the segfault. I don't have much knowledge about that, but I guess it may be caused by for (size_t t = T - 1; t >= 0; --t) {} in the backward pass.

connorgoggins (Contributor, Author): @zixuanweeei you're absolutely right - the segfault was generated on line 2032 of rnn_impl.h during the backward pass when I ran the op in ReLU mode. This line lies within the iteration section of the omp loop and, as @apeforest astutely pointed out, the omp loop requires a signed index, which led to errors when the size_t changes were implemented.
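A small sketch of that failure mode, using assumed types and a hypothetical helper rather than the actual rnn_impl.h code: with an unsigned counter, the reverse-time condition t >= 0 is always true, so the index wraps past zero and the next access is out of bounds, whereas a signed index_t terminates cleanly and also satisfies OpenMP's preference for a signed loop variable.

#include <cstddef>
#include <cstdint>
#include <vector>

using index_t = std::int64_t;  // assumption: stands in for MXNet's signed index_t

// Reverse-time accumulation over T steps (illustrative only).
void BackwardPassSketch(std::vector<float>* dx, std::size_t T) {
  // Buggy variant: with size_t, `t >= 0` never becomes false, so after t == 0
  // the decrement wraps to SIZE_MAX and the access goes out of bounds.
  //   for (std::size_t t = T - 1; t >= 0; --t) { (*dx)[t] += 1.0f; }

  // Signed variant: terminates after t == 0 and is a valid OpenMP loop index.
  for (index_t t = static_cast<index_t>(T) - 1; t >= 0; --t) {
    (*dx)[t] += 1.0f;
  }
}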

zixuanweeei (Contributor):

> @zixuanweeei you're absolutely right - the segfault was generated on line 2032 of rnn_impl.h during the backward pass when I ran the op in ReLU mode. This line lies within the iteration section of the omp loop and, as @apeforest astutely pointed out, the omp loop requires a signed index, which led to errors when the size_t changes were implemented.

Thanks for trying out the size_t type. Let's keep the signed index_t.

connorgoggins (Contributor, Author): @apeforest @zixuanweeei I believe my latest commit incorporates all of the changes we have discussed. The b_size variable and GetRnnParamSize function are now of type index_t, and the changes have been successfully tested on both small and large tensor input.

apeforest (Contributor) left a comment: Thanks a lot for your contribution!

apeforest merged commit 5cffa74 into apache:master on Mar 4, 2020
MoisesHer pushed a commit to MoisesHer/incubator-mxnet that referenced this pull request Apr 10, 2020
* Changed relevant function args to index_t

* Added nightly test for RNN

* Added fix for LSTM, GRU, RNN-ReLU, RNN-tanh

* Using const instead of literals

* Added nightly test for RNN ReLU & tanh, LSTM, GRU

* Type assertion to force evaluation of output NDArray

* Incorporated latest round of comments
bgawrych pushed a commit to bgawrych/incubator-mxnet that referenced this pull request May 18, 2020
bgawrych pushed a commit to bgawrych/incubator-mxnet that referenced this pull request May 18, 2020
bgawrych pushed a commit to bgawrych/incubator-mxnet that referenced this pull request May 27, 2020
bgawrych pushed a commit to bgawrych/incubator-mxnet that referenced this pull request May 28, 2020
bgawrych pushed a commit to bgawrych/incubator-mxnet that referenced this pull request May 29, 2020
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 29, 2020
bgawrych pushed a commit to bgawrych/incubator-mxnet that referenced this pull request Jun 1, 2020
bgawrych pushed a commit to bgawrych/incubator-mxnet that referenced this pull request Jun 2, 2020
pengzhao-intel pushed a commit that referenced this pull request Jun 3, 2020

…8316)

* [Large Tensor] Backport of Fixed RNN op (#17632)

* [v1.7.x] Backport of Fix LSTM and GRU layers gradient calculations (#18203)

* Fix input gradient calculation for bidirectional LSTM

For bidirectional LSTM with number of layers > 2, the input gradient calculation was incorrect. The cause was overwriting the y derivative (dy) tensor with the calculated x derivative (dx) tensor before the right-to-left layer could use dy for its own gradient calculations. The proposed fix uses additional space to avoid the overwriting.

* Fix gradient calculation for GRU

For GRU with number of layers > 2, the i2h_weight gradient for the layers in the middle (all except the first and last) was incorrect. The wrong calculations were caused by assigning the output pointer to the input instead of calculating a new input pointer.

* Enable tests for GRU and LSTM gradients

* Fix comments

* Change loop iteration deduction

* Add more test cases for fused rnn layers

Co-authored-by: Connor Goggins <[email protected]>
pengzhao-intel pushed a commit that referenced this pull request Jun 3, 2020

* [v1.x] [Large Tensor] Backport of Fixed RNN op (#17632)

* [v1.x] Backport of Fix LSTM and GRU layers gradient calculations (#18203)

Co-authored-by: Connor Goggins <[email protected]>
ChaiBapchya pushed a commit to ChaiBapchya/mxnet that referenced this pull request Aug 15, 2020

…17632) (apache#18317)

* [v1.x] [Large Tensor] Backport of Fixed RNN op (apache#17632)

* [v1.x] Backport of Fix LSTM and GRU layers gradient calculations (apache#18203)

Co-authored-by: Connor Goggins <[email protected]>
Labels: pr-awaiting-review (PR is waiting for code review)
7 participants