
[Large Tensor] Fixed RNN op #17632

Merged: 7 commits into apache:master on Mar 4, 2020

Conversation

connorgoggins (Contributor) commented Feb 20, 2020

Description

The RNN op previously broke on large tensor data (dimension >= 2^32). With the following input:

run_performance_test(nd.RNN, run_backward=True, inputs=[{'data': (2**28,4,4), 'parameters': nd.random_normal(shape=(7,)), 'state': nd.random_normal(shape=(1, 4, 1)), 'mode': 'rnn_relu', 'state_size': 1, 'num_layers': 1}], warmup=1, runs=1)

the following error was thrown:

MXNetError: Check failed: dim_size >= -1 (-2147483640 vs. -1) : shape dim size must be >= -1, while received -2147483640

To root-cause this issue, I ran the command above in a Python script under GDB and found that the underlying problem was in several of the function definitions in rnn-inl.h. Several data variables (input_size, batch_size, and seq_length) used the int dtype when they should have used index_t to properly handle long int dimensions. I switched these variables to index_t in the relevant function headers and, after rebuilding, the same command produced the correct output:

INFO:root:Begin Benchmark - RNN
INFO:root:Complete Benchmark - RNN
[{'RNN': [{'inputs': {'data': (268435456, 4, 4), 'parameters': '<NDArray 7 @cpu(0)>', 'state': '<NDArray 1x4x1 @cpu(0)>', 'mode': 'rnn_relu', 'state_size': 1, 'num_layers': 1}, 'max_storage_mem_alloc_cpu/0': 27917288.0, 'avg_time_forward_RNN': 1244353.25, 'avg_time_backward_RNN': 1345001.375}]}]

However, this only covered running the RNN op in relu mode. The op also supports three other modes: rnn_tanh, lstm, and gru. To ensure that the op works with large tensor data in each of these modes as well, I made extensive modifications to rnn_impl.h, changing the data types of many function parameters and local variables in the forward and backward functions for these modes.
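For illustration only, here is a minimal sketch of the pattern behind these changes, with assumed types rather than the actual MXNet code (index_t is taken here to be a signed 64-bit integer, and ElementCount is a hypothetical stand-in for the real signatures in rnn-inl.h and rnn_impl.h):

#include <cstddef>
#include <cstdint>

using index_t = std::int64_t;  // assumption: stands in for MXNet's signed index_t

// Dimensions that can exceed INT_MAX (seq_length, batch_size, input_size) are
// passed as index_t instead of int, so element-count arithmetic no longer
// overflows 32-bit integers.
inline std::size_t ElementCount(index_t seq_length,   // was: int seq_length
                                index_t batch_size,   // was: int batch_size
                                index_t input_size) { // was: int input_size
  return static_cast<std::size_t>(seq_length) * batch_size * input_size;
}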

To ensure completeness and to prevent future breaking changes, I also added a nightly test for the RNN op with large tensor data in tests/nightly/test_large_array.py.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • M src/operator/rnn-inl.h
  • M src/operator/rnn_impl.h
  • M tests/nightly/test_large_array.py

Comments

Tested on r5dn.24xl (Ubuntu 16.04) and p2.16xl (Ubuntu 16.04) with:

  1. Individual op run (for RNN op under each of the four modes)
  2. Full OpPerf run

Results

The key difference between the CPU and GPU tests was the instance type (r5dn.24xl for CPU, p2.16xl for GPU). All relevant build flags remained the same, and both were tested using the CPU context.

Single operator test - RNN ReLU op (GPU)
Single operator test - RNN ReLU op (CPU)
Single operator test - RNN tanh op (CPU)
Single operator test - RNN LSTM op (CPU)
Single operator test - RNN GRU op (CPU)

Full OpPerf test (CPU)

@apeforest @access2rohit @ChaiBapchya

connorgoggins (Contributor, Author): @mxnet-label-bot add [pr-awaiting-review]

@lanking520 lanking520 added the pr-awaiting-review PR is waiting for code review label Feb 20, 2020
@@ -123,7 +123,7 @@ struct RNNParam : public dmlc::Parameter<RNNParam> {
};

inline int GetRnnParamSize(int num_layer,
int input_size,
index_t input_size,
access2rohit (Contributor) commented Feb 24, 2020

connorgoggins (Contributor, Author): Good point, fixing.

Contributor: size_t? Make sure the API signature doesn't change. If that's the case, then keep it index_t.

inline size_t GetRNNWorkspaceSize(int seq_length,
int batch_size,
inline size_t GetRNNWorkspaceSize(index_t seq_length,
index_t batch_size,
Contributor: Can batch_size be negative? @apeforest, what do you think?

const int input_size,
const index_t seq_length,
const index_t batch_size,
const index_t input_size,
Contributor: Did you check that seq_length, batch_size, and input_size are index_t in LstmForwardTraining, GruForwardTraining, and VanillaRNNForwardTraining? If so, can you let me know here; otherwise you may need to update them too.

connorgoggins (Contributor, Author): Excellent point, updating now.

const int input_size,
const index_t seq_length,
const index_t batch_size,
const index_t input_size,
Contributor: Did you check that seq_length, batch_size, and input_size are index_t in LstmBackwardTraining, GruBackwardTraining, and VanillaRNNBackwardTraining? If so, can you let me know here; otherwise you may need to update them too.

connorgoggins (Contributor, Author): Excellent point, updating now.

out = nd.RNN(data=data, parameters=parameters, state=state, mode=mode,
             state_size=state_size, num_layers=num_layers)

assert out.shape[0] == 268435456
Contributor: Please use constants for constant values. It's good practice, and you may need to reuse them in future tests too.

@connorgoggins connorgoggins force-pushed the fix_rnn_large_tensor branch 3 times, most recently from b58a51a to 556dbff Compare February 26, 2020 19:26
@@ -140,14 +140,14 @@ inline int GetRnnParamSize(int num_layer,
size *= 3;
break;
}
int size1 = (input_size + state_size + 2) * size; // first layer size
int size2 = (state_size * direction + state_size + 2) * size; // other layers size
index_t size1 = (input_size + state_size + 2) * size; // first layer size
Contributor: Let's prefer size_t for sizes. Or do you think these values can be negative too?

connorgoggins (Contributor, Author): Agreed - testing changes now.

int seq_length,
int batch_size,
index_t seq_length,
index_t batch_size,
Contributor: size_t? If it's not a breaking change.

Comment on lines +281 to +284
const index_t seq_length,
const index_t batch_size,
const index_t input_size,
Contributor: size_t? If it's not a breaking change.

Comment on lines +323 to +326
const index_t seq_length,
const index_t batch_size,
const index_t input_size,
Contributor: size_t? If it's not a breaking change.

Comment on lines +364 to +368
const index_t seq_length,
const index_t batch_size,
const index_t input_size,
Contributor: size_t? If it's not a breaking change.

Contributor: I think we can keep the signed index_t, since all the functions being called use signed types and the omp loop requires a signed index as well.

Contributor: Agree with keeping the signed index_t 💯

connorgoggins (Contributor, Author): I agree @apeforest! I believe the omp loop’s required signed index was the root cause of the segfault when I made the size_t changes.

@@ -127,9 +127,9 @@ void LstmForwardTraining(DType* ws,
bool state_outputs,
const int L,
const int D,
Contributor: What are D and L? Can D*L be > 5B?

connorgoggins (Contributor, Author): L is num_layers, and D is direction. I believe we agreed that we would not support > 2**32 layers, so L should be fine as an int. D can have two possible values, 0 or 1, to indicate whether to run the op with bidirectional recurrent layers. Consequently, D*L can never be > 5B.

Contributor: Fair to keep D and L as int then.

@@ -146,15 +146,15 @@ void LstmForwardTraining(DType* ws,
Tensor<cpu, 3, DType> hx(hx_ptr, Shape3(total_layers, N, H));
Tensor<cpu, 3, DType> cx(cx_ptr, Shape3(total_layers, N, H));
const int b_size = 2 * H * 4;
Contributor: size_t?

connorgoggins (Contributor, Author): As discussed offline: H represents the LSTM state size, and we are not supporting LSTM states with dimension >= 2**32, so b_size should remain an int.

Contributor: If I understand it correctly, this is also the reason that hidden_size remains an int type, right? If so, b_size here, representing the total size of the i2h/h2h biases of the four gates, still has some overflow risk.

connorgoggins (Contributor, Author): @zixuanweeei thanks for your feedback. I’m happy to bump b_size up to index_t here if there are overflow concerns.
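As a rough illustration of that overflow concern (a sketch with assumed types, not the rnn_impl.h source): even when H itself fits comfortably in an int, a product such as 2 * H * 4 is evaluated in int arithmetic and can overflow, whereas widening to index_t before multiplying keeps the bias size exact.

#include <cstdint>

using index_t = std::int64_t;  // assumption: stands in for MXNet's signed index_t

// Hypothetical b_size computation: 2 bias vectors (i2h and h2h) * 4 gates * hidden size H.
// Widening the first factor forces the whole product into 64-bit arithmetic.
inline index_t LstmBiasSize(int H) {
  return static_cast<index_t>(2) * H * 4;
}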

ChaiBapchya (Contributor) left a comment: Apart from the changes pointed out by Rohit, the rest LGTM.

TaoLv (Member) commented Feb 29, 2020: @zixuanweeei Could you please take a look at the changes? It seems we need to coordinate with the changes in #17702.

zixuanweeei (Contributor) commented Mar 1, 2020:

> @zixuanweeei Could you please take a look at the changes? It seems we need to coordinate with the changes in #17702.

Let's wait for @connorgoggins to update the type to size_t for some variables representing sizes, and for his feedback. Overall it looks good; we can get this merged first. Since hidden_size/state_size still remain int, I think there is not much to do for projection_size if there is no further concern.

connorgoggins (Contributor, Author):
@zixuanweeei thanks for your feedback! After testing the size_t changes on a separate branch, I found that they resulted in a segmentation fault (even on low-dimensional input). When I ran the op on the same input with my index_t changes (as they exist in the current state of this PR), the op passed without errors. My index_t changes also allow the op to run successfully on large tensor (dimension >= 2^32) input.

With these considerations in mind, we are discussing the best way to move forward.

@@ -123,7 +123,7 @@ struct RNNParam : public dmlc::Parameter<RNNParam> {
};

inline int GetRnnParamSize(int num_layer,
Contributor: Should this return index_t? It may overflow with a large input size. Also, the UT only covers cases with a large sequence length (the first dimension of the RNN input data). Would you mind running some tests with a large input size to see whether this function still works?

zixuanweeei (Contributor): @connorgoggins Just curious about the reason for the segfault. I don't have much knowledge about that, but I guess it may be caused by for (size_t t = T - 1; t >= 0; --t) {} in the backward pass.

connorgoggins (Contributor, Author): @zixuanweeei you're absolutely right - the segfault was generated on line 2032 of rnn_impl.h during the backward pass when I ran the op in ReLU mode. This line lies within the iteration section of the omp loop and, as @apeforest astutely pointed out, the omp loop requires a signed index, which led to errors when the size_t changes were implemented.
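A small sketch of that failure mode, using assumed types and a hypothetical helper rather than the actual rnn_impl.h code: with an unsigned counter, the reverse-time condition t >= 0 is always true, so the index wraps past zero and the next access is out of bounds, whereas a signed index_t terminates cleanly and also satisfies OpenMP's preference for a signed loop variable.

#include <cstddef>
#include <cstdint>
#include <vector>

using index_t = std::int64_t;  // assumption: stands in for MXNet's signed index_t

// Reverse-time accumulation over T steps (illustrative only).
void BackwardPassSketch(std::vector<float>* dx, std::size_t T) {
  // Buggy variant: with size_t, `t >= 0` never becomes false, so after t == 0
  // the decrement wraps to SIZE_MAX and the access goes out of bounds.
  //   for (std::size_t t = T - 1; t >= 0; --t) { (*dx)[t] += 1.0f; }

  // Signed variant: terminates after t == 0 and is a valid OpenMP loop index.
  for (index_t t = static_cast<index_t>(T) - 1; t >= 0; --t) {
    (*dx)[t] += 1.0f;
  }
}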

zixuanweeei (Contributor):

> @zixuanweeei you're absolutely right - the segfault was generated on line 2032 of rnn_impl.h during the backward pass when I ran the op in ReLU mode. This line lies within the iteration section of the omp loop and, as @apeforest astutely pointed out, the omp loop requires a signed index, which led to errors when the size_t changes were implemented.

Thanks for trying out the size_t type. Let's keep the signed index_t.

connorgoggins (Contributor, Author): @apeforest @zixuanweeei I believe my latest commit incorporates all of the changes we have discussed. The b_size variable and GetRnnParamSize function are now of type index_t, and the changes have been successfully tested on both small and large tensor input.

apeforest (Contributor) left a comment: Thanks a lot for your contribution!

apeforest merged commit 5cffa74 into apache:master on Mar 4, 2020
MoisesHer pushed a commit to MoisesHer/incubator-mxnet that referenced this pull request Apr 10, 2020
* Changed relevant function args to index_t

* Added nightly test for RNN

* Added fix for LSTM, GRU, RNN-ReLU, RNN-tanh

* Using const instead of literals

* Added nightly test for RNN ReLU & tanh, LSTM, GRU

* Type assertion to force evaluation of output NDArray

* Incorporated latest round of comments
bgawrych pushed a commit to bgawrych/incubator-mxnet that referenced this pull request May 18, 2020
bgawrych pushed a commit to bgawrych/incubator-mxnet that referenced this pull request May 18, 2020
bgawrych pushed a commit to bgawrych/incubator-mxnet that referenced this pull request May 27, 2020
bgawrych pushed a commit to bgawrych/incubator-mxnet that referenced this pull request May 28, 2020
bgawrych pushed a commit to bgawrych/incubator-mxnet that referenced this pull request May 29, 2020
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 29, 2020
bgawrych pushed a commit to bgawrych/incubator-mxnet that referenced this pull request Jun 1, 2020
bgawrych pushed a commit to bgawrych/incubator-mxnet that referenced this pull request Jun 2, 2020
pengzhao-intel pushed a commit that referenced this pull request Jun 3, 2020

…8316)

* [Large Tensor] Backport of Fixed RNN op (#17632)

* [v1.7.x] Backport of Fix LSTM and GRU layers gradient calculations (#18203)

* Fix input gradient calculation for bidirectional LSTM

For bidirectional LSTM with number of layers > 2, the input gradient calculation was incorrect. The cause was overwriting the y derivative (dy) tensor with the calculated x derivative (dx) tensor before the right-to-left layer could use dy for its own gradient calculations. The proposed fix uses additional space to avoid the overwriting.

* Fix gradient calculation for GRU

For GRU with number of layers > 2, the i2h_weight gradient for the layers in the middle (all except the first and last) was incorrect. The wrong calculations were caused by assigning the output pointer to the input instead of calculating a new input pointer.

* Enable tests for GRU and LSTM gradients

* Fix comments

* Change loop iteration deduction

* Add more test cases for fused rnn layers

Co-authored-by: Connor Goggins <[email protected]>
pengzhao-intel pushed a commit that referenced this pull request Jun 3, 2020

* [v1.x] [Large Tensor] Backport of Fixed RNN op (#17632)

* [v1.x] Backport of Fix LSTM and GRU layers gradient calculations (#18203)

Co-authored-by: Connor Goggins <[email protected]>
ChaiBapchya pushed a commit to ChaiBapchya/mxnet that referenced this pull request Aug 15, 2020

…17632) (apache#18317)

* [v1.x] [Large Tensor] Backport of Fixed RNN op (apache#17632)

* [v1.x] Backport of Fix LSTM and GRU layers gradient calculations (apache#18203)

Co-authored-by: Connor Goggins <[email protected]>
Labels: pr-awaiting-review (PR is waiting for code review)
7 participants