This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[Large Tensor] Fix cumsum op #17677

Merged

apeforest merged 8 commits into apache:master from connorgoggins:fix_cumsum_large_tensor on Feb 29, 2020

Conversation

connorgoggins
Contributor

@connorgoggins connorgoggins commented Feb 24, 2020

Description

The cumsum op was previously breaking on large tensor data (tensors with more than 2^32 elements). With the following input:

run_performance_test(nd.cumsum, inputs=[{"a": nd.random_normal(shape=(2**32 + 1, 1))}], run_backward=True, warmup=1, runs=1)

the following error was thrown:

Segmentation fault (core dumped)

To find the root cause, I ran the previous command in a Python script under GDB and found that the underlying problem was the data type of several variables in the forward/backward structs in np_cumsum-inl.h. These variables used the int dtype when they should have used index_t to properly handle long int indices. After switching these variables to index_t in the struct header and rebuilding, the previous input command displayed the correct output:

INFO:root:Begin Benchmark - cumsum
INFO:root:Complete Benchmark - cumsum
[{'cumsum': [{'inputs': {'a': '<NDArray 4294967297x1 @cpu(0)>'}, 'max_storage_mem_alloc_cpu/0': 33285996.0, 'avg_time_forward_cumsum': 4366.7148, 'avg_time_backward_cumsum': 12744.9971}]}]
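The failure mode can be illustrated without MXNet. A minimal sketch, assuming only that mshadow's index_t maps to a signed 64-bit integer when large-tensor support is enabled: a shape of (2**32 + 1, 1) has more elements than a signed 32-bit int can represent, so any index or loop bound stored in an int wraps around and produces an out-of-bounds offset (the segfault above).

```python
import ctypes

# Element count of the failing input shape (2**32 + 1, 1).
n_elements = 2**32 + 1

# A signed 32-bit index (the old `int` fields in the cumsum structs)
# silently wraps; a 64-bit index (index_t, assumed int64) holds the
# full element count.
narrow = ctypes.c_int32(n_elements).value
wide = ctypes.c_int64(n_elements).value

print(narrow)  # 1 -- wrapped value, yielding a bogus offset in the kernel
print(wide)    # 4294967297
```

Any pointer arithmetic done with the wrapped 32-bit value addresses the wrong memory, which is consistent with the segmentation fault observed.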

To ensure completeness and to prevent future breaking changes, I also added a nightly test for the cumsum op with large tensor data in tests/nightly/test_large_array.py.
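A scaled-down sketch of the style of such a test (hypothetical helper name; the real test in tests/nightly/test_large_array.py uses mxnet.nd and a shape exceeding 2**32 elements, which requires tens of GB of memory, so NumPy and small stand-in sizes are used here):

```python
import numpy as np

# Stand-in sizes so the sketch runs anywhere; the real nightly test
# uses a first dimension larger than 2**32.
LARGE_X, SMALL_Y = 2**10 + 1, 1

def check_cumsum(shape):
    a = np.ones(shape)
    out = np.cumsum(a, axis=0)
    assert out.shape == shape
    # Value check on the last element forces full evaluation of the
    # output array (mirrors the value check added in the PR).
    assert out[-1, 0] == shape[0]

check_cumsum((LARGE_X, SMALL_Y))
print("cumsum large-tensor sketch passed")
```

Checking a concrete value rather than only the dtype matters for lazily evaluated NDArrays: it guarantees the whole forward pass actually ran.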

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • M src/operator/numpy/np_cumsum-inl.h
  • M tests/nightly/test_large_array.py

Comments

Tested on r5dn.24xl (Ubuntu 16.04) with:

  1. Individual op run
  2. Full OpPerf run

Results

Single operator test - cumsum op (CPU)

Full OpPerf test (CPU)

@apeforest @access2rohit @ChaiBapchya

@connorgoggins
Contributor Author

@mxnet-label-bot add [pr-awaiting-review]

@lanking520 lanking520 added the pr-awaiting-review PR is waiting for code review label Feb 24, 2020
Contributor

@haojin2 haojin2 left a comment

LGTM

Contributor

@ChaiBapchya ChaiBapchya left a comment

LGTM

@connorgoggins connorgoggins force-pushed the fix_cumsum_large_tensor branch 2 times, most recently from fd32e6a to c065cdc Compare February 25, 2020 18:48
@sandeep-krishnamurthy
Contributor

  1. Is there any performance regression associated with this index type change? (Performance before and after the change)
  2. Thanks for fixing this issue. Is there a way to know whether any other operators are broken with a similar issue? (Not intended to be fixed in this PR, but I wanted to see the blast radius and possibly create a tracking issue while we are on it here.)

@connorgoggins
Contributor Author

@sandeep-krishnamurthy thanks for your questions!

  1. Comparing the performance of the cumsum op before and after the changes in this PR, there is no regression in the time required for a forward pass.
  2. My current assignment has been to systematically fix all ops with similar issues (i.e. ops throwing errors on large tensor data). As of now, I have created PRs with fixes for every problematic op; once my remaining op-fix PRs (3 remaining, including this one) are merged, all similar issues will be resolved.

@connorgoggins connorgoggins force-pushed the fix_cumsum_large_tensor branch from e8dee74 to 8b849f1 Compare February 26, 2020 22:49
@@ -98,11 +98,11 @@ void CumsumForwardImpl(const OpContext& ctx,
}

Stream<xpu> *s = ctx.get_stream<xpu>();
-  MSHADOW_TYPE_SWITCH_WITH_BOOL(in.type_flag_, IType, {
+  MSHADOW_TYPE_SWITCH_WITH_BOOL(in.type_flag_, index_t, {
Contributor

I think you will not need this switch macro, right?

Contributor Author

Great point, reverted to IType.

@@ -157,10 +157,10 @@ void CumsumBackwardImpl(const OpContext& ctx,
}
}
Stream<xpu> *s = ctx.get_stream<xpu>();
-  MSHADOW_TYPE_SWITCH_WITH_BOOL(igrad.type_flag_, IType, {
+  MSHADOW_TYPE_SWITCH_WITH_BOOL(igrad.type_flag_, index_t, {
Contributor

Can you revert this change as well ? rest LGTM !

Contributor Author

Great point, reverted to IType.

MSHADOW_TYPE_SWITCH(ograd.type_flag_, OType, {
Kernel<cumsum_backward, xpu>::Launch(
-    s, igrad.Size() / middle, igrad.dptr<IType>(),
+    s, igrad.Size() / middle, igrad.dptr<index_t>(),
Contributor

Ditto

Contributor Author

Reverted, thanks!

@@ -157,10 +157,10 @@ void CumsumBackwardImpl(const OpContext& ctx,
}
}
Stream<xpu> *s = ctx.get_stream<xpu>();
-  MSHADOW_TYPE_SWITCH_WITH_BOOL(igrad.type_flag_, IType, {
+  MSHADOW_TYPE_SWITCH_WITH_BOOL(igrad.type_flag_, index_t, {
Contributor

You don't need a macro if the type is predefined.

Contributor

@access2rohit access2rohit left a comment

LGTM!

@apeforest apeforest merged commit 2527553 into apache:master Feb 29, 2020
MoisesHer pushed a commit to MoisesHer/incubator-mxnet that referenced this pull request Apr 10, 2020
* Implemented fix and nightly test for cumsum

* Changed IType to index_t

* Also changed in backward

* Reverting to IType

* Added type assertion on first element to force evaluation of output NDArray

* Reverted to IType in relevant places

* Last reversion

* Changed type assertion to value check
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 29, 2020
Labels
pr-awaiting-review PR is waiting for code review
7 participants