
Fix a memory misalignment in topk operator #15948

Merged

sxjscience merged 7 commits into apache:master from apeforest:bugfix/topk-oom on Aug 23, 2019

Conversation

apeforest
Contributor

Description

The current memory alignment in the topk operator is incorrect when index_t is int64_t. This PR fixes the potential issue and partially fixes #15703.

The PR also fixes an incorrect data type usage in mshadow/tensor.h.
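
For context, a minimal sketch of the class of bug being fixed (the function names and the float/int64_t pairing here are illustrative assumptions, not the actual topk code):

  #include <cstddef>
  #include <cstdint>

  // A raw char* workspace is carved into a float section followed by an
  // index section. If the byte offset of the index section is not a
  // multiple of alignof(int64_t), the cast yields a misaligned pointer.
  int64_t* carve_indices_unsafe(char* workspace, size_t n_values) {
    return reinterpret_cast<int64_t*>(workspace + n_values * sizeof(float));
  }

  // The fix: round the offset up to the alignment of the index type
  // before placing the index section.
  size_t align_up(size_t offset, size_t alignment) {
    return (offset + alignment - 1) / alignment * alignment;
  }

  int64_t* carve_indices_aligned(char* workspace, size_t n_values) {
    size_t offset = align_up(n_values * sizeof(float), alignof(int64_t));
    return reinterpret_cast<int64_t*>(workspace + offset);
  }

With index_t defined as int64_t (large tensor support), an index section placed right after an odd number of float elements would be only 4-byte aligned, which is undefined behavior on most targets and a hard fault on some.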

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

Changes

Bug fix

Comments

  • If this change is a backward incompatible change, explain why it must be made.
  • Interesting edge cases to note here

@apeforest
Contributor Author

@access2rohit @ChaiBapchya Please also help review. Thanks

@@ -69,15 +69,15 @@ struct Shape {
   * \param idx dimension index
   * \return the corresponding dimension size
   */
-  MSHADOW_XINLINE index_t &operator[](index_t idx) {
+  MSHADOW_XINLINE index_t &operator[](int idx) {
Contributor

Can you describe this change in a bit more detail? Why is this required?

Contributor Author

This does not directly fix the bug. However, while checking the tensor struct, I noticed that this data type should be int because it indexes the dimension, which is declared as int.
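
A simplified sketch of that relationship, assuming a large-tensor build where index_t is int64_t (the real struct in mshadow/tensor.h has more members and MSHADOW_XINLINE qualifiers):

  #include <cstdint>

  typedef int64_t index_t;  // assuming a large-tensor build

  // The number of dimensions is a compile-time int, so the subscript
  // should be int as well; the per-dimension extents remain index_t.
  template<int dimension>
  struct Shape {
    static const int kDimension = dimension;  // dimension count is an int
    index_t shape_[kDimension];               // per-dimension sizes

    // idx ranges over [0, kDimension), so int is the natural type here.
    index_t &operator[](int idx) {
      return shape_[idx];
    }
  };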

Contributor

@ChaiBapchya left a comment

LGTM!

  // Temp space needed by the full sorts.
  size_t temp_size = std::max(
      mxnet::op::SortByKeyWorkspaceSize<index_t, DType, xpu>(src.Size()),
      mxnet::op::SortByKeyWorkspaceSize<DType, index_t, xpu>(src.Size()));
Member

I think we should also include mxnet::op::SortByKeyWorkspaceSize<index_t, index_t, xpu>(src.Size()).

Contributor Author

Done
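
A sketch of what the resolved computation plausibly looks like after this suggestion (std::max over an initializer list is one natural form; the exact code in the merged commit may differ):

  // The workspace must cover every key/value combination the full sorts
  // may use, including the index/index sort suggested above.
  size_t temp_size = std::max({
      mxnet::op::SortByKeyWorkspaceSize<index_t, DType, xpu>(src.Size()),
      mxnet::op::SortByKeyWorkspaceSize<DType, index_t, xpu>(src.Size()),
      mxnet::op::SortByKeyWorkspaceSize<index_t, index_t, xpu>(src.Size())});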

   return shape_[idx];
 }
 /*!
  * \brief get corresponding index
  * \param idx dimension index
  * \return the corresponding dimension size
  */
-  MSHADOW_XINLINE const index_t &operator[](index_t idx) const {
+  MSHADOW_XINLINE const index_t &operator[](int idx) const {
Member

There are other places in the same file that assume idx has index_t type. We need to fix all of them.

Contributor Author

Done

@sxjscience sxjscience merged commit 73a692e into apache:master Aug 23, 2019
@apeforest apeforest deleted the bugfix/topk-oom branch August 23, 2019 05:03
@TaoLv
Member

TaoLv commented Aug 23, 2019

Thank you for the fix, @apeforest. Will this be cherry-picked to the v1.5.x branch?

apeforest added a commit that referenced this pull request Aug 23, 2019
* fix alignment

* use correct type for shape index

* clean up unnecessary space in topk

* fix lint

* add additional temp space

* address reviewer comment

* fix incorrect index type
@apeforest
Contributor Author

@TaoLv I have cherry-picked it to v1.5.x (commit hash: 42746bc)

Somehow the v1.5.x branch was not PR-protected, so when I pushed it to origin it got merged automatically. Please let me know if this is okay. Thanks.

@TaoLv
Member

TaoLv commented Aug 24, 2019

@apeforest Could you please open a dummy PR against the v1.5.x branch to make sure the branch can properly pass the CI?

@apeforest
Contributor Author

See #15999

apeforest added a commit to apeforest/incubator-mxnet that referenced this pull request Aug 25, 2019
sxjscience pushed a commit that referenced this pull request Aug 27, 2019
* Revert "Fix a memory misalignment in topk operator (#15948)"

This reverts commit 42746bc.
shuokay added a commit to shuokay/mshadow that referenced this pull request Aug 29, 2019
Successfully merging this pull request may close these issues: topk regression in v1.5