[Large Tensor] Fixed SoftmaxActivation op #17634

Merged

Conversation

@connorgoggins commented on Feb 20, 2020

Description

The SoftmaxActivation op previously broke on large tensor data (total number of elements >= 2^32). With the following input:

run_performance_test(nd.SoftmaxActivation, run_backward=True, inputs=[{'data': (2**29,2,2,2), 'out': nd.random_normal(shape=(2**29,2,2,2))}], warmup=1, runs=1)

the following error was thrown:

TBlob.get_with_shape: Check failed: this->shape_.Size() == static_cast<size_t>(shape.Size()) (4294967296 vs. 0) : new and old shape do not match total elements
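For context on the numbers in this message: 4294967296 is the total element count of the (2**29, 2, 2, 2) input, and 0 is what that count becomes once it is truncated to 32 bits. A quick Python check (illustration only, not part of the PR) confirms the arithmetic:

total_elements = (2**29) * 2 * 2 * 2
print(total_elements)            # 4294967296 == 2**32
# Keeping only the low 32 bits, as a 32-bit int-typed size variable effectively does,
# wraps the count around to 0, which explains the "4294967296 vs. 0" mismatch above.
print(total_elements % 2**32)    # 0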

To root-cause this issue, I ran the command above in a Python script under GDB and found that the underlying problem was in the shape-construction logic of src/operator/nn/softmax_activation-inl.h. In the functions that compute the forward pass and the gradient, several variables used the int dtype where they should have used index_t to correctly handle large (64-bit) dimensions. I switched these variables to index_t and, after rebuilding, the same command produced the correct output:

INFO:root:Begin Benchmark - SoftmaxActivation
INFO:root:Complete Benchmark - SoftmaxActivation
[{'SoftmaxActivation': [{'inputs': {'data': (536870912, 2, 2, 2), 'out': '<NDArray 536870912x2x2x2 @cpu(0)>'}, 'max_storage_mem_alloc_cpu/0': 24696062.0, 'avg_time_forward_SoftmaxActivation': 7426.1191, 'avg_time_backward_SoftmaxActivation': 16664.0254}]}]

To ensure completeness and to prevent future regressions, I also added a nightly test for the SoftmaxActivation op with large tensor data in tests/nightly/test_large_array.py.
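For reference, a minimal sketch of what such a nightly test can look like (an illustration under assumed conventions, not the exact test added to tests/nightly/test_large_array.py; it assumes MXNet built with large-tensor support, e.g. USE_INT64_TENSOR_SIZE=1, and enough memory for a 2**32-element array):

from mxnet import nd

LARGE_SHAPE = (2**29, 2, 2, 2)  # 2**32 elements in total

def test_softmax_activation_large_tensor():
    # illustrative test name and checks; the merged test may differ
    data = nd.random_normal(shape=LARGE_SHAPE)
    out = nd.SoftmaxActivation(data)
    assert out.shape == LARGE_SHAPE
    # softmax outputs are probabilities, so any single value must lie in [0, 1]
    val = out[0][0][0][0].asscalar()
    assert 0.0 <= val <= 1.0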

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • M src/operator/nn/softmax_activation-inl.h
  • M tests/nightly/test_large_array.py

Comments

Tested on r5dn.24xl (Ubuntu 16.04) and p2.16xl (Ubuntu 16.04) with:

  1. Individual op run
  2. Full OpPerf run

Results

The key difference between the CPU and GPU tests was the instance type (r5dn.24xl for CPU, p2.16xl for GPU). All relevant build flags remained the same, and both were tested using the CPU context.

Single operator test - SoftmaxActivation op (GPU)
Single operator test - SoftmaxActivation op (CPU)

Full OpPerf test (GPU)
Full OpPerf test (CPU)

@apeforest @access2rohit @ChaiBapchya

@connorgoggins

@mxnet-label-bot add [pr-awaiting-review]

@lanking520 added the pr-awaiting-review (PR is waiting for code review) label on Feb 20, 2020

@apeforest left a comment

LGTM. Thanks a lot!

@apeforest merged commit 5486828 into apache:master on Feb 20, 2020
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 29, 2020
* Changed dtype for data & gradient dimensions

* Add nightly test