[MXNET-1450] Improve the backward mirroring implementation #18228
Conversation
Hey @ArmageddonKnight, thanks for submitting the PR.
CI supported jobs: [website, edge, unix-cpu, miscellaneous, unix-gpu, windows-gpu, sanity, windows-cpu, clang, centos-cpu, centos-gpu]
Force-pushed from 2509aaf to ade545d
Force-pushed from 6ab28ea to 0e94405
Force-pushed from 8c626df to 55cde60
Force-pushed from ce933ca to 8f5bdd7
@mxnet-bot run ci [centos-gpu]
Jenkins CI successfully triggered: [centos-gpu]
LGTM. Thanks
@ArmageddonKnight would you mind sharing some performance results with this feature enabled?
@eric-haibin-lin According to our evaluation on a single machine with an RTX 2080 Ti, the performance overhead of training ResNet-152 with a batch size of 152 is 6%.
Is there a way to use it for Gluon?
Hi @sxjscience, sorry for the late reply. It is possible in principle, but the current Gluon backend does not invoke the mirroring pass, so enabling backward mirroring currently has no effect on Gluon.
Description
This PR improves the backward mirroring implementation. Specifically, it considers, for each (group of) operator nodes, whether backward mirroring is truly beneficial to the total memory footprint (please refer to test cases #1 and #2 below). It also considers the data dependencies between a forward node and its corresponding gradient node, because it is possible for the feature maps of a layer to be recomputed without recomputing the layer itself (e.g., the Fully-Connected layer, test case #3). These improvements allow us to further reduce the memory consumption of DNN training.
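For readers who want to try the feature, below is a minimal, hypothetical usage sketch (not part of this PR). It assumes backward mirroring is toggled through the existing MXNET_BACKWARD_DO_MIRROR environment variable and exercised through the symbolic/Module API, since Gluon is currently unaffected (see the comment above); the network and shapes are made up purely for illustration.

```python
import os
# Assumption: backward mirroring is toggled via the MXNET_BACKWARD_DO_MIRROR
# environment variable; it must be set before the graph is bound.
os.environ['MXNET_BACKWARD_DO_MIRROR'] = '1'

import mxnet as mx

# A small, hypothetical symbolic network; the mirroring pass decides per
# (group of) nodes whether recomputing their outputs during the backward
# pass is worth the extra compute for the memory it saves.
data = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data, num_hidden=128, name='fc1')
net = mx.sym.Activation(net, act_type='relu', name='relu1')
net = mx.sym.FullyConnected(net, num_hidden=10, name='fc2')
net = mx.sym.SoftmaxOutput(net, name='softmax')

# Bind through the Module API; the mirroring pass runs when the symbolic
# executor is constructed (Gluon is unaffected, as noted above).
mod = mx.mod.Module(net, context=mx.cpu())
mod.bind(data_shapes=[('data', (32, 256))],
         label_shapes=[('softmax_label', (32,))])
mod.init_params()
mod.init_optimizer(optimizer='sgd')
```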
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments
FYI, @eric-haibin-lin @szha