This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

MXNet AMP (automatic mixed precision) #14173

Merged: 74 commits merged into apache:master on May 21, 2019

Conversation

@ptrendx (Member) commented on Feb 15, 2019

Description

This is a Work in Progress PR adding AMP (automatic mixed precision) support to MXNet, similar to the PyTorch version found in https://github.com/NVIDIA/apex.

This PR relies on multiple other PRs and bug fixes, listed in the Comments section.

The dynamic loss scaling part was done by @Caenorst (commits were squashed for easier rebasing).

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • An auditor that automatically enables/disables running operations in FP16. It is implemented by patching MXNet functions in mxnet.symbol and mxnet.ndarray to insert casts to FP16/FP32 where necessary.
  • Operators amp_cast and amp_multicast that handle casting between FP16 and FP32 when necessary and leave other types unchanged. They are optimized to be no-ops when the input is already of the proper type.
  • Dynamic loss scaling and supporting operators that check gradients for infs/NaNs and skip the update step if such a value is encountered (a rough usage sketch follows this list).
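
As a rough illustration of how these pieces are meant to fit together, here is a usage sketch in the style of the contrib AMP tutorial added later in this PR; the amp.init / amp.init_trainer / amp.scale_loss names and the exact signatures should be treated as approximate rather than authoritative.

import mxnet as mx
from mxnet import autograd, gluon
from mxnet.contrib import amp

# Patch mxnet.symbol/mxnet.ndarray functions so that whitelisted ops run in
# FP16 and blacklisted ops receive FP32 inputs.
amp.init()

net = gluon.model_zoo.vision.resnet50_v1(pretrained=True, ctx=mx.gpu(0))
net.hybridize()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
# Attach the dynamic loss-scaling state to the trainer.
amp.init_trainer(trainer)

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
data = mx.nd.random.uniform(shape=(8, 3, 224, 224), ctx=mx.gpu(0))
label = mx.nd.zeros((8,), ctx=mx.gpu(0))

with autograd.record():
    out = net(data)
    loss = loss_fn(out, label)
    # Scale the loss before backward; gradients are later checked for
    # infs/NaNs and the update is skipped if any are found.
    with amp.scale_loss(loss, trainer) as scaled_loss:
        autograd.backward(scaled_loss)
trainer.step(data.shape[0])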

Comments

FYI @eric-haibin-lin @szha

@ptrendx requested a review from szha as a code owner on February 15, 2019 03:55
.gitmodules Outdated
@@ -25,7 +25,7 @@
url = https://github.com/dmlc/cub
[submodule "3rdparty/tvm"]
Member:

If we need to upgrade to the latest commit of tvm (which depends on its own version of dmlc-core), is it necessary to upgrade the dmlc-core submodule in mxnet?

Member Author (@ptrendx):

a. Any nnvm header that is included in an MXNet .cc file will use MXNet's dmlc-core.
b. I'm not sure whether the NNVM build uses its own dmlc-core - it is possible that the environment is set up so that all components use the same (MXNet's) dmlc-core. Otherwise you could end up with binary incompatibilities (e.g. if dmlc-core changes some struct that is passed around between NNVM and MXNet, you don't want it to be interpreted differently).

Member:

tvm now comes with a NOTICE file, so we'd need to add that to mxnet's NOTICE too, per the Apache license.

@ankkhedia (Contributor):

@ptrendx Thanks for the contribution!

@mxnet-label-bot add [pr-work-in-progress ]

@marcoabreu added the pr-work-in-progress (PR is still work in progress) label on Feb 15, 2019
python/mxnet/amp/amp.py (two outdated review comments, resolved)
# If the conditional argument is absent or does not take one of the listed
# values, call the original function unchanged; otherwise cast the positional
# arguments to the target dtype.
if (cond_arg[0] not in kwargs or
        kwargs[cond_arg[0]] not in cond_arg[1]):
    return f(*args, **kwargs)
new_args = list(map(lambda x: _cast_symbol_NDArray(x, target_dtype), args))
Contributor:

If an fp16 output is used for multiple fp32 operators, can we cast the fp16 only once?

Member Author (@ptrendx):

That would require a graph pass approach instead of simple function substitution.

Contributor:

This will degrade the performance a lot, so I suggest using the graph pass.

Member Author (@ptrendx):

Adding a graph pass may be the next step for performance tuning, but it is definitely not necessary for the feature to be useful. I tested all the models from GluonCV and in none of them did I see a need for it (the unnecessary casts typically take <1% of the total time).

Contributor:

Sounds good :) Could you share your data and the plan, as per my comments below?

@pengzhao-intel (Contributor):

@ptrendx @eric-haibin-lin could you share the total picture (or methodology/SW stack) of the AMP integration? I'd like to understand it at a high level before going into the details.

if not _amp_initialized:
    _amp_initialized = True
    logging.info("Using AMP")
    target_dtype = np.dtype(target_dtype)
Member:

assert that target_dtype is float16

Member Author (@ptrendx):

Ok
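
A minimal sketch of the requested check (the exact message and its placement inside amp.init are assumptions):

assert target_dtype in ['float16', np.float16], \
       "AMP currently supports only float16 as the target dtype"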

"""Lists of functions whitelisted/blacklisted for automatic mixed precision in symbol API."""

# Functions that should be cast to lower precision
TARGET_DTYPE_FUNCS = [
Member:

This can still be called FP16_FUNCS and FP16_FP32_FUNCS, right? If bfloat16 support is added in the future, additional lists would be added here.

Member Author (@ptrendx):

In the previous comment you said "change the lists to have target_dtype_list, target_dtype_fp32_list", which is why I made that change. I can revert it - either way is fine by me.

Member:

Apologies for not being clear. I meant <target_dtype>_fp32_list

Member Author (@ptrendx):

ok, will change
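
For context, a sketch of how the per-dtype lists ended up being organized (the FP16_FUNCS / FP16_FP32_FUNCS / FP32_FUNCS names match the "Moving back to FP16_FUNCS and FP16_FP32_FUNCS" commit below; the entries shown are purely illustrative):

# Ops that are safe and beneficial to run in FP16.
FP16_FUNCS = [
    'Convolution',
    'FullyConnected',
    ]

# Ops that can run in either precision and are left in whatever type their
# inputs already have.
FP16_FP32_FUNCS = [
    'Activation',
    'BatchNorm',
    ]

# Ops that are forced back to FP32 for numerical safety.
FP32_FUNCS = [
    # numerically sensitive ops are pinned to FP32 here
    ]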

aux = sym.list_auxiliary_states()
# Cast the generated inputs to the target dtype, but leave auxiliary states
# (e.g. BatchNorm running statistics) untouched.
inputs = list(map(lambda x: _cast_symbol_NDArray(x, target_dtype)
                  if x.name not in aux else x, inputs))
atomic_sym = sym._gen_atomic_symbol()
Member:

why is this needed?

Member Author (@ptrendx):

Basically, this allows us to cast inputs to the op that were not specified by the user (e.g. if you create a convolution via the symbolic API, you don't need to pass weights and biases to it; MXNet will generate them). So what we do here is create a symbol using the original function (which creates all the other children), then take those children and cast them. However, neither MXNet nor NNVM allows manipulating a symbol's children after it has been created. So we create a new, atomic symbol (atomic meaning it does not have any children set yet, but has all the same params as the symbol used to create it) and populate that symbol with the cast inputs.
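
A condensed sketch of that flow (the wrapper shape is approximated; _cast_symbol_NDArray and _gen_atomic_symbol come from this PR, and the final composition call is an assumption):

def _wrap_symbol_fn(f, target_dtype):
    def wrapper(*args, **kwargs):
        # Create the symbol normally; MXNet generates the hidden inputs
        # (weight, bias, ...) that the user did not pass explicitly.
        sym = f(*args, **kwargs)
        inputs = sym.get_children()
        aux = sym.list_auxiliary_states()
        # Cast every generated input except the auxiliary states.
        inputs = [x if x.name in aux else _cast_symbol_NDArray(x, target_dtype)
                  for x in inputs]
        # Children of an existing symbol cannot be replaced, so rebuild the op
        # as an atomic symbol (same params, no children) and compose it with
        # the cast inputs.
        atomic_sym = sym._gen_atomic_symbol()
        return atomic_sym(*inputs)
    return wrapper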

Member:

When pretrained models are loaded under amp.init, will it silently fail?

Member Author (@ptrendx):

Pretrained as in symbol? That will not do anything (it needs your PR).
Pretrained as in load_parameters? That will work as expected (I was using it for all my experiments with GluonCV, where e.g. SSD uses a pretrained RN50 backbone).

@anirudh2290 (Member) left a comment:

Overall this change LGTM. This will be really useful to our users. Thanks a lot for your effort @ptrendx!

@ptrendx changed the title from "[WIP] MXNet AMP (automatic mixed precision)" to "MXNet AMP (automatic mixed precision)" on May 17, 2019
@eric-haibin-lin (Member) left a comment:

Can we have unit tests for loss scaler, multi_cast and isfinite ops to make sure they're not broken in the future?
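
A rough sketch of what such tests could look like (the operator names amp_cast, amp_multicast and all_finite come from this PR; the exact signatures, namespaces and return conventions used below are assumptions):

import numpy as np
import mxnet as mx

def test_amp_cast_is_noop_on_matching_dtype():
    x = mx.nd.ones((2, 2), dtype='float16')
    y = mx.nd.amp_cast(x, dtype='float16')  # should leave dtype and values unchanged
    assert y.dtype == np.float16
    assert np.array_equal(x.asnumpy(), y.asnumpy())

def test_amp_multicast_casts_to_widest_type():
    a = mx.nd.ones((2, 2), dtype='float16')
    b = mx.nd.ones((2, 2), dtype='float32')
    out_a, out_b = mx.nd.amp_multicast(a, b, num_outputs=2)
    # Both outputs should come back in the widest input type (FP32 here).
    assert out_a.dtype == np.float32 and out_b.dtype == np.float32

def test_all_finite_flags_nan():
    good = mx.nd.ones((4,))
    bad = mx.nd.array([1.0, float('nan'), 2.0, 3.0])
    # Assuming the op returns 1 when every element is finite and 0 otherwise.
    assert mx.nd.all_finite(good).asscalar() == 1
    assert mx.nd.all_finite(bad).asscalar() == 0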

python/mxnet/model.py (two review comments, resolved)
python/mxnet/symbol/symbol.py (review comment resolved)
@anirudh2290 merged commit 5bc08ce into apache:master on May 21, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* Beginning of AMP

* Optimize noop cast

* More operations added

* Backward cast

* Adding AMPCast and AMPMultiCast

* Fix some of lint

* Changed symbol wrapper to handle hidden inputs
Added PoC of dynamic loss scaling

* Moved back to dmlc/tvm repo

* fix counter reset to increase loss scale every 2k iterations

* Fix indentation

* Add contrib from symbol and ndarray to symbol list

* Adding where to widest type cast

* Do not cast in imperative mode on CPU context

* Update dmlc-core to fix unittests

* Fix wrapper metadata, fix self handling

* Blacklist sync batchnorm (since its implementation is FP32 only)

* Fix lint

* Enable losses to be tuple

* Get rid of AMP handle

* Add scaling to Output functions

* Fix pylint

* Update dmlc-core

* Changing prints in AMP to logging.info

* NNVM -> MXNet for FInferShape

* Bring the inplaceidentity fix to copied pass from NNVM

* Added tutorial for AMP

* Making Windows compiler happy

* Fixes to tutorial

* More fixes

* Fix lint

* Fix

* Add amp/index.md to whitelist for tutorial tests

* Whitelisting cuDNN RNN

* Manual unscale

* _internal functions wrapping

* Make SymbolFunctor from Symbol

* Fix the type infer function of AMP multicast

* Added ability to override casting lists

* Making clang-tidy and pylint happy

* More cleaning

* Making clang-tidy really happy

* remove amp_cast and amp_multicast before saving the model

* Changes from review

* Add RemoveAmpCast in a separate c_api function, add the option in symbol.save

* add remove_amp_cast option (True by default) to everyway of saving symbol

* Fix

* First stab at adding the gray list

* More ops added

* Adding the rest of the functions

* Improvements to AMP test

* Changing of names and changing wrapping

* Moving to contrib

* Modifying tutorial for contrib AMP

* Removing non existent functions

* Fix import in test

* Fix lint

* Added new functions

* Added assert

* Fix the unknown ndim in PlanMemory pass

* Moving back to FP16_FUNCS and FP16_FP32_FUNCS

* Removing unnecessary ops

* Adding ops that exist only in some build configurations and removing
tests checking that AMP lists contain only existing ops

* Removing warning when not every function was found during AMP init
because of functions being available only in specific configurations

* Add tests and doc

* Fix the CPU version of all_finite

* Adding test cases for all_finite operator

* Add new operators

* Fix
Labels: pr-work-in-progress (PR is still work in progress)

10 participants