Make BERT-GPU deploy compatible with MXNet 1.8 #1389

MoisesHer · 2020-10-14T01:26:15Z

Description

Change custom graph pass implementation to make it compatible with MXNet 1.8
Solving issue #1388

Checklist

Essentials

Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage
Code is well-documented

Changes

Change custom graph pass to support both MXNet 1.7 & MXNet 1.8
Change setup and deploy scripts accordingly
Activate CUDA Graphs for MXNet 1.8 (> 30% speedup with small batch sizes)

cc @dmlc/gluon-nlp-team, @samskalicky, @Kh4L

mli · 2020-10-14T01:58:14Z

Job PR-1389/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/1/index.html

samskalicky · 2020-10-14T17:32:42Z

scripts/bert/bertpass_gpu.cc

@@ -30,6 +30,7 @@
 #include <functional>
 #include "mxnet/lib_api.h"

+#if MXNET_1_7


If you want, you can use the defines in lib_api.h to enable the appropriate code paths:
MXNet 1.7: #define MX_LIBRARY_VERSION 7
MXNet 1.8: #define MX_LIBRARY_VERSION 10
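
A minimal sketch of gating code paths on that define instead of a hand-rolled `MXNET_1_7` flag. The values 7 and 10 are the ones quoted above; the fallback `#define` and the helper name `selected_pass_path` are assumptions added only so the snippet compiles without `mxnet/lib_api.h`.

```cpp
#include <string>

// Assumption: in a real build MX_LIBRARY_VERSION comes from "mxnet/lib_api.h".
// This fallback exists only so the sketch is self-contained.
#ifndef MX_LIBRARY_VERSION
#define MX_LIBRARY_VERSION 10
#endif

// Hypothetical helper: reports which code path was compiled in.
inline std::string selected_pass_path() {
#if MX_LIBRARY_VERSION <= 7
  // MXNet 1.7 implementation of the custom pass would go here.
  return "mxnet-1.7-path";
#else
  // MXNet 1.8 implementation of the custom pass would go here.
  return "mxnet-1.8-path";
#endif
}
```

With this layout the same bertpass_gpu.cc builds against either header, and no extra `-D` flag has to be passed from the setup script.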

samskalicky · 2020-10-14T17:36:30Z

scripts/bert/bertpass_gpu.cc

+      Node* node_expand_1_bias = g->addNode(base_name + "_expand_1_bias", "expand_dims");
+      Node* node_expand_2_bias = g->addNode(base_name + "_expand_2_bias", "expand_dims");
+      Node* node_bcst_like = g->addNode(base_name + "_broadcast_like", "broadcast_like");
+      Node* node_add_bias = g->addNode(base_name + "_add_bias", "elemwise_add");


It's nice to see that from 1.7 --> 1.8 we were able to condense the number of lines of code users need to write from 16 down to 4! 👍

samskalicky · 2020-10-14T17:38:12Z

scripts/bert/bertpass_gpu.cc

+      node_expand_1_bias->attrs["axis"]="0";
+      node_expand_1_bias->inputs.resize(1);
+      node_expand_1_bias->inputs[0].node = node_ffn1_bias;
+      node_expand_1_bias->inputs[0].entry = 0;


If you want to make the code a bit more succinct, you could change:

node_expand_1_bias->inputs.resize(1);
node_expand_1_bias->inputs[0].node = node_ffn1_bias;
node_expand_1_bias->inputs[0].entry = 0;

to:

node_expand_1_bias->inputs.emplace_back(node_ffn1_bias, 0);

MoisesHer · 2020-10-16

Thanks for the suggestions. The emplace_back is giving me some problems:

bertpass_gpu.cc:290:64:   required from here
/usr/include/c++/7/ext/new_allocator.h:136:4: error: new initializer expression list treated as compound expression [-fpermissive]
  { ::new((void *)__p) _Up(std::forward<_Args>(__args)...); }
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/c++/7/ext/new_allocator.h:136:4: error: no matching function for call to ‘mxnet::ext::NodeEntry::NodeEntry(int)’
In file included from /gluon-nlp/scripts/bert/bertpass_gpu.cc:31:0:
/incubator-mxnet/include/mxnet/lib_api.h:549:8: note: candidate: mxnet::ext::NodeEntry::NodeEntry()
 struct NodeEntry {
        ^~~~~~~~~
/incubator-mxnet/include/mxnet/lib_api.h:549:8: note:   candidate expects 0 arguments, 1 provided
/incubator-mxnet/include/mxnet/lib_api.h:549:8: note: candidate: constexpr mxnet::ext::NodeEntry::NodeEntry(const mxnet::ext::NodeEntry&)
/incubator-mxnet/include/mxnet/lib_api.h:549:8: note:   no known conversion for argument 1 from ‘int’ to ‘const mxnet::ext::NodeEntry&’
/incubator-mxnet/include/mxnet/lib_api.h:549:8: note: candidate: constexpr mxnet::ext::NodeEntry::NodeEntry(mxnet::ext::NodeEntry&&)
/incubator-mxnet/include/mxnet/lib_api.h:549:8: note:   no known conversion for argument 1 from ‘int’ to ‘mxnet::ext::NodeEntry&&
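
The candidate list in this log (only default, copy and move constructors) suggests NodeEntry is a plain aggregate. Before C++20, `emplace_back(a, b)` constructs the element with parentheses, which cannot aggregate-initialize such a type, whereas `push_back` with a braced list can. A sketch of that workaround follows; the `Node` and `NodeEntry` structs are stand-ins written for illustration (not the real definitions from lib_api.h), and `add_input` is a hypothetical helper.

```cpp
#include <vector>

// Stand-in types for illustration only; the real ones live in mxnet/lib_api.h.
struct Node;
struct NodeEntry {
  Node* node;  // producer node
  int entry;   // index of the producer's output
};
struct Node {
  std::vector<NodeEntry> inputs;
};

// Hypothetical helper: appends output `entry` of `src` as a new input of `dst`.
// The braced list aggregate-initializes NodeEntry, so no constructor is needed.
inline void add_input(Node* dst, Node* src, int entry = 0) {
  dst->inputs.push_back({src, entry});
}
```

Under these assumptions, the three-line resize/assign sequence from the diff would shrink to a single call such as `add_input(node_expand_1_bias, node_ffn1_bias);`.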

samskalicky · 2020-10-14T17:43:46Z

scripts/bert/bertpass_gpu.cc

 REGISTER_PASS(custom_pass)
 .setBody(custom_pass);

 MXReturnValue initialize(int version) {
-  if (version >= 10400) {
+  printf("VERSION %i\n", version);


do you want to print the version twice?
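
If a version message is kept at all, one option is to print it once in readable form. The sketch below assumes the integer passed to `initialize` is encoded as major*10000 + minor*100 + patch (so 10800 would mean 1.8.0); that encoding and the helper names are assumptions for illustration, not something this PR defines.

```cpp
#include <string>

// Hypothetical helper type; field names avoid the glibc major/minor macros.
struct MxVersion {
  int major_v;
  int minor_v;
  int patch_v;
};

// Assumes version = major*10000 + minor*100 + patch.
inline MxVersion decode_version(int version) {
  return {version / 10000, (version / 100) % 100, version % 100};
}

// Formats the decoded version as "major.minor.patch".
inline std::string version_string(int version) {
  const MxVersion v = decode_version(version);
  return std::to_string(v.major_v) + "." + std::to_string(v.minor_v) + "." +
         std::to_string(v.patch_v);
}
```

A single `printf("MXNet version %s\n", version_string(version).c_str());` inside `initialize` would then replace the raw `VERSION %i` debug print.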

samskalicky · 2020-10-14T17:46:36Z

scripts/bert/deploy.py

@@ -321,17 +321,29 @@ def export(prefix):
        arg_array['data2'] = mx.nd.ones((test_batch_size, ), dtype='float32')


@Kh4L didn't you make a change in 1.8 that makes it unnecessary to set the inputs like this?
@MoisesHer is setting the shapes/types still not working when using shape_dict?

MoisesHer · 2020-10-14

Yep, if I set shapes/types, I get a Segmentation fault: 11 as soon as I call sym.optimize_for

mli · 2020-10-16T21:29:17Z

Job PR-1389/2 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/2/index.html

MoisesHer · 2020-10-16T22:37:01Z

I am getting the following error, not sure why:

[2020-10-16T20:59:59.298Z] ERROR: Could not find a version that satisfies the requirement scikit-learn==0.23.2 (from seqeval->-r /var/lib/jenkins/workspace/gluon-nlp-gpu-py3@2/env/gpu/condaenv.vu5hlg68.requirements.txt (line 34)) (from versions: 0.9, 0.10, 0.11, 0.12, 0.12.1, 0.13, 0.13.1, 0.14, 0.14.1, 0.15.0b1, 0.15.0b2, 0.15.0, 0.15.1, 0.15.2, 0.16b1, 0.16.0, 0.16.1, 0.17b1, 0.17, 0.17.1, 0.18rc2, 0.18, 0.18.1, 0.18.2, 0.19b2, 0.19.0, 0.19.1, 0.19.2, 0.20rc1, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.20.4, 0.21rc2, 0.21.0, 0.21.1, 0.21.2, 0.21.3, 0.22rc2.post1, 0.22rc3, 0.22, 0.22.1, 0.22.2.post1)
[2020-10-16T20:59:59.298Z] ERROR: No matching distribution found for scikit-learn==0.23.2 (from seqeval->-r /var/lib/jenkins/workspace/gluon-nlp-gpu-py3@2/env/gpu/condaenv.vu5hlg68.requirements.txt (line 34))
[2020-10-16T20:59:59.298Z] CondaEnvException: Pip failed

mli · 2020-11-12T18:09:08Z

Job PR-1389/3 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/3/index.html

TristonC · 2021-01-07T17:36:09Z

@szha Could someone help with the CI failure?

mli · 2021-02-04T18:13:05Z

Job PR-1389/4 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/4/index.html

mli · 2021-02-04T21:55:20Z

Job PR-1389/5 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/5/index.html

mli · 2021-02-04T22:09:21Z

Job PR-1389/6 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/6/index.html

mli · 2021-02-04T22:40:18Z

Job PR-1389/7 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/7/index.html

mli · 2021-02-04T23:20:40Z

Job PR-1389/9 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/9/index.html

mli · 2021-02-05T00:44:49Z

Job PR-1389/10 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/10/index.html

MoisesHer · 2021-02-05T16:05:47Z

@szha @TristonC
This will never pass the CI tests unless some of the wheels on https://dist.mxnet.io/python include the mxnet-build/src/lib_api.cc file.

https://github.com/apache/incubator-mxnet/blame/2fc0706874531fdfdbe49819eae0c88f8016eee3/tools/pip/setup.py#L108

szha · 2021-02-05T16:43:08Z

I think the cc files are missing from MANIFEST.in. As a result, the cc files are not included in the wheel. cc @samskalicky

@MoisesHer for now would it work if you include the cc file here to get unblocked first?

MoisesHer · 2021-02-05T17:19:29Z

Thanks @szha,
Do you mean including the .cc file as part of this PR within GluonNLP?

szha · 2021-02-05T17:20:25Z

Yes, I think that would be the fastest way to get unblocked on this PR.

samskalicky · 2021-02-05T18:14:04Z

> I think the cc files are missing from MANIFEST.in. As a result, the cc files are not included in the wheel. cc @samskalicky
>
> @MoisesHer for now would it work if you include the cc file here to get unblocked first?

I'm confused: we already had apache/mxnet#19393 to add the file to the pip wheel. After that, @MoisesHer was able to get this working (hence the PR). Why is it not working all of a sudden? Did something else change?

@MoisesHer are you using a different pip wheel than before? Is that wheel built differently?

mli · 2021-02-05T18:15:38Z

Job PR-1389/11 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/11/index.html

MoisesHer · 2021-02-05T18:22:06Z

@samskalicky I was able to make it work locally by cloning MXNet 1.8 and compiling,
but I think we never included the lib_api.cc within a wheel

samskalicky · 2021-02-05T18:37:48Z

> @samskalicky I was able to make it work locally by cloning MXNet 1.8 and compiling,
> but I think we never included the lib_api.cc within a wheel

I added this line:
https://github.com/apache/incubator-mxnet/blob/v1.8.x/tools/pip/setup.py#L108

as part of apache/mxnet#19393 for this specific reason (after our offline discussion on Slack). I remember building the wheel and making sure the file was there.

Maybe I built the wheel differently in my testing than the aws-mx one (or the ones on dist.mxnet.io).

Either way, I'll work with @szha on apache/mxnet#19850 and we'll backport as necessary.

szha · 2021-02-05T18:39:58Z

@samskalicky the problem is that files are only included in the wheel if the manifest file lists them. The if conditions in setup.py only copy the file into the packaging location.

MoisesHer · 2021-02-05T19:52:57Z

@szha apart from the .cc file, the remaining issue is not related to this PR:

./scripts/ner/finetune_bert.py", line 34, in <module>
[2021-02-05T19:30:49.978Z]     import seqeval.metrics
...
File "./scripts/intent_cls_slot_labeling/finetune_icsl.py", line 39, in <module>
[2021-02-05T19:48:32.411Z]     from seqeval.metrics import f1_score as ner_f1_score
...
AttributeError: module 'enum' has no attribute 'Flag'

szha · 2021-02-05T20:35:36Z

@MoisesHer I see. I think the CI needs an update to the python versions. Working on it.

szha · 2021-02-05T22:33:47Z

I finished updating the conda versions on all CI hosts (as well as the docker image). Will babysit this PR.

mli · 2021-02-10T17:41:09Z

Job PR-1389/12 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/12/index.html

szha · 2021-02-10T20:03:39Z

Not sure why python3.5 is still picked up. I'm looking into the CI.

MoisesHer · 2021-03-04T21:41:04Z

Hi @szha, do you have any update? thanks

szha · 2021-03-04T21:44:29Z

We will switch the 0.x CI to github actions after #1531

codecov · 2021-03-21T02:40:38Z

Codecov Report

Merging #1389 (309351d) into v0.10.x (a3ba807) will decrease coverage by 0.01%.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##           v0.10.x    #1389      +/-   ##
===========================================
- Coverage    33.44%   33.43%   -0.02%     
===========================================
  Files          155      155              
  Lines        15206    15213       +7     
===========================================
  Hits          5086     5086              
- Misses       10120    10127       +7

Impacted Files	Coverage Δ
scripts/bert/deploy.py	`0.00% <0.00%> (ø)`
scripts/bert/setup.py	`0.00% <0.00%> (ø)`

github-actions · 2021-03-21T03:37:49Z

The documentation website for preview: http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR1389/309351d3983077bac905d9df4f4575594f1633b5/index.html

szha · 2021-03-21T14:42:51Z

@MoisesHer thanks for the change. Would you also port the change to v0.x?

MoisesHer requested a review from a team as a code owner October 14, 2020 01:26

samskalicky reviewed Oct 14, 2020


MoisesHer and others added 9 commits March 20, 2021 22:30

Make BERT-GPU deploy compatible with MXNet 1.8

9a614c0

Activate CUDA Graphs and clean up

6c012c3

debugging lib_api.cc path

17446f4

fix lint

656145e

fix lint

2ebf53f

fix lint

2086998

debugging lib_api.cc

7cea97c

include lib_api.cc within GluonNLP

3826e0e

remove debugging prints

309351d

szha force-pushed the deploy_BERT_MXNet_1.8 branch from ee1a1db to 309351d Compare March 21, 2021 02:31

szha approved these changes Mar 21, 2021


szha merged commit eed42b4 into dmlc:v0.10.x Mar 21, 2021
