This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

Make BERT-GPU deploy compatible with MXNet 1.8 #1389

Merged 9 commits into dmlc:v0.10.x on Mar 21, 2021

MoisesHer
Contributor

@MoisesHer MoisesHer commented Oct 14, 2020

Description

Change the custom graph pass implementation to make it compatible with MXNet 1.8.
Resolves issue #1388.

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Change custom graph pass to support both MXNet 1.7 & MXNet 1.8
  • Change setup and deploy scripts accordingly
  • Activate CUDA Graphs for MXNet 1.8 (> 30% speedup with small batch sizes)

cc @dmlc/gluon-nlp-team, @samskalicky, @Kh4L

@MoisesHer MoisesHer requested a review from a team as a code owner October 14, 2020 01:26
@mli
Member

mli commented Oct 14, 2020

Job PR-1389/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/1/index.html

@@ -30,6 +30,7 @@
#include <functional>
#include "mxnet/lib_api.h"

#if MXNET_1_7

If you want, you can use the defines in lib_api.h to enable the appropriate code paths:
MXNet 1.7: #define MX_LIBRARY_VERSION 7
MXNet 1.8: #define MX_LIBRARY_VERSION 10
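A minimal sketch of how that suggestion could look, assuming lib_api.h really does provide MX_LIBRARY_VERSION with the values quoted above (7 for MXNet 1.7, 10 for MXNet 1.8). The fallback define and the graph_api_name helper are hypothetical and exist only so the snippet compiles without the MXNet headers:

```cpp
#include <string>

// In the real pass this macro would come from "mxnet/lib_api.h".
// The fallback below is an assumption so the sketch compiles standalone.
#ifndef MX_LIBRARY_VERSION
#define MX_LIBRARY_VERSION 10
#endif

// Hypothetical helper: reports which code path was compiled in.
inline std::string graph_api_name() {
#if MX_LIBRARY_VERSION <= 7
  // The MXNet 1.7-specific pass implementation would go here.
  return "mxnet-1.7";
#else
  // The MXNet 1.8-specific pass implementation would go here.
  return "mxnet-1.8";
#endif
}
```

This keys the code path off the header that is actually included, so a separate MXNET_1_7 flag would not need to be passed at build time.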

Node* node_expand_1_bias = g->addNode(base_name + "_expand_1_bias", "expand_dims");
Node* node_expand_2_bias = g->addNode(base_name + "_expand_2_bias", "expand_dims");
Node* node_bcst_like = g->addNode(base_name + "_broadcast_like", "broadcast_like");
Node* node_add_bias = g->addNode(base_name + "_add_bias", "elemwise_add");

It's nice to see that from 1.7 --> 1.8 we were able to condense the number of lines of code users need to write from 16 down to 4! 👍

node_expand_1_bias->attrs["axis"]="0";
node_expand_1_bias->inputs.resize(1);
node_expand_1_bias->inputs[0].node = node_ffn1_bias;
node_expand_1_bias->inputs[0].entry = 0;

@samskalicky samskalicky Oct 14, 2020

if you want to make the code a bit more succinct you could change:

      node_expand_1_bias->inputs.resize(1);
      node_expand_1_bias->inputs[0].node = node_ffn1_bias;
      node_expand_1_bias->inputs[0].entry = 0;

to:

      node_expand_1_bias->inputs.emplace_back(node_ffn1_bias, 0);

Contributor Author

thanks for the suggestions. the emplace_back is giving me some problems:

bertpass_gpu.cc:290:64:   required from here
/usr/include/c++/7/ext/new_allocator.h:136:4: error: new initializer expression list treated as compound expression [-fpermissive]
  { ::new((void *)__p) _Up(std::forward<_Args>(__args)...); }
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/c++/7/ext/new_allocator.h:136:4: error: no matching function for call to ‘mxnet::ext::NodeEntry::NodeEntry(int)’
In file included from /gluon-nlp/scripts/bert/bertpass_gpu.cc:31:0:
/incubator-mxnet/include/mxnet/lib_api.h:549:8: note: candidate: mxnet::ext::NodeEntry::NodeEntry()
 struct NodeEntry {
        ^~~~~~~~~
/incubator-mxnet/include/mxnet/lib_api.h:549:8: note:   candidate expects 0 arguments, 1 provided
/incubator-mxnet/include/mxnet/lib_api.h:549:8: note: candidate: constexpr mxnet::ext::NodeEntry::NodeEntry(const mxnet::ext::NodeEntry&)
/incubator-mxnet/include/mxnet/lib_api.h:549:8: note:   no known conversion for argument 1 from ‘int’ to ‘const mxnet::ext::NodeEntry&’
/incubator-mxnet/include/mxnet/lib_api.h:549:8: note: candidate: constexpr mxnet::ext::NodeEntry::NodeEntry(mxnet::ext::NodeEntry&&)
/incubator-mxnet/include/mxnet/lib_api.h:549:8: note:   no known conversion for argument 1 from ‘int’ to ‘mxnet::ext::NodeEntry&&’
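
The candidate list in this log suggests that NodeEntry is a plain aggregate with no user-declared constructor. Before C++20, emplace_back(a, b) constructs the element with parentheses, which does not perform aggregate initialization, so the call cannot compile. A sketch with stand-in types (the Node and NodeEntry below are hypothetical simplifications, not the definitions in lib_api.h, and add_input is an illustrative helper) shows a brace-initialized push_back that keeps the wiring on one line:

```cpp
#include <string>
#include <vector>

struct Node;  // forward declaration so NodeEntry can point at it

// Stand-in for an aggregate graph edge: no constructors are declared.
struct NodeEntry {
  Node* node;
  int entry;
};

// Stand-in for a graph node that owns its list of input edges.
struct Node {
  std::string name;
  std::vector<NodeEntry> inputs;
};

// Hypothetical helper: connects output `entry` of `src` as the next input of `dst`.
inline void add_input(Node* dst, Node* src, int entry) {
  // Braces trigger aggregate initialization. emplace_back(src, entry)
  // would need a NodeEntry(Node*, int) constructor before C++20.
  dst->inputs.push_back(NodeEntry{src, entry});
}
```

With the types in the PR, the equivalent call would presumably be node_expand_1_bias->inputs.push_back({node_ffn1_bias, 0}), assuming the member order is node followed by entry.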

REGISTER_PASS(custom_pass)
.setBody(custom_pass);

MXReturnValue initialize(int version) {
if (version >= 10400) {
printf("VERSION %i\n", version);

do you want to print the version twice?
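
One way to avoid the duplicate output is to format the message once and branch only on the return value. The sketch below is not the code from the PR: MXReturnValue is re-declared as a stand-in for the type in lib_api.h, check_version is a hypothetical name, and only the 10400 threshold is taken from the diff.

```cpp
#include <cstdio>

// Stand-in for the return type declared in lib_api.h (assumption).
enum MXReturnValue { MX_FAIL = 0, MX_SUCCESS = 1 };

// Hypothetical version check: logs the version exactly once on either path.
inline MXReturnValue check_version(int version) {
  const bool supported = version >= 10400;  // threshold taken from the diff
  std::printf("MXNet version %i %s\n", version,
              supported ? "supported" : "not supported");
  return supported ? MX_SUCCESS : MX_FAIL;
}
```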

@@ -321,17 +321,29 @@ def export(prefix):
arg_array['data2'] = mx.nd.ones((test_batch_size, ), dtype='float32')

@Kh4L didn't you make a change in 1.8 that removes the need to set the inputs like this?
@MoisesHer is setting the shapes/types still not working when using shape_dict?

Contributor Author

Yep, if I set shapes/types, I get a Segmentation fault: 11 as soon as I call sym.optimize_for.

@mli
Member

mli commented Oct 16, 2020

Job PR-1389/2 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/2/index.html

@MoisesHer
Contributor Author

I am getting the following error, not sure why:

[2020-10-16T20:59:59.298Z] ERROR: Could not find a version that satisfies the requirement scikit-learn==0.23.2 (from seqeval->-r /var/lib/jenkins/workspace/gluon-nlp-gpu-py3@2/env/gpu/condaenv.vu5hlg68.requirements.txt (line 34)) (from versions: 0.9, 0.10, 0.11, 0.12, 0.12.1, 0.13, 0.13.1, 0.14, 0.14.1, 0.15.0b1, 0.15.0b2, 0.15.0, 0.15.1, 0.15.2, 0.16b1, 0.16.0, 0.16.1, 0.17b1, 0.17, 0.17.1, 0.18rc2, 0.18, 0.18.1, 0.18.2, 0.19b2, 0.19.0, 0.19.1, 0.19.2, 0.20rc1, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.20.4, 0.21rc2, 0.21.0, 0.21.1, 0.21.2, 0.21.3, 0.22rc2.post1, 0.22rc3, 0.22, 0.22.1, 0.22.2.post1)
[2020-10-16T20:59:59.298Z] ERROR: No matching distribution found for scikit-learn==0.23.2 (from seqeval->-r /var/lib/jenkins/workspace/gluon-nlp-gpu-py3@2/env/gpu/condaenv.vu5hlg68.requirements.txt (line 34))
[2020-10-16T20:59:59.298Z] CondaEnvException: Pip failed

@mli
Member

mli commented Nov 12, 2020

Job PR-1389/3 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/3/index.html

@TristonC

TristonC commented Jan 7, 2021

@szha Could we have someone help with the CI failure?

@mli
Member

mli commented Feb 4, 2021

Job PR-1389/4 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/4/index.html

@mli
Member

mli commented Feb 4, 2021

Job PR-1389/5 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/5/index.html

@mli
Member

mli commented Feb 4, 2021

Job PR-1389/6 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/6/index.html

@mli
Member

mli commented Feb 4, 2021

Job PR-1389/7 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/7/index.html

@mli
Member

mli commented Feb 4, 2021

Job PR-1389/9 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/9/index.html

@mli
Member

mli commented Feb 5, 2021

Job PR-1389/10 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/10/index.html

@MoisesHer
Contributor Author

@szha @TristonC
This will never pass the CI tests unless some of the wheels on https://dist.mxnet.io/python include the mxnet-build/src/lib_api.cc file.

https://github.com/apache/incubator-mxnet/blame/2fc0706874531fdfdbe49819eae0c88f8016eee3/tools/pip/setup.py#L108

@szha
Member

szha commented Feb 5, 2021

I think the cc files are missing from the MANIFEST.in. As a result, the cc files are not included in the wheel. cc @samskalicky

@MoisesHer for now would it work if you include the cc file here to get unblocked first?

@MoisesHer
Contributor Author

Thanks @szha,
Do you mean to include the .cc file as part of this PR within Gluon-NLP?

@szha
Member

szha commented Feb 5, 2021

Yes, I think that would be the fastest way to get unblocked on this PR.

@samskalicky

> I think the cc files are missed from the MANIFEST.in. As a result the cc files are not included in the wheel. cc @samskalicky
>
> @MoisesHer for now would it work if you include the cc file here to get unblocked first?

I'm confused; we already had PR apache/mxnet#19393 to add the file to the pip wheel. After that, @MoisesHer was able to get this working (hence this PR). Why is it not working all of a sudden? Did something else change?

@MoisesHer are you using a different pip wheel than before? Is that wheel built differently?

@mli
Member

mli commented Feb 5, 2021

Job PR-1389/11 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/11/index.html

@MoisesHer
Contributor Author

@samskalicky I was able to make it work locally by cloning MXNet 1.8 and compiling it,
but I think we never included lib_api.cc in a wheel.

@samskalicky

> @samskalicky I was able to make it work locally by cloning MXNet 1.8 and compiling,
> but I think we never included the lib_api.cc within a wheel

I added this line:
https://github.com/apache/incubator-mxnet/blob/v1.8.x/tools/pip/setup.py#L108

as part of apache/mxnet#19393 for this specific reason (after our offline discussion on Slack). I remember building the wheel and making sure the file was there.

Maybe I built the wheel differently in my testing than the aws-mx one (or the ones on dist.mxnet.io).

Either way, I'll work with @szha on apache/mxnet#19850 and we'll backport as necessary.

@szha
Member

szha commented Feb 5, 2021

@samskalicky the problem is that files are only included in the wheel if the manifest file lists them. The if conditions in setup.py only copy the file into the packaging location.

@MoisesHer
Contributor Author

MoisesHer commented Feb 5, 2021

@szha apart from the .cc file, the remaining issue is not related to this PR:

File "./scripts/ner/finetune_bert.py", line 34, in <module>
[2021-02-05T19:30:49.978Z]     import seqeval.metrics
...
File "./scripts/intent_cls_slot_labeling/finetune_icsl.py", line 39, in <module>
[2021-02-05T19:48:32.411Z]     from seqeval.metrics import f1_score as ner_f1_score
...
AttributeError: module 'enum' has no attribute 'Flag'

@szha
Member

szha commented Feb 5, 2021

@MoisesHer I see. I think the CI needs an update to the python versions. Working on it.

@szha
Member

szha commented Feb 5, 2021

I finished updating the conda versions on all CI hosts (as well as the docker image). Will babysit this PR.

@mli
Member

mli commented Feb 10, 2021

Job PR-1389/12 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/12/index.html

@szha
Member

szha commented Feb 10, 2021

Not sure why python3.5 is still picked up. I'm looking into the CI.

@MoisesHer
Contributor Author

Hi @szha, do you have any update? thanks

@szha
Member

szha commented Mar 4, 2021

We will switch the 0.x CI to github actions after #1531

@codecov

codecov bot commented Mar 21, 2021

Codecov Report

Merging #1389 (309351d) into v0.10.x (a3ba807) will decrease coverage by 0.01%.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##           v0.10.x    #1389      +/-   ##
===========================================
- Coverage    33.44%   33.43%   -0.02%     
===========================================
  Files          155      155              
  Lines        15206    15213       +7     
===========================================
  Hits          5086     5086              
- Misses       10120    10127       +7     
Impacted Files            Coverage Δ
scripts/bert/deploy.py    0.00% <0.00%> (ø)
scripts/bert/setup.py     0.00% <0.00%> (ø)

@szha szha merged commit eed42b4 into dmlc:v0.10.x Mar 21, 2021
@szha
Member

szha commented Mar 21, 2021

@MoisesHer thanks for the change. Would you also port the change to v0.x?
