
SequenceMask is slow on GPU for BERT #14124

Closed
eric-haibin-lin opened this issue Feb 12, 2019 · 5 comments

Comments

@eric-haibin-lin
Member

eric-haibin-lin commented Feb 12, 2019

I observed that SequenceMask and _backward_SequenceMask take a noticeable amount of time when I train a BERT model on V100 GPUs. Specifically, most ops take ~0.1 ms per invocation on average, but SequenceMask takes 4 ms. It looks like the SequenceMask kernel is not parallelized per element, but rather over the batch dimension (a rough rendering of the loop structure is sketched below the link):
https://github.com/apache/incubator-mxnet/blob/master/src/operator/sequence_mask-inl.h#L87-L103
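For intuition, here is a rough Python rendering (an illustrative sketch, not the actual kernel source) of the loop structure at that link, written for the batch-major (batch, seq_len, rest) layout of the BERT case: the only parallel index is the batch index, so with batch size 8, just 8 threads serially cover all 8 × 512 × 768 elements.

```python
def sequence_mask_loop(data, seq_length, value=0.0):
    """Illustrative per-batch masking loop (names are hypothetical)."""
    batch_size, max_seq_len, restsize = data.shape
    for b in range(batch_size):  # the only parallelized dimension
        # everything below runs serially within one "thread"
        for s in range(int(seq_length[b]), max_seq_len):
            for r in range(restsize):
                data[b, s, r] = value
    return data
```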

Can we make this faster? @sbodenstein

Sample input:

data.shape = (8L, 512L, 768L)
seq_length = [ 18.  35.  34. 100. 110. 194. 512.  10.]
dtype = "float16"
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Performance

@eric-haibin-lin changed the title from "SequenceMask is slow on GPU" to "SequenceMask is slow on GPU for BERT" on Feb 12, 2019
@stephenrawls
Contributor

+1

We have also noticed this.

As a workaround in some of our code, we have rewritten it to mask things out via broadcast_mul with manually constructed masks (a sketch follows below).

It would be nice for SequenceMask itself to be faster, though.
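A minimal sketch of that workaround, assuming the batch-major (batch, seq_len, hidden) layout from the issue (the function and variable names here are illustrative, not our production code):

```python
import mxnet as mx

def mask_via_broadcast(data, seq_length):
    """Zero out positions >= seq_length[b] using a broadcast multiply."""
    batch_size, max_len, _ = data.shape
    # positions: shape (1, max_len); lengths: shape (batch, 1)
    positions = mx.nd.arange(max_len, ctx=data.context).reshape((1, -1))
    lengths = seq_length.as_in_context(data.context).reshape((-1, 1))
    # (batch, max_len) 0/1 mask: 1 where position < length
    mask = mx.nd.broadcast_lesser(positions, lengths)
    mask = mask.astype(data.dtype).reshape((batch_size, max_len, 1))
    return mx.nd.broadcast_mul(data, mask)

# e.g. masked = mask_via_broadcast(data, seq_len)  # shapes as in the issue
```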

@sbodenstein
Contributor

sbodenstein commented Feb 15, 2019

@eric-haibin-lin: that is a great catch; this is definitely slow enough to throttle BERT.

I won't have time in the next two weeks to work on this, so I would be happy for someone else to rewrite it if this is urgent!

It's also a pity that neither TensorFlow nor PyTorch has this operator (and thus an efficient GPU implementation we could quickly adapt).

@haojin2
Contributor

haojin2 commented Mar 16, 2019

Fix in #14445. Achieved a ~80x speedup for both the forward and backward passes on the mentioned workload on GPU.

@szha
Member

szha commented Mar 16, 2019

Awesome job!
