
SequenceMask is slow on GPU for BERT #14124

Closed
eric-haibin-lin opened this issue Feb 12, 2019 · 5 comments

Comments

@eric-haibin-lin
Member

eric-haibin-lin commented Feb 12, 2019

I observed that SequenceMask and _backward_SequenceMask take a noticeable amount of time when I train a BERT model on V100 GPUs. Specifically, most ops take ~0.1 ms per invocation on average, but SequenceMask takes 4 ms. It looks like the SequenceMask kernel is not parallelized per element, but rather over the batch dimension (a rough rendering of the loop structure is sketched below the link):
https://github.com/apache/incubator-mxnet/blob/master/src/operator/sequence_mask-inl.h#L87-L103
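For intuition, here is a rough Python rendering (an illustrative sketch, not the actual kernel source) of the loop structure at that link, written for the batch-major (batch, seq_len, rest) layout of the BERT case: the only parallel index is the batch index, so with batch size 8, just 8 threads serially cover all 8 × 512 × 768 elements.

```python
def sequence_mask_loop(data, seq_length, value=0.0):
    """Illustrative per-batch masking loop (names are hypothetical)."""
    batch_size, max_seq_len, restsize = data.shape
    for b in range(batch_size):  # the only parallelized dimension
        # everything below runs serially within one "thread"
        for s in range(int(seq_length[b]), max_seq_len):
            for r in range(restsize):
                data[b, s, r] = value
    return data
```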

Can we make this faster? @sbodenstein

Sample input:

data.shape = (8L, 512L, 768L)
seq_length = [ 18.  35.  34. 100. 110. 194. 512.  10.]
dtype = "float16"
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Performance

@eric-haibin-lin changed the title from "SequenceMask is slow on GPU" to "SequenceMask is slow on GPU for BERT" on Feb 12, 2019
@stephenrawls
Contributor

+1

We have also noticed this.

As a workaround in some of our code, we have rewritten it to mask things out via broadcast_mul with manually constructed masks (a sketch follows below).

It would be nice for SequenceMask itself to be faster, though.
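A minimal sketch of that workaround, assuming the batch-major (batch, seq_len, hidden) layout from the issue (the function and variable names here are illustrative, not our production code):

```python
import mxnet as mx

def mask_via_broadcast(data, seq_length):
    """Zero out positions >= seq_length[b] using a broadcast multiply."""
    batch_size, max_len, _ = data.shape
    # positions: shape (1, max_len); lengths: shape (batch, 1)
    positions = mx.nd.arange(max_len, ctx=data.context).reshape((1, -1))
    lengths = seq_length.as_in_context(data.context).reshape((-1, 1))
    # (batch, max_len) 0/1 mask: 1 where position < length
    mask = mx.nd.broadcast_lesser(positions, lengths)
    mask = mask.astype(data.dtype).reshape((batch_size, max_len, 1))
    return mx.nd.broadcast_mul(data, mask)

# e.g. masked = mask_via_broadcast(data, seq_len)  # shapes as in the issue
```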

@sbodenstein
Contributor

sbodenstein commented Feb 15, 2019

@eric-haibin-lin: that is a great catch; this is definitely slow enough to throttle BERT.

I won't have time in the next two weeks to work on this, so I would be happy for someone else to rewrite it if this is urgent!

It's also a pity that neither TensorFlow nor PyTorch has this operator (and thus an efficient GPU implementation we could quickly adapt).

@haojin2
Contributor

haojin2 commented Mar 16, 2019

Fix in #14445. Achieved a ~80x speedup for both the forward and backward passes on the mentioned workload on GPU.

@szha
Member

szha commented Mar 16, 2019

Awesome job!
