SequenceMask is slow on GPU for BERT #14124
Comments
Hey, this is the MXNet Label Bot.
+1 We have also noticed this. As a workaround in some of our code, we have rewritten it to manually mask things out via broadcast_mul with manually constructed masks. Would be nice for SequenceMask itself to be faster though.
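For reference, a rough sketch of that kind of broadcast_mul workaround, assuming a (max_len, batch_size, hidden) layout with the sequence axis first and a mask value of 0 (the helper name and shapes are illustrative, not taken from this thread):

```python
import mxnet as mx

def mask_by_broadcast_mul(data, sequence_length):
    """Zero out steps beyond each sample's length via an explicit mask.

    data: (max_len, batch_size, hidden); sequence_length: (batch_size,)
    """
    max_len = data.shape[0]
    # steps: (max_len, 1), lengths: (1, batch_size)
    steps = mx.nd.arange(max_len, ctx=data.context).reshape((-1, 1))
    lengths = sequence_length.astype('float32').reshape((1, -1))
    # mask[t, b] = 1.0 if t < sequence_length[b] else 0.0
    mask = mx.nd.broadcast_lesser(steps, lengths)
    # expand to (max_len, batch_size, 1) and multiply elementwise
    return mx.nd.broadcast_mul(data, mask.expand_dims(axis=2))

ctx = mx.gpu(0)  # or mx.cpu() if no GPU is available
x = mx.nd.ones((128, 32, 768), ctx=ctx)     # (max_len, batch_size, hidden)
lens = mx.nd.array([100] * 32, ctx=ctx)     # valid length per sample
masked = mask_by_broadcast_mul(x, lens)
```

For a non-zero mask value, the complement of the mask scaled by that value would also have to be added.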
@eric-haibin-lin: that is a great catch, this is definitely slow enough to throttle BERT. I won't have time in the next 2 weeks to work on this, so I would be happy for someone else to rewrite it if this is urgent! It's also a pity that neither TensorFlow nor PyTorch has this operator (and an efficient GPU implementation we could quickly adapt).
Fixed in #14445. Achieved ~80x speedup of both forward and backward on the mentioned workload on GPU.
Awesome job!
I observed that SequenceMask and _backward_SequenceMask take a noticeable amount of time when I train a BERT model on V100 GPUs. Specifically, most ops take ~0.1 ms per invocation on average, but SequenceMask takes 4 ms. It looks like the SequenceMask kernel is not parallelized over elements, but only over the batch dimension: https://github.com/apache/incubator-mxnet/blob/master/src/operator/sequence_mask-inl.h#L87-L103

Can we make this faster? @sbodenstein
Sample input:
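The original sample input is not reproduced in this excerpt. A minimal, illustrative way to observe the per-call cost on GPU might look like the following; the shapes are assumptions loosely matching a BERT-style workload, not the original sample:

```python
import time
import mxnet as mx

ctx = mx.gpu(0)
data = mx.nd.random.normal(shape=(128, 32, 768), ctx=ctx)  # (max_len, batch_size, hidden)
lengths = mx.nd.array([100] * 32, ctx=ctx)

mx.nd.waitall()                      # make sure setup work is finished
start = time.time()
for _ in range(100):
    out = mx.nd.SequenceMask(data, sequence_length=lengths,
                             use_sequence_length=True, value=0)
mx.nd.waitall()                      # wait for the async GPU work to complete
print('SequenceMask: %.3f ms/call' % ((time.time() - start) * 1000 / 100))
```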