
Batch_dot does not support FP16 well #11796

Closed
szhengac opened this issue Jul 18, 2018 · 9 comments

Comments

@szhengac
Contributor

batch_dot does not support FP16 well and can make training slower than using FP32. This was tested with the Transformer model in GluonNLP. The feature has already been added in NVIDIA's MXNet, so I think it would be good to enable it in master.

szha closed this as completed Jul 18, 2018
szha reopened this Jul 18, 2018
@szha
Member

szha commented Jul 18, 2018

Oops, wrong button. Relevant links: https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/dot-inl.h#L1347-L1364 and https://github.com/dmlc/mshadow/blob/master/mshadow/dot_engine-inl.h#L528-L539. For float the strided batched GEMM is used, while the half_t type falls back to regular GEMM calls. Instead, the strided batched GEMM in cuBLAS, which supports half_t, could be used: https://docs.nvidia.com/cuda/cublas/#cublas-lt-t-gt-gemmbatched
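
For reference, a minimal sketch of what such a call could look like, using cuBLAS' cublasHgemmStridedBatched with contiguously laid-out batches. This is an illustration, not the actual mshadow/MXNet code; the helper name and layout assumptions are hypothetical.

// Hypothetical helper: batched fp16 GEMM via cuBLAS' strided batched API,
// which accepts __half operands directly (instead of looping over regular gemm).
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Computes C[i] = A[i] * B[i] for i in [0, batch). cuBLAS is column-major,
// so m/n/k and the leading dimensions follow that convention here.
cublasStatus_t BatchedHalfGemm(cublasHandle_t handle,
                               const __half* A, const __half* B, __half* C,
                               int m, int n, int k, int batch) {
  const __half alpha = __float2half(1.0f);
  const __half beta = __float2half(0.0f);
  return cublasHgemmStridedBatched(
      handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
      &alpha,
      A, m, static_cast<long long>(m) * k,   // lda, strideA
      B, k, static_cast<long long>(k) * n,   // ldb, strideB
      &beta,
      C, m, static_cast<long long>(m) * n,   // ldc, strideC
      batch);
}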

@ptrendx
Member

ptrendx commented Jul 18, 2018

@DickJC123

@eric-haibin-lin
Member

I'm adding cublas strided gemm calls in mshadow dmlc/mshadow#353

eric-haibin-lin self-assigned this Aug 4, 2018
@eric-haibin-lin
Member

merged

@sbodenstein
Contributor

@szha: can we reopen this? For some reason, the fix from dmlc/mshadow#353 was reverted by this commit by @eric-haibin-lin.

This code, run on MXNet 1.3.0 (latest EC2 Deep Learning AMI):

import mxnet as mx
import time

a = mx.nd.ones((100, 100, 100), ctx=mx.gpu(), dtype='float16')
b = mx.nd.ones((100, 100, 100), ctx=mx.gpu(), dtype='float16')

# Warm-up so the timed loop below excludes one-off initialization costs.
for i in range(10):
    c = mx.nd.batch_dot(a, b)
mx.nd.waitall()

# Time 500 batched matrix multiplications; waitall() blocks until the
# asynchronous GPU work has actually finished.
begin = time.time()
for i in range(500):
    c = mx.nd.batch_dot(a, b)
mx.nd.waitall()
end = time.time()
print(end - begin)

takes 0.9 s on a V100 (versus 0.0318 s when using float32 instead, roughly a 30x slowdown!)

We want to train Transformers with Tensor Cores, but there is no way of doing this in MXNet at the moment (linalg_gemm and linalg_gemm2 unfortunately don't support float16 either, despite it seemingly being implemented here).

What is the plan for exposing any form of GEMM to users with Real16 and TensorCore support?

@szhengac

@eric-haibin-lin
Member

Sorry about the revert. I found that it is better to implement the fp16 ops in MXNet itself rather than in mshadow, since MXNet has built-in functionality to detect and enable Tensor Cores. I can make a PR in maybe two or three days. @sbodenstein, are you using symbol or Gluon to train the Transformer?
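
As an illustration of the "detect/enable Tensor Cores" part, a hypothetical sketch assuming the CUDA 9/10-era cuBLAS math-mode API; this is not the actual MXNet code.

// Hypothetical sketch: opt a cuBLAS handle into Tensor Core math before
// issuing fp16 GEMMs; subsequent Hgemm/GemmEx calls may then pick
// Tensor Core kernels when shapes and data types allow it.
#include <cublas_v2.h>

cublasStatus_t EnableTensorCores(cublasHandle_t handle) {
  return cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
}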

@sbodenstein
Contributor

@eric-haibin-lin: we are using symbol to train the Transformer. It would be great to re-enable this as soon as possible.

Is there any reason not to expose float16 support for linalg_gemm and linalg_gemm2 on GPU as well?

@sbodenstein
Contributor

@eric-haibin-lin: any updates about this?

@eric-haibin-lin
Member

Added in #13716
