gpu: nvidia: Add support for cublaslt matmul #1972
base: main
Conversation
Common part looks good to me. Didn't review the cudnn part.
interop_task(matmul_impl_, engine, cgh, cuda_stream, arg_wt,
        arg_src, arg_dst, arg_bias, arg_algo_scratch,
        arg_bias_scratch, arg_block_a_scratch, arg_block_b_scratch,
        arg_block_c_scratch, arg_src_scale, arg_wei_scale,
        arg_dst_scale);
IIUC, all these execute functions are almost identical. Would it make sense to use a single execute function in the base class to avoid boilerplate?
This common execute function could use conditionals to guard some pieces of code, for example:
if (has_runtime_dims())
init_scratch_buffers(bias_scratch_size, algo_scratch_size);
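A minimal sketch of what such a shared execute could look like; the class and member names here (`matmul_exec_base_t`, `has_runtime_dims`, `init_scratch_buffers`, `launch_kernel`) are illustrative assumptions, not the actual oneDNN ones:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical base class with one shared execute(); variant-specific
// pieces become small virtual hooks guarded by conditionals.
struct matmul_exec_base_t {
    virtual ~matmul_exec_base_t() = default;

    virtual bool has_runtime_dims() const { return false; }
    virtual void init_scratch_buffers(size_t bias_sz, size_t algo_sz) {}
    virtual void launch_kernel() = 0;

    // The single common execute lives in the base class.
    void execute(size_t bias_scratch_size, size_t algo_scratch_size) {
        if (has_runtime_dims())
            init_scratch_buffers(bias_scratch_size, algo_scratch_size);
        launch_kernel();
    }
};

// One concrete variant: needs runtime scratch, so it enables the hook.
struct runtime_dims_exec_t : matmul_exec_base_t {
    std::vector<char> scratch;
    bool launched = false;
    bool has_runtime_dims() const override { return true; }
    void init_scratch_buffers(size_t bias_sz, size_t algo_sz) override {
        scratch.resize(bias_sz + algo_sz);
    }
    void launch_kernel() override { launched = true; }
};
```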
transform_matrix(lt_handle, a_layout_, a, blocked_a_layout_,
        block_a_scratch, trans_a_, streamId);
a = block_a_scratch;
}
I guess it would be most optimal to do the reorders separately from the execute call, to let the user schedule them and reduce copy overheads.
In any case, is there any benefit to doing this reorder inside the execute call?
Without this transform inside the execute, the IMMA kernels for cublasLt will only run with Ab32a for the weights and output; with the transform we can support other data formats as well.
The user can currently perform the transform for the weights themselves by reordering to Ab32a and passing it to the matmul pd as such; the transform will not be called in this case.
For now we do not have a reorder mapping the input matrix to the Ampere blocked format (due to the interleaved nature of the blocking), hence we need to do the transform inside the execute.
For the output, to schedule a transform at another time the user can set the output of the matmul to Ab32a and reorder afterward.
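To illustrate the plain Ab32a blocking the user can pre-reorder the weights into, here is a hedged, self-contained sketch (the outer dimension is split into blocks of 32, giving index order `(row / 32, col, row % 32)`). Note this is only an illustration of the tag's meaning, not the library implementation, and the Ampere blocked format additionally interleaves elements, which is why a plain reorder cannot produce it:

```cpp
#include <cstddef>
#include <vector>

// Pack a row-major rows x cols matrix into an Ab32a-style layout.
// Element (i, j) lands at (i / 32) * cols * 32 + j * 32 + (i % 32).
// Rows are padded up to a multiple of 32 with zeros.
std::vector<float> pack_Ab32a(
        const std::vector<float> &src, size_t rows, size_t cols) {
    const size_t blk = 32;
    const size_t row_blocks = (rows + blk - 1) / blk;
    std::vector<float> dst(row_blocks * cols * blk, 0.f);
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            dst[(i / blk) * cols * blk + j * blk + i % blk]
                    = src[i * cols + j];
    return dst;
}
```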
(CUdeviceptr)dst_scale, sizeof(float), streamId);
// For eltwise post-ops, apply the dst scale afterward
if (!with_separate_eltwise_) scale /= host_dst_scale;
}
I believe cublasLt can be configured to take device pointers (CUBLASLT_POINTER_MODE_DEVICE)
That is true; however, the src, wei and dst scales map to the cublasLt matmul alpha param, and the sum post-op maps to beta.
I do not believe using the device pointer mode would work in this case, short of creating a SYCL kernel to compute
alpha = (1 * src_scale * wei_scale) / dst_scale
on the device side before passing it to the matmul.
Would that be the preferred approach?
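For reference, the folding described above maps onto the cublasLt GEMM semantics `D = alpha * (A * B) + beta * C` roughly as in this host-side sketch (names are illustrative; with CUBLASLT_POINTER_MODE_DEVICE the same product would have to be computed in a small device kernel instead):

```cpp
// Sketch of how the oneDNN scales fold into the single cublasLt
// alpha parameter, and the sum post-op into beta.
struct gemm_scales_t {
    float alpha; // = src_scale * wei_scale / dst_scale
    float beta; // = sum post-op scale (0 when absent)
};

gemm_scales_t fold_scales(float src_scale, float wei_scale,
        float dst_scale, bool with_sum, float sum_scale) {
    gemm_scales_t s;
    s.alpha = (1.f * src_scale * wei_scale) / dst_scale;
    s.beta = with_sum ? sum_scale : 0.f;
    return s;
}
```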
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would work with oneDNN semantics only if this implementation did not support post-ops. If it does, the scales must be kept separate: src and wei scales are applied before post-ops, and dst scales are applied at the end.
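A tiny numeric sketch of why folding dst scale into alpha breaks once an eltwise post-op is present (helper names are illustrative; for ReLU with positive scales the two orderings happen to coincide since ReLU is positively homogeneous, but for a general eltwise such as clip they diverge):

```cpp
#include <algorithm>

// oneDNN semantics: dst = eltwise(src_scale * wei_scale * acc) / dst_scale,
// i.e. the dst scale is applied after the post-op.
float apply_separate(float acc, float sw, float dst_scale, float clip_hi) {
    float v = std::min(sw * acc, clip_hi); // post-op: clip upper bound
    return v / dst_scale; // dst scale applied last
}

// Folding dst_scale into alpha applies it before the nonlinearity,
// which changes the result whenever the post-op is not homogeneous.
float apply_folded(float acc, float sw, float dst_scale, float clip_hi) {
    return std::min((sw / dst_scale) * acc, clip_hi);
}
```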
make test
As a note on the latest commit: IMMA kernels do not support an int8 output type for cublasLt with CUDA versions prior to 12. This is now handled at compile time with a define: if the CUDA version is less than 12, the primitive descriptor returns unsupported for those cases.
Matmul PD refactor
Skipping unsupported tests for lt impl
Addressed MR comments
Added checks to new bias
Description
Adds support for using the cublasLt API for IMMA kernels, and for the cases where the bias and/or ReLU post-op can be merged into the cublasLt epilogue.