
gpu: nvidia: Add support for cublaslt matmul #1972

Open: wants to merge 4 commits into main
Conversation

dylan-angus-codeplay (Contributor)

Description

Adds support for using the cublasLt API for IMMA kernels and for cases where the bias and/or ReLU post-op can be merged into the cublasLt epilogue.

@vpirogov added this to the v3.6 milestone on Jun 21, 2024
@dzarukin (Contributor) left a comment

Common part looks good to me. Didn't review the cudnn part.

    interop_task(matmul_impl_, engine, cgh, cuda_stream, arg_wt,
            arg_src, arg_dst, arg_bias, arg_algo_scratch,
            arg_bias_scratch, arg_block_a_scratch, arg_block_b_scratch,
            arg_block_c_scratch, arg_src_scale, arg_wei_scale,
            arg_dst_scale);

IIUC, all these execute functions are almost identical. Would it make sense to use a single execute function in the base class to avoid boilerplate?
This common execute function could use conditionals to guard some pieces of code, for example:

if (has_runtime_dims())
    init_scratch_buffers(bias_scratch_size, algo_scratch_size);
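
A shared execute along those lines might look roughly like the sketch below; the class and helper names (cudnn_matmul_base_t, maybe_transform_inputs, do_gemm) are illustrative, not the actual oneDNN code:

    // Hypothetical shared execute; only the pieces that differ between the
    // cublas and cublasLt paths are delegated to overridable helpers.
    status_t cudnn_matmul_base_t::execute(const exec_ctx_t &ctx) const {
        if (has_runtime_dims())
            init_scratch_buffers(bias_scratch_size, algo_scratch_size);
        maybe_transform_inputs(ctx); // no-op for the plain cublas path
        return do_gemm(ctx);         // backend-specific matmul call
    }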

        transform_matrix(lt_handle, a_layout_, a, blocked_a_layout_,
                block_a_scratch, trans_a_, streamId);
        a = block_a_scratch;
    }

I guess the optimal approach would be to do the reorders separately from the execute call, to let the user schedule them and reduce copy overheads.

In any case, is there any benefit to doing this reorder inside the execute call?

Without adding this transform inside the execute, the IMMA kernels for cublasLt will only run with Ab32a for the weights and output; with the transform we can support other data formats as well.

The user can currently do the transform for the weights by reordering to Ab32a and passing it to the matmul pd as such; the transform will not be called in this case.

For now we do not have a reorder mapping the input matrix to the Ampere blocked format (due to the interleaved nature of the blocking), hence we need to do the transform inside the execute.

For the output, to schedule a transform at another time, the user can set the output of the matmul to Ab32a and reorder afterwards.
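
A minimal sketch of the output side of that flow, where everything except transform_matrix and block_c_scratch is an illustrative name rather than code from this PR:

    // Hypothetical: the LT matmul writes into a scratch buffer in the blocked
    // layout, then the result is transformed back to the user's dst format
    // within the same execute call.
    void *c = dst_is_blocked_ ? dst : block_c_scratch;
    run_lt_matmul(lt_handle, a, b, c, streamId);
    if (!dst_is_blocked_)
        transform_matrix(lt_handle, blocked_c_layout_, block_c_scratch,
                c_layout_, dst, /*trans_c=*/false, streamId);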

(CUdeviceptr)dst_scale, sizeof(float), streamId);
// For eltwise post-ops, apply the dst scale afterward
if (!with_separate_eltwise_) scale /= host_dst_scale;
}

I believe cublasLT can be configured to take device pointers (CUBLASLT_POINTER_MODE_DEVICE)
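
If that route were taken, the attribute would be set on the matmul descriptor roughly as follows (a sketch using the public cublasLt API, not code from this PR):

    // Tell cublasLt that alpha/beta are device pointers.
    cublasLtPointerMode_t mode = CUBLASLT_POINTER_MODE_DEVICE;
    cublasLtMatmulDescSetAttribute(matmul_desc,
            CUBLASLT_MATMUL_DESC_POINTER_MODE, &mode, sizeof(mode));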

That is true. However, the src, wei, and dst scales map to the cublasLt matmul alpha parameter, and the sum post-op maps to beta, so I do not believe device pointer mode would work here, short of creating a SYCL kernel that computes
alpha = (1 * src_scale * wei_scale) / dst_scale on the device side before passing it to the matmul.
Would that be the preferred approach?
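
Such a kernel could look roughly like the sketch below, assuming a SYCL queue q and single-element device-side scale buffers; all names here are illustrative:

    #include <sycl/sycl.hpp>

    // Hypothetical: fold the three scales into a single device-side alpha so
    // cublasLt could consume it in CUBLASLT_POINTER_MODE_DEVICE.
    void compute_alpha(sycl::queue &q, const float *src_scale,
            const float *wei_scale, const float *dst_scale, float *alpha) {
        q.single_task([=]() {
            alpha[0] = (src_scale[0] * wei_scale[0]) / dst_scale[0];
        });
    }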


It would work with oneDNN semantics only if this implementation didn't support post-ops. If it does, the scales must be kept separate, since src and weight scales are applied before the post-ops and dst scales are applied at the end.
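
Roughly, the intended order is

    dst = (1 / dst_scale) * post_ops(src_scale * wei_scale * matmul(src, wei))

so folding all three scales into a single alpha is only correct when the post-op chain is empty (this formula paraphrases the comment above, not the exact oneDNN spec wording).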

@vpirogov (Member)

make test
enable device_gpu
enable thr_sycl
enable thr_cuda

@ShanoToni

As a note on the latest commit: the IMMA kernels do not support an int8 output type for cublasLt with CUDA versions prior to 12. This is now handled at compile time with a define: if the CUDA version is less than 12, the primitive descriptor returns not supported for those cases.
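
A compile-time guard of that kind would look roughly like the sketch below; CUDA_VERSION is the standard macro from the CUDA headers, while use_imma_, dst_md(), and data_type::s8 are used here only as oneDNN-style placeholders, not necessarily the exact code in the PR:

    #if CUDA_VERSION < 12000
        // IMMA via cublasLt cannot produce int8 output before CUDA 12.
        if (use_imma_ && dst_md()->data_type == data_type::s8)
            return status::unimplemented;
    #endif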

Commits:
- Matmul PD refactor
- Skipping unsupported tests for lt impl
- Addressed MR comments
- Added checks to new bias