
gpu: nvidia: Add support for cublaslt matmul #1972

Open: wants to merge 4 commits into main
Conversation

dylan-angus-codeplay (Contributor)

Description

Adds support for using the cublasLt API for IMMA kernels and for cases where the bias and/or ReLU post-op can be merged into the cublasLt epilogue.

@vpirogov added this to the v3.6 milestone on Jun 21, 2024
@dzarukin (Contributor) left a comment

Common part looks good to me. Didn't review the cudnn part.

    interop_task(matmul_impl_, engine, cgh, cuda_stream, arg_wt,
            arg_src, arg_dst, arg_bias, arg_algo_scratch,
            arg_bias_scratch, arg_block_a_scratch, arg_block_b_scratch,
            arg_block_c_scratch, arg_src_scale, arg_wei_scale,
            arg_dst_scale);

IIUC, all these execute functions are almost identical. Would it make sense to use a single execute function in the base class to avoid boilerplate?
This common execute function could use conditionals to guard some pieces of code, for example:

if (has_runtime_dims())
    init_scratch_buffers(bias_scratch_size, algo_scratch_size);
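
A shared execute along those lines might look roughly like the sketch below; the class and helper names (cudnn_matmul_base_t, maybe_transform_inputs, do_gemm) are illustrative, not the actual oneDNN code:

    // Hypothetical shared execute; only the pieces that differ between the
    // cublas and cublasLt paths are delegated to overridable helpers.
    status_t cudnn_matmul_base_t::execute(const exec_ctx_t &ctx) const {
        if (has_runtime_dims())
            init_scratch_buffers(bias_scratch_size, algo_scratch_size);
        maybe_transform_inputs(ctx); // no-op for the plain cublas path
        return do_gemm(ctx);         // backend-specific matmul call
    }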

        transform_matrix(lt_handle, a_layout_, a, blocked_a_layout_,
                block_a_scratch, trans_a_, streamId);
        a = block_a_scratch;
    }

I guess the optimal approach would be to do the reorders separately from the execute call, to let the user schedule them and reduce copy overheads.

In any case, is there any benefit to doing this reorder inside the execute call?

Without adding this transform inside the execute, the IMMA kernels for cublasLt will only run with Ab32a for the weights and output; with the transform we can support other data formats as well.

The user can currently do the transform for the weights by reordering to Ab32a and passing it to the matmul pd as such; the transform will not be called in this case.

For now we do not have a reorder mapping the input matrix to the Ampere blocked format (due to the interleaved nature of the blocking), hence we need to do the transform inside the execute.

For the output, to schedule a transform at another time, the user can set the output of the matmul to Ab32a and reorder afterwards.
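
A minimal sketch of the output side of that flow, where everything except transform_matrix and block_c_scratch is an illustrative name rather than code from this PR:

    // Hypothetical: the LT matmul writes into a scratch buffer in the blocked
    // layout, then the result is transformed back to the user's dst format
    // within the same execute call.
    void *c = dst_is_blocked_ ? dst : block_c_scratch;
    run_lt_matmul(lt_handle, a, b, c, streamId);
    if (!dst_is_blocked_)
        transform_matrix(lt_handle, blocked_c_layout_, block_c_scratch,
                c_layout_, dst, /*trans_c=*/false, streamId);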

(CUdeviceptr)dst_scale, sizeof(float), streamId);
// For eltwise post-ops, apply the dst scale afterward
if (!with_separate_eltwise_) scale /= host_dst_scale;
}

I believe cublasLT can be configured to take device pointers (CUBLASLT_POINTER_MODE_DEVICE)
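
If that route were taken, the attribute would be set on the matmul descriptor roughly as follows (a sketch using the public cublasLt API, not code from this PR):

    // Tell cublasLt that alpha/beta are device pointers.
    cublasLtPointerMode_t mode = CUBLASLT_POINTER_MODE_DEVICE;
    cublasLtMatmulDescSetAttribute(matmul_desc,
            CUBLASLT_MATMUL_DESC_POINTER_MODE, &mode, sizeof(mode));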

That is true. However, the src, wei, and dst scales map to the cublasLt matmul alpha parameter, and the sum post-op maps to beta, so I do not believe device pointer mode would work here, short of creating a SYCL kernel that computes
alpha = (1 * src_scale * wei_scale) / dst_scale on the device side before passing it to the matmul.
Would that be the preferred approach?
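
Such a kernel could look roughly like the sketch below, assuming a SYCL queue q and single-element device-side scale buffers; all names here are illustrative:

    #include <sycl/sycl.hpp>

    // Hypothetical: fold the three scales into a single device-side alpha so
    // cublasLt could consume it in CUBLASLT_POINTER_MODE_DEVICE.
    void compute_alpha(sycl::queue &q, const float *src_scale,
            const float *wei_scale, const float *dst_scale, float *alpha) {
        q.single_task([=]() {
            alpha[0] = (src_scale[0] * wei_scale[0]) / dst_scale[0];
        });
    }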


It would work with oneDNN semantics only if this implementation didn't support post-ops. If it does, the scales must be kept separate, since src and weight scales are applied before the post-ops and dst scales are applied at the end.
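
Roughly, the intended order is

    dst = (1 / dst_scale) * post_ops(src_scale * wei_scale * matmul(src, wei))

so folding all three scales into a single alpha is only correct when the post-op chain is empty (this formula paraphrases the comment above, not the exact oneDNN spec wording).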

@vpirogov (Member)

make test
enable device_gpu
enable thr_sycl
enable thr_cuda

@ShanoToni

As a note on the latest commit: the IMMA kernels do not support an int8 output type for cublasLt with CUDA versions prior to 12. This is now handled at compile time with a define: if the CUDA version is less than 12, the primitive descriptor returns not supported for those cases.
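
A compile-time guard of that kind would look roughly like the sketch below; CUDA_VERSION is the standard macro from the CUDA headers, while use_imma_, dst_md(), and data_type::s8 are used here only as oneDNN-style placeholders, not necessarily the exact code in the PR:

    #if CUDA_VERSION < 12000
        // IMMA via cublasLt cannot produce int8 output before CUDA 12.
        if (use_imma_ && dst_md()->data_type == data_type::s8)
            return status::unimplemented;
    #endif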

Commits:
- Matmul PD refactor
- Skipping unsupported tests for lt impl
- Addressed MR comments
- Added checks to new bias