[5/n] Migrate CUTLASS MLA, hadamard, awq, allspark and DSV3 fused a gemm to torch stable ABI#38671
[5/n] Migrate CUTLASS MLA, hadamard, awq, allspark and DSV3 fused a gemm to torch stable ABI#38671mikaylagawarecki wants to merge 10 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request migrates several CUDA kernels—including AWQ, AllSpark, DeepSeek V3 fused A GEMM, Hadacore, and CUTLASS MLA—from the standard extension to the stable ABI extension (_C_stable_libtorch). The changes involve updating CMakeLists.txt to reassign source files, replacing standard Torch types and macros with stable ABI equivalents (e.g., torch::stable::Tensor, STD_TORCH_CHECK), and implementing stable ABI-compliant utilities for device property caching and cuBLAS handle retrieval. Feedback highlights critical issues regarding thread safety with global workspace tensors, potential compilation failures when using non-movable types in containers, and the need for better bounds checking and naming consistency in the new utility functions.
| // Device properties cache for stable ABI compatibility. | ||
| // Uses raw CUDA/HIP APIs instead of ATen functions. | ||
| // Using inline ensures a single instance across all translation units. | ||
| inline std::deque<std::once_flag> device_flags; |
There was a problem hiding this comment.
The use of std::deque<std::once_flag> is problematic because std::once_flag is non-copyable and non-movable. While std::deque generally provides stable pointers to its elements, the resize operation (line 35) requires the type to be MoveInsertable according to the C++ standard, which std::once_flag is not. This will likely lead to compilation errors on many toolchains. A better approach is to initialize all device properties at once during the global initialization phase, removing the need for per-device once_flag containers.
There was a problem hiding this comment.
The code here is actually a very slight adaptation of the code in torch https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/CUDAContext.cpp#L12-L59 to make it stable
(granted torch uses c10::once_flag, but that is also non-copyable and non-movable which has the same issue)
Since the std::deque is only ever resized once from size 0 to num_devices, I don't think this is actually problematic. However, I I can fix this if anyone thinks it is problematic
There was a problem hiding this comment.
This sounds like this issue was pre-existing and not risky, so I think it is ok to leave as is in order to match previous behavior more closely.
| #include "core/registration.h" | ||
| #include "libtorch_stable/torch_utils.h" | ||
|
|
||
| torch::stable::Tensor as_g_workspace; |
There was a problem hiding this comment.
The global variable as_g_workspace of type torch::stable::Tensor introduces a significant race condition. In a multi-threaded or multi-stream environment, concurrent calls to allspark_w8a16_gemm will attempt to check and reallocate this global tensor (lines 991-996), leading to memory corruption or use-after-free errors when one thread overwrites the workspace while another is using it. For stable ABI compatibility and thread safety, workspace memory should be managed via a thread-local cache, a per-device map, or ideally passed as an argument from the Python allocator.
There was a problem hiding this comment.
pre-existing
8bd7514 to
2233700
Compare
janeyx99
left a comment
There was a problem hiding this comment.
I gotta review the torch utils in more detail
| _in_feats.mutable_data_ptr<torch::headeronly::Half>()); | ||
| auto kernel = reinterpret_cast<int*>(_kernel.mutable_data_ptr<int>()); | ||
| auto out_feats = reinterpret_cast<half*>( | ||
| _out_feats.mutable_data_ptr<torch::headeronly::Half>()); |
There was a problem hiding this comment.
How did you know to use mutable here vs const?
There was a problem hiding this comment.
matched the constness of the arguments to the dequantize_weights function https://github.com/mikaylagawarecki/vllm/blob/2233700e84e05b4ce2b0169f41a573fd979e124a/csrc/libtorch_stable/quantization/awq/gemm_kernels.cu#L352-L353
There was a problem hiding this comment.
ah my new understanding is that if we're launching a kernel expecting to modify the results, then the pointers should be mutable?
tho in this case i agree just matching the caller's types is fine
| @@ -157,7 +168,7 @@ void rearrange_kn_weight_as_n32k16_order( | |||
| } | |||
| } | |||
|
|
|||
| TORCH_LIBRARY_IMPL_EXPAND(TORCH_EXTENSION_NAME, CUDA, m) { | |||
| STABLE_TORCH_LIBRARY_IMPL(_C, CUDA, m) { | |||
There was a problem hiding this comment.
I might have missed this earlier, but why do we change the lib name to _C instead of keeping it modifiable through the var?
There was a problem hiding this comment.
TORCH_EXTENSION_NAME is _C_stable_libtorch but we want the ops to be registered as torch.ops._C for backward compatibility
2eebfb2 to
663fa46
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
663fa46 to
b874ea2
Compare
|
Hi @mikaylagawarecki, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
b874ea2 to
bd83fe6
Compare
|
Hi @mikaylagawarecki, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
|
This pull request has merge conflicts that must be resolved before it can be |
bd83fe6 to
2c13410
Compare
2c13410 to
bdf7bed
Compare
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pure move, no code changes. Preparatory step for stable ABI migration. Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pure move, no code changes. Preparatory step for stable ABI migration. Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
bdf7bed to
fe13ae4
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
|
This pull request has merge conflicts that must be resolved before it can be |
|
This PR has been landed in #42339 and can be closed now |
|
Superseded by newer PRs. |
Purpose
#26946
Test Plan
On A100
On H100
On B200
Deepseek gemm kernel does not seem to have a test
Test Result
A100:

H100:


B200:

Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)