[5/n] Migrate CUTLASS MLA, hadamard, awq, allspark and DSV3 fused a gemm to torch stable ABI by mikaylagawarecki · Pull Request #38671 · vllm-project/vllm

mikaylagawarecki · 2026-04-01T00:14:13Z

Purpose

Test Plan

On A100

  python -m pytest tests/kernels/quantization/test_allspark_gemm.py

On H100

  python -m pytest tests/kernels/quantization/test_hadacore.py
  python -m pytest tests/kernels/quantization/test_awq.py

On B200

  python -m pytest tests/kernels/attention/test_cutlass_mla_decode.py

The DeepSeek V3 fused A GEMM kernel does not appear to have a test.

Test Result

A100:

H100:

B200:

gemini-code-assist

Code Review

This pull request migrates several CUDA kernels—including AWQ, AllSpark, DeepSeek V3 fused A GEMM, Hadacore, and CUTLASS MLA—from the standard extension to the stable ABI extension (_C_stable_libtorch). The changes involve updating CMakeLists.txt to reassign source files, replacing standard Torch types and macros with stable ABI equivalents (e.g., torch::stable::Tensor, STD_TORCH_CHECK), and implementing stable ABI-compliant utilities for device property caching and cuBLAS handle retrieval. Feedback highlights critical issues regarding thread safety with global workspace tensors, potential compilation failures when using non-movable types in containers, and the need for better bounds checking and naming consistency in the new utility functions.

gemini-code-assist · 2026-04-01T00:19:03Z

+// Device properties cache for stable ABI compatibility.
+// Uses raw CUDA/HIP APIs instead of ATen functions.
+// Using inline ensures a single instance across all translation units.
+inline std::deque<std::once_flag> device_flags;


The use of std::deque<std::once_flag> is problematic because std::once_flag is non-copyable and non-movable. While std::deque generally provides stable pointers to its elements, the resize operation (line 35) requires the type to be MoveInsertable according to the C++ standard, which std::once_flag is not. This will likely lead to compilation errors on many toolchains. A better approach is to initialize all device properties at once during the global initialization phase, removing the need for per-device once_flag containers.

The code here is actually a very slight adaptation of the code in torch https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/CUDAContext.cpp#L12-L59 to make it stable

(granted torch uses c10::once_flag, but that is also non-copyable and non-movable which has the same issue)

Since the std::deque is only ever resized once, from size 0 to num_devices, I don't think this is actually problematic. However, I can fix this if anyone thinks it is.

It sounds like this issue is pre-existing and low-risk, so I think it is OK to leave the code as is in order to match the previous behavior more closely.
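For readers following the std::once_flag discussion, here is a minimal sketch of the lazy per-device initialization pattern, written so that the flag container is constructed at its final size and never resized. It is not the code in this PR: DeviceProps, DevicePropsCache and the query callback are hypothetical stand-ins for cudaDeviceProp and the raw CUDA/HIP property query.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for cudaDeviceProp.
struct DeviceProps {
  int device_index = -1;
  int multiprocessor_count = 0;
};

// Lazily fills one DeviceProps entry per device, at most once each.
class DevicePropsCache {
 public:
  using QueryFn = std::function<DeviceProps(int)>;

  // Both containers are built at their final size, so no element is ever
  // moved or copied afterwards (std::once_flag supports neither).
  // Assumes num_devices is non-negative.
  DevicePropsCache(int num_devices, QueryFn query)
      : flags_(static_cast<std::size_t>(num_devices)),
        props_(static_cast<std::size_t>(num_devices)),
        query_(std::move(query)) {}

  const DeviceProps& get(int device) {
    if (device < 0 || static_cast<std::size_t>(device) >= props_.size()) {
      throw std::out_of_range("invalid device index");
    }
    const std::size_t i = static_cast<std::size_t>(device);
    // std::call_once runs the lambda once per flag, even when several
    // threads ask for the same device concurrently.
    std::call_once(flags_[i], [&] { props_[i] = query_(device); });
    return props_[i];
  }

 private:
  std::deque<std::once_flag> flags_;
  std::vector<DeviceProps> props_;
  QueryFn query_;
};
```

Constructing the deque with a size only requires the element type to be default-constructible, which sidesteps the MoveInsertable question raised about resize.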

gemini-code-assist · 2026-04-01T00:19:03Z

+#include "core/registration.h"
+#include "libtorch_stable/torch_utils.h"
+
+torch::stable::Tensor as_g_workspace;


The global variable as_g_workspace of type torch::stable::Tensor introduces a significant race condition. In a multi-threaded or multi-stream environment, concurrent calls to allspark_w8a16_gemm will attempt to check and reallocate this global tensor (lines 991-996), leading to memory corruption or use-after-free errors when one thread overwrites the workspace while another is using it. For stable ABI compatibility and thread safety, workspace memory should be managed via a thread-local cache, a per-device map, or ideally passed as an argument from the Python allocator.

pre-existing
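As an illustration of the thread-local cache option mentioned in the review comment, here is a minimal sketch. It is not part of this PR: Workspace and get_workspace are hypothetical names, and a host byte vector stands in for the device tensor that the real kernel would allocate.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a device allocation; the real kernel would
// hold a tensor here instead of host memory.
using Workspace = std::vector<unsigned char>;

// Returns a workspace of at least `bytes` for `device`. The cache is
// thread_local, so one thread can never reallocate a buffer that another
// thread is still using.
inline Workspace& get_workspace(int device, std::size_t bytes) {
  thread_local std::unordered_map<int, Workspace> cache;
  Workspace& ws = cache[device];  // references stay valid across later inserts
  if (ws.size() < bytes) {
    ws.resize(bytes);  // grow only, never shrink
  }
  return ws;
}
```

This only separates threads. Two streams driven from the same thread would still share one buffer, so that case would need a per-stream key or a workspace passed in by the caller.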

janeyx99

I gotta review the torch utils in more detail

janeyx99 · 2026-04-01T15:04:40Z

+      _in_feats.mutable_data_ptr<torch::headeronly::Half>());
+  auto kernel = reinterpret_cast<int*>(_kernel.mutable_data_ptr<int>());
+  auto out_feats = reinterpret_cast<half*>(
+      _out_feats.mutable_data_ptr<torch::headeronly::Half>());


How did you know to use mutable here vs const?

matched the constness of the arguments to the dequantize_weights function https://github.com/mikaylagawarecki/vllm/blob/2233700e84e05b4ce2b0169f41a573fd979e124a/csrc/libtorch_stable/quantization/awq/gemm_kernels.cu#L352-L353

ah my new understanding is that if we're launching a kernel expecting to modify the results, then the pointers should be mutable?

tho in this case i agree just matching the caller's types is fine
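The rule of thumb from this exchange (const pointers for inputs that are only read, mutable pointers for buffers the kernel writes) can be sketched with a toy type. ToyTensor, scale and scale_tensor are hypothetical and untemplated; only the accessor names are borrowed from the tensor API discussed above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy buffer that mimics the two accessor flavours.
class ToyTensor {
 public:
  explicit ToyTensor(std::size_t n) : data_(n, 0.0f) {}
  const float* const_data_ptr() const { return data_.data(); }
  float* mutable_data_ptr() { return data_.data(); }
  std::size_t numel() const { return data_.size(); }

 private:
  std::vector<float> data_;
};

// Reads `in` and writes `out`; the pointer constness states that contract.
inline void scale(const float* in, float* out, std::size_t n, float factor) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = in[i] * factor;
  }
}

// The accessor used at the call site matches the callee's parameter types.
inline void scale_tensor(const ToyTensor& in, ToyTensor& out, float factor) {
  assert(in.numel() == out.numel());
  scale(in.const_data_ptr(), out.mutable_data_ptr(), in.numel(), factor);
}
```

Matching the callee's signature, as done in the PR, gives the same result whenever that signature already marks read-only arguments as const.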

janeyx99 · 2026-04-01T15:08:19Z

@@ -157,7 +168,7 @@ void rearrange_kn_weight_as_n32k16_order(
  }
 }

-TORCH_LIBRARY_IMPL_EXPAND(TORCH_EXTENSION_NAME, CUDA, m) {
+STABLE_TORCH_LIBRARY_IMPL(_C, CUDA, m) {


I might have missed this earlier, but why do we change the lib name to _C instead of keeping it modifiable through the var?

TORCH_EXTENSION_NAME is _C_stable_libtorch but we want the ops to be registered as torch.ops._C for backward compatibility
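A toy sketch of the naming point, assuming the registration macros turn their first argument into the operator namespace by stringification. The TOY_ macros and both helper functions are hypothetical and only show why a build variable and a hard-coded name yield different namespaces.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for the build-provided extension name.
#define TOY_EXTENSION_NAME _C_stable_libtorch

// Two-level stringification so that macro arguments are expanded first.
#define TOY_STRINGIFY_(x) #x
#define TOY_STRINGIFY(x) TOY_STRINGIFY_(x)

// Namespace obtained when the name comes from the build variable.
inline const char* ns_from_build_variable() {
  return TOY_STRINGIFY(TOY_EXTENSION_NAME);
}

// Namespace obtained when the name is hard-coded.
inline const char* ns_hard_coded() { return TOY_STRINGIFY(_C); }
```

With the hard-coded name, ops compiled into the _C_stable_libtorch library still land in the same namespace as before, which is what keeps existing torch.ops._C call sites working.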

mergify · 2026-04-01T19:01:26Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mikaylagawarecki.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-04-01T21:56:45Z

Hi @mikaylagawarecki, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

mergify · 2026-04-02T04:40:20Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mikaylagawarecki.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-04-08T19:22:27Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mikaylagawarecki.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-05-18T14:45:16Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mikaylagawarecki.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

janeyx99 · 2026-05-18T19:46:34Z

This PR has been landed in #42339 and can be closed now

Harry-Chen · 2026-05-20T13:08:20Z

Superseded by newer PRs.

mergify Bot added ci/build nvidia labels Apr 1, 2026

github-project-automation Bot added this to NVIDIA Apr 1, 2026

gemini-code-assist Bot reviewed Apr 1, 2026

mikaylagawarecki force-pushed the new-stable-abi-phase5 branch 2 times, most recently from 8bd7514 to 2233700 on April 1, 2026 00:57

mikaylagawarecki marked this pull request as ready for review April 1, 2026 14:55

mikaylagawarecki requested review from LucasWilkinson and tlrmchlsmth as code owners April 1, 2026 14:55

janeyx99 reviewed Apr 1, 2026

janeyx99 approved these changes Apr 1, 2026

mikaylagawarecki force-pushed the new-stable-abi-phase5 branch 2 times, most recently from 2eebfb2 to 663fa46 on April 1, 2026 16:03

mergify Bot added the needs-rebase label Apr 1, 2026

mikaylagawarecki force-pushed the new-stable-abi-phase5 branch from 663fa46 to b874ea2 on April 1, 2026 21:52

mergify Bot removed the needs-rebase label Apr 1, 2026

mikaylagawarecki mentioned this pull request Apr 1, 2026

[6/n] Migrate activation kernels, gptq, gguf, non cutlass w8a8 to libtorch stable ABI #38757

Closed

5 tasks

mikaylagawarecki force-pushed the new-stable-abi-phase5 branch from b874ea2 to bd83fe6 on April 1, 2026 21:58

mergify Bot added the needs-rebase label Apr 2, 2026

mikaylagawarecki force-pushed the new-stable-abi-phase5 branch from bd83fe6 to 2c13410 on April 2, 2026 16:27

mergify Bot removed the needs-rebase label Apr 2, 2026

zou3519 approved these changes Apr 3, 2026

github-project-automation Bot moved this to Ready in NVIDIA Apr 3, 2026

mikaylagawarecki force-pushed the new-stable-abi-phase5 branch from 2c13410 to bdf7bed on April 7, 2026 23:13

BoyuanFeng added the ready label (ONLY add when PR is ready to merge/full CI is needed) Apr 7, 2026

Move CUTLASS MLA files from csrc to csrc/libtorch_stable

620000e

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>

mikaylagawarecki added 9 commits April 8, 2026 02:29

[a/n] Migrate CUTLASS MLA to torch stable ABI

3b8a04d

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>

Move Hadamard files from csrc to csrc/libtorch_stable

4d68f27

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>

[b/n] Migrate Hadamard (hadacore) kernel to torch stable ABI

34b3625

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>

Move AWQ files from csrc to csrc/libtorch_stable

49f178b

Pure move, no code changes. Preparatory step for stable ABI migration.

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>

[c/n] Migrate AWQ kernels to torch stable ABI

c0c6e56

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>

Move DSV3 fused A GEMM from csrc to csrc/libtorch_stable

53ce91b

Pure move, no code changes. Preparatory step for stable ABI migration.

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>

[d/n] Migrate DSV3 fused A GEMM to torch stable ABI

7a3a6ef

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>

Move AllSpark files from csrc to csrc/libtorch_stable

532bfc5

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>

[e/n] Migrate AllSpark kernels to torch stable ABI

fe13ae4

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>

mikaylagawarecki force-pushed the new-stable-abi-phase5 branch from bdf7bed to fe13ae4 on April 8, 2026 09:29

mergify Bot added the needs-rebase label Apr 8, 2026

This was referenced May 11, 2026

[5/n] Migrate CUTLASS MLA, hadamard, awq, allspark and DSV3 fused a gemm to torch stable ABI (continued) #42339

Merged

[6/n] Migrate activation kernels, gptq, gguf, non cutlass w8a8 to libtorch stable ABI (continued) #42663

Merged

mergify Bot removed the needs-rebase label May 18, 2026

mergify Bot added the needs-rebase label May 18, 2026

Harry-Chen closed this May 20, 2026

github-project-automation Bot moved this from Ready to Done in NVIDIA May 20, 2026
