[5/n] Migrate CUTLASS MLA, hadamard, awq, allspark and DSV3 fused a gemm to torch stable ABI#38671

Closed
mikaylagawarecki wants to merge 10 commits into vllm-project:main from mikaylagawarecki:new-stable-abi-phase5

@mikaylagawarecki mikaylagawarecki commented Apr 1, 2026

Contributor

Purpose

#26946

Test Plan

On A100

  python -m pytest tests/kernels/quantization/test_allspark_gemm.py          

On H100

  python -m pytest tests/kernels/quantization/test_hadacore.py
  python -m pytest tests/kernels/quantization/test_awq.py      

On B200

  python -m pytest tests/kernels/attention/test_cutlass_mla_decode.py

The DeepSeek V3 fused A GEMM kernel does not appear to have a test.

Test Result

A100:
[Screenshot, 2026-03-31 8:11:43 PM]

H100:
[Screenshot, 2026-03-31 8:17:17 PM]
[Screenshot, 2026-03-31 8:18:39 PM]

B200:
[Screenshot, 2026-03-31 8:13:28 PM]



@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request migrates several CUDA kernels—including AWQ, AllSpark, DeepSeek V3 fused A GEMM, Hadacore, and CUTLASS MLA—from the standard extension to the stable ABI extension (_C_stable_libtorch). The changes involve updating CMakeLists.txt to reassign source files, replacing standard Torch types and macros with stable ABI equivalents (e.g., torch::stable::Tensor, STD_TORCH_CHECK), and implementing stable ABI-compliant utilities for device property caching and cuBLAS handle retrieval. Feedback highlights critical issues regarding thread safety with global workspace tensors, potential compilation failures when using non-movable types in containers, and the need for better bounds checking and naming consistency in the new utility functions.

// Device properties cache for stable ABI compatibility.
// Uses raw CUDA/HIP APIs instead of ATen functions.
// Using inline ensures a single instance across all translation units.
inline std::deque<std::once_flag> device_flags;

Contributor

high

The use of std::deque<std::once_flag> is problematic because std::once_flag is non-copyable and non-movable. While std::deque generally provides stable pointers to its elements, the resize operation (line 35) requires the type to be MoveInsertable according to the C++ standard, which std::once_flag is not. This will likely lead to compilation errors on many toolchains. A better approach is to initialize all device properties at once during the global initialization phase, removing the need for per-device once_flag containers.

@mikaylagawarecki mikaylagawarecki Apr 1, 2026

Contributor Author

The code here is actually a very slight adaptation of the code in torch https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/CUDAContext.cpp#L12-L59 to make it stable

(granted torch uses c10::once_flag, but that is also non-copyable and non-movable which has the same issue)

Since the std::deque is only ever resized once, from size 0 to num_devices, I don't think this is actually problematic. However, I can fix this if anyone thinks it is.

Contributor

This sounds like this issue was pre-existing and not risky, so I think it is ok to leave as is in order to match previous behavior more closely.
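
For readers following this thread, here is a minimal sketch of the pattern under discussion: one std::once_flag per device, held in a container that is sized exactly once. It is not the code from this PR or from PyTorch. FakeDeviceProps, query_device and DevicePropsCache are hypothetical names, and the driver query is faked so the sketch builds without CUDA. It sizes the deque through the counted constructor, which only requires the element type to be DefaultInsertable, rather than through resize.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a device-properties struct (not cudaDeviceProp).
struct FakeDeviceProps {
  int device_index = -1;
  int multiprocessor_count = 0;
};

// Hypothetical stand-in for a raw driver query; returns made-up values.
inline FakeDeviceProps query_device(std::size_t device) {
  FakeDeviceProps p;
  p.device_index = static_cast<int>(device);
  p.multiprocessor_count = 100 + static_cast<int>(device);
  return p;
}

class DevicePropsCache {
 public:
  // Both containers are sized once, here. The counted constructor only
  // default-constructs elements, so no once_flag is ever moved or copied.
  explicit DevicePropsCache(std::size_t num_devices)
      : flags_(num_devices), props_(num_devices) {}

  // Fills the entry for `device` on first use, then returns the cached value.
  const FakeDeviceProps& get(std::size_t device) {
    assert(device < flags_.size());  // bounds check before indexing
    std::call_once(flags_[device], [this, device] {
      props_[device] = query_device(device);
      ++query_count_;
    });
    return props_[device];
  }

  // Number of times the (fake) driver query has actually run.
  int query_count() const { return query_count_.load(); }

 private:
  std::deque<std::once_flag> flags_;
  std::vector<FakeDeviceProps> props_;
  std::atomic<int> query_count_{0};
};
```

Repeated calls to get() for the same device run the query only once. Whether a counted constructor is a practical drop-in for a global that learns the device count lazily is a separate question that this sketch does not answer.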

Comment thread csrc/libtorch_stable/torch_utils.h
Comment thread csrc/libtorch_stable/torch_utils.h
#include "core/registration.h"
#include "libtorch_stable/torch_utils.h"

torch::stable::Tensor as_g_workspace;

Contributor

high

The global variable as_g_workspace of type torch::stable::Tensor introduces a significant race condition. In a multi-threaded or multi-stream environment, concurrent calls to allspark_w8a16_gemm will attempt to check and reallocate this global tensor (lines 991-996), leading to memory corruption or use-after-free errors when one thread overwrites the workspace while another is using it. For stable ABI compatibility and thread safety, workspace memory should be managed via a thread-local cache, a per-device map, or ideally passed as an argument from the Python allocator.

Contributor Author

pre-existing
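
One of the alternatives named in the review comment above, a thread-local cache keyed by device, could look roughly like the sketch below. This is an illustration only and is not part of the PR. Workspace and get_workspace are hypothetical names, and a host-side byte vector stands in for the device tensor.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a device allocation; real code would hold a tensor.
using Workspace = std::vector<unsigned char>;

// Returns a workspace of at least `bytes` for `device`. The cache is
// thread_local, so one thread never reallocates a buffer another thread is using.
inline Workspace& get_workspace(int device, std::size_t bytes) {
  thread_local std::unordered_map<int, Workspace> cache;
  Workspace& ws = cache[device];  // references stay valid across later inserts
  if (ws.size() < bytes) {
    ws.resize(bytes);  // grow only, never shrink
  }
  return ws;
}
```

This isolates threads from each other, but it does not by itself protect two streams driven from the same thread, so it should be read as a starting point rather than a complete fix.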

@mikaylagawarecki mikaylagawarecki force-pushed the new-stable-abi-phase5 branch 2 times, most recently from 8bd7514 to 2233700 Compare April 1, 2026 00:57
@mikaylagawarecki mikaylagawarecki marked this pull request as ready for review April 1, 2026 14:55

@janeyx99 janeyx99 left a comment

Contributor

I gotta review the torch utils in more detail

_in_feats.mutable_data_ptr<torch::headeronly::Half>());
auto kernel = reinterpret_cast<int*>(_kernel.mutable_data_ptr<int>());
auto out_feats = reinterpret_cast<half*>(
_out_feats.mutable_data_ptr<torch::headeronly::Half>());

Contributor

How did you know to use mutable here vs const?

Contributor Author

Contributor

ah my new understanding is that if we're launching a kernel expecting to modify the results, then the pointers should be mutable?

tho in this case i agree just matching the caller's types is fine
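
To make the mutable-versus-const distinction in this thread concrete, here is a small sketch. Buffer is a hypothetical wrapper and not the torch::stable::Tensor API; only the accessor names mirror the ones quoted above. Inputs that are only read go through the const accessor, while the output that the routine writes goes through the mutable one.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical, minimal tensor-like wrapper; NOT the torch::stable::Tensor API.
class Buffer {
 public:
  explicit Buffer(std::vector<float> v) : data_(std::move(v)) {}
  // Read-only access: usable on const objects, cannot be written through.
  const float* const_data_ptr() const { return data_.data(); }
  // Writable access: only available on non-const objects.
  float* mutable_data_ptr() { return data_.data(); }
  std::size_t size() const { return data_.size(); }

 private:
  std::vector<float> data_;
};

// Reads `in` through a const pointer and writes `out` through a mutable one.
inline void scale(const Buffer& in, Buffer& out, float factor) {
  assert(in.size() == out.size());
  const float* src = in.const_data_ptr();
  float* dst = out.mutable_data_ptr();
  for (std::size_t i = 0; i < in.size(); ++i) {
    dst[i] = src[i] * factor;
  }
}
```

When the callee's signature already takes non-const pointers for every argument, matching it with the mutable accessor, as the reply above suggests, avoids a const_cast at the call site.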

@@ -157,7 +168,7 @@ void rearrange_kn_weight_as_n32k16_order(
 }
 }

-TORCH_LIBRARY_IMPL_EXPAND(TORCH_EXTENSION_NAME, CUDA, m) {
+STABLE_TORCH_LIBRARY_IMPL(_C, CUDA, m) {

Contributor

I might have missed this earlier, but why do we change the lib name to _C instead of keeping it modifiable through the var?

@mikaylagawarecki mikaylagawarecki Apr 1, 2026

Contributor Author

TORCH_EXTENSION_NAME is _C_stable_libtorch but we want the ops to be registered as torch.ops._C for backward compatibility
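
The point of the reply above is that the name of the shared library and the namespace its ops are registered under are independent. A toy model of that idea follows. It is not how the PyTorch dispatcher is implemented; registry, register_op and has_op are hypothetical helpers and example_op is a made-up op name.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy registry: maps "namespace::op" to the binary that registered it.
inline std::map<std::string, std::string>& registry() {
  static std::map<std::string, std::string> r;
  return r;
}

// Hypothetical helper: registers `op` under namespace `ns` and records
// which binary provided it. The binary name plays no part in the key.
inline void register_op(const std::string& ns, const std::string& op,
                        const std::string& binary) {
  registry()[ns + "::" + op] = binary;
}

// Looks up by namespace and op name only, as a caller of
// torch.ops.<ns>.<op> would.
inline bool has_op(const std::string& ns, const std::string& op) {
  return registry().count(ns + "::" + op) > 0;
}
```

In this model, an op registered from a binary called _C_stable_libtorch under the namespace _C is found by a lookup on _C, so existing callers do not need to change.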

Comment thread csrc/libtorch_stable/quantization/hadamard/hadacore/hadamard_transform_cuda.cu Outdated
@mikaylagawarecki mikaylagawarecki force-pushed the new-stable-abi-phase5 branch 2 times, most recently from 2eebfb2 to 663fa46 Compare April 1, 2026 16:03
@mergify

mergify Bot commented Apr 1, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mikaylagawarecki.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify

mergify Bot commented Apr 1, 2026

Contributor

Hi @mikaylagawarecki, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@mergify

mergify Bot commented Apr 2, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mikaylagawarecki.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 2, 2026
@mergify mergify Bot removed the needs-rebase label Apr 2, 2026
@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA Apr 3, 2026
@BoyuanFeng BoyuanFeng added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 7, 2026
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pure move, no code changes. Preparatory step for stable ABI migration.
@mergify

mergify Bot commented Apr 8, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mikaylagawarecki.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify

mergify Bot commented May 18, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mikaylagawarecki.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 18, 2026
@janeyx99

Contributor

This PR has been landed in #42339 and can be closed now

@Harry-Chen

Member

Superseded by newer PRs.

@Harry-Chen Harry-Chen closed this May 20, 2026
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA May 20, 2026
Labels

ci/build needs-rebase nvidia ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done

5 participants