
[1/n] Migrate activation kernels to libtorch stable ABI #30908

Open
mikaylagawarecki wants to merge 2 commits into vllm-project:main from mikaylagawarecki:torch_stable_abi

Conversation


@mikaylagawarecki mikaylagawarecki commented Dec 17, 2025

Purpose

This change requires torch 2.10+ to build.

First step towards migrating the CUDA wheel to the libtorch stable ABI; see #26946. The benefit of migrating to the libtorch stable ABI is that the stable binary can be built once (with torch 2.10 or greater) and then run against any version of torch >= 2.10.

  1. activation_kernels.cu is migrated to use the stable ABI/API
  2. Creates csrc/torch_bindings_stable.cpp that registers the kernels in activation_kernels.cu via STABLE_TORCH_LIBRARY_FRAGMENT to the _C namespace
  3. Creates a new _C_stable.so, in addition to the existing _C.so, with sources:
     - csrc/activation_kernels.cu
     - csrc/torch_bindings_stable.cpp
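For context on what the migrated kernel computes: silu_and_mul (the activation used in SwiGLU) splits the input in half along the last dimension and computes SiLU(x) * y. A minimal pure-Python sketch of those semantics (names and layout here are illustrative, not the kernel's actual code):

```python
import math

def silu(x: float) -> float:
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def silu_and_mul(row: list[float]) -> list[float]:
    # Input is [x; y] concatenated along the last dim;
    # output has half the width of the input.
    d = len(row) // 2
    return [silu(row[i]) * row[i + d] for i in range(d)]
```

The CUDA kernel fuses the split, activation, and multiply into a single pass over the tensor; this sketch only pins down the expected numerics.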

I am looking for feedback on the following

  1. The PR separates the stable kernels into a new _C_stable target, is this ok?
    a. The rationale is that this extension is built with -DTORCH_TARGET_VERSION=... (ensuring only stable APIs are used)
    b. Kernels in this extension are still registered to the _C namespace for backward compat.
  2. The PR requires torch >= 2.10 in order to build _C_stable; I'm wondering whether that is alright
    a. It can be built once with any version >= 2.10 and will run in all versions of torch >= 2.10
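A rough sketch of what such a separate target could look like in CMake, assuming vLLM's define_gpu_extension_target helper (the exact names, arguments, and the hypothetical _C_stable target here are illustrative, not the PR's actual build code):

```cmake
# Hypothetical sketch: a separate extension target restricted to the
# libtorch stable ABI. Target/variable names are illustrative.
define_gpu_extension_target(
  _C_stable
  DESTINATION vllm
  LANGUAGE CUDA
  SOURCES csrc/activation_kernels.cu csrc/torch_bindings_stable.cpp
  USE_SABI 3)

# Restrict this target to C-shim APIs available in torch >= 2.10,
# so the resulting .so is ABI compatible with any torch >= 2.10.
target_compile_definitions(_C_stable PRIVATE
  TORCH_TARGET_VERSION=0x020A000000000000ULL)
```

Keeping the define on its own target is what makes the guarantee checkable at build time: any use of a non-stable API in these sources fails to compile rather than silently binding to a specific libtorch.so.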

Test Plan

pytest tests/kernels/core/test_activation.py

Test Result

[Screenshot 2025-12-17 at 5:47:19 PM]


@gemini-code-assist (bot) left a comment


Code Review

This pull request migrates the activation kernels to use the PyTorch stable ABI. This is a good step towards improving the long-term maintainability of the CUDA kernels by depending on a stable API. The changes are well-structured, creating a new _C_stable extension for the migrated kernels and updating the build system accordingly.

My review focuses on the implementation details of this migration. I've identified an opportunity to improve the maintainability of the new CUDA code by refactoring duplicated logic in the kernel launch macros. This will make the code cleaner and less prone to errors in future modifications.


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@mergify mergify bot added the cpu Related to CPU backends label Dec 17, 2025

@Harry-Chen (Member) commented:

Thank you for putting it together! This is definitely what we need.

A general question: does it make sense to migrate to the libtorch stable ABI only partially? In other words,
the remaining uses of the non-stable libtorch ABI will still exist, and they will still bind us to a specific version of libtorch.so.
Of course, this work is not easy and can be done part by part, but unless we get rid of all of them, the migration will not take actual effect.

Other minor comments for your changes:

The PR separates the stable kernels into a new _C_stable target, is this ok?

What about a name like _C_stable_libtorch, just for more clarity?

b. Kernels in this extension are still registered to the _C namespace for backward compat.

IIUC this will lead to duplicate symbols in both library files. Is this necessary, i.e., who might still be using the old symbols?

The PR requires torch version >=2.10 in order to build _C_stable, I'm wondering whether that is alright

We could do this after the release of PyTorch 2.10; I think that is about a month away. During this period we could build and test against nightly PyTorch.


mikaylagawarecki commented Dec 19, 2025

Hi @Harry-Chen,

Thank you for your feedback!

But unless we could get rid of all of them, this migration will not take actual effect.

I think you are right here. The rationale for starting this with an initial small PR is per youkaichao's feedback on the initial issue

Is it possible to use stable APIs gradually? Like gradually enable it in some files. If we have to do it once or nothing, I'm afraid that would be a very large PR (similar to the FA3 one) and never get merged (vLLM's kernel changes are quite frequent).

Indeed, the final switch of building a libtorch ABI stable wheel would have to wait till all the relevant files are only using stable APIs, but the progress does not need to be all or nothing. Does it make sense to you if I continue this enablement for the files in the CUDA wheel with a stack that migrates files one by one?

What about a name like _C_stable_libtorch, just for more clarity?

👍 Will fix!

IIUC this will lead to duplicate symbols in both library files. Is this necessary, i.e., who might still be using the old symbols?

To clarify what I meant here: the _C_stable target uses a STABLE_TORCH_LIBRARY_FRAGMENT with the namespace _C. This ensures that the function is still callable from Python (e.g. as torch.ops._C.silu_and_mul).

I don't think there are duplicate symbols, because I removed the registrations from the respective TORCH_LIBRARY in the _C target.

We could do this after the release of PyTorch 2.10. I think that is about ~1 month later. And during this period we could run build and test on nightly pytorch

That sounds good! Is there a guide on how to enable this in CI?

@Harry-Chen (Member) commented:

@mikaylagawarecki

Indeed, the final switch of building a libtorch ABI stable wheel would have to wait till all the relevant files are only using stable APIs, but the progress does not need to be all or nothing. Does it make sense to you if I continue this enablement with a stack that migrates files one by one?

Yes, this totally makes sense. So our target is to move all C++ compilation units to VLLM_STABLE_EXT_SRC. Do you have an estimate for this, e.g. can we finish with PyTorch 2.10, or will we be blocked by some APIs that do not exist yet?

To clarify what I meant here, I meant that the stable library in C_stable registers a TORCH_LIBRARY_FRAGMENT with the namespace _C. This ensures that the function is still callable from python as torch._C.silu_and_mul.

I don't think there are duplicate symbols because I removed the registrations from the respective TORCH_LIBRARY in the _C target

I got it now. Thank you for explaining!

We could do this after the release of PyTorch 2.10. I think that is about ~1 month later. And during this period we could run build and test on nightly pytorch

That sounds good! Is there a guide on how to enable this in CI?

Maybe you can refer to https://docs.vllm.ai/en/latest/contributing/ci/update_pytorch_version/#test-pytorch-release-candidates-rcs. And CC @youkaichao and @khluu for more ideas here.


mikaylagawarecki commented Dec 19, 2025

Yes this totally makes sense. So our target is to move all c++ compilation units to VLLM_STABLE_EXT_SRC. Do you have any evaluation on this, e.g. can we finish this with PyTorch 2.10, or will we be blocked by some APIs that do not exist yet?

We had gone over the APIs in the CUDA build of vLLM before, and I believe we should have what is needed for vllm/_C.abi3.so, vllm/_moe_C.abi3.so, and vllm/cumem_allocator.abi3.so in torch 2.10 (note this doesn't include the ROCm wheel, CPU wheel, etc.). That said, there might be unknown unknowns (e.g. kernels that changed or APIs that were missed), so I cannot say this with 100% confidence until I migrate the kernels (which I intend to do asap :))

Would you object to the partial migration if we find that there are a few kernels that must remain unstable for 2.10?

I am aware that vLLM also has dependencies on FlashMLA, qutlass, and FlashAttention 2/3. Would migrating these be a prerequisite? Currently only FA3 is on torch's stable ABI.

@Harry-Chen (Member) commented:

We had gone over the APIs in the CUDA build of vllm before and I believe we should have what is needed for vllm/_C.abi3.so, vllm/_moe_C.abi3.so and vllm/cumem_allocator.abi3.so in torch 2.10 (note this doesn't include rocm wheel or cpu wheel etc.). That said there might be unknown-unknowns (e.g. kernels that changed or APIs that were missed). So I cannot say this with 100% confidence until I migrate the kernels (which I intend to do asap :))

Would you be in objection of the partial migration if we find that there are a few kernels that must be unstable for 2.10?

Of course not; any progress is good, even if not 100%, as long as we are moving towards it.

I am aware that vllm also has dependencies on FlashMLA, qutlass and flash attention2/3. Would migrating these be a prerequisite? Currently only fa3 is on torch's stable abi

The ideal situation is that everything gets migrated. Please note that the FlashMLA and FlashAttention we use in vLLM are forks (under vllm-project on GitHub), so this might require extra work (either cherry-picking upstream's changes, rebasing, or doing it on our own). You could also evaluate the amount of work required, and we can discuss later.

@mikaylagawarecki mikaylagawarecki changed the title Migrate activation kernels to libtorch stable ABI [1/n] Migrate activation kernels to libtorch stable ABI Dec 29, 2025
@mikaylagawarecki mikaylagawarecki force-pushed the torch_stable_abi branch 3 times, most recently from f6f00f5 to f57fbaa Compare January 15, 2026 22:28

mergify bot commented Jan 15, 2026

Documentation preview: https://vllm--30908.org.readthedocs.build/en/30908/

@mikaylagawarecki mikaylagawarecki force-pushed the torch_stable_abi branch 2 times, most recently from 208d7e5 to 5c06dc5 Compare January 16, 2026 22:08

mergify bot commented Jan 16, 2026

Hi @mikaylagawarecki, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint



mergify bot commented Feb 11, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mikaylagawarecki.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Cherry-picked from temp4 996b0bb with the following additional changes:

- Merged M6 Blackwell 256-bit vectorization (PackedTraits, ld256/st256,
  cc_major branching) with stable ABI APIs (DeviceGuard,
  get_current_cuda_stream, mutable_data_ptr/const_data_ptr,
  VLLM_STABLE_DISPATCH_FLOATING_TYPES)
- Added get_device_prop() cached device properties utility to
  csrc/stable/torch_utils.h for stable ABI compatible device queries
- Added explicit cuda_bf16.h/cuda_fp16.h includes for packed type
  intrinsics (hip equivalents on ROCm)
- Replaced c10::BFloat16/c10::Half with torch::headeronly:: equivalents
- ROCm fix: cuda_compat.h -> ../cuda_compat.h in stable/activation_kernels.cu
- ROCm fix: import _C_stable_libtorch in vllm/platforms/rocm.py

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
target_link_libraries(${MOD_NAME} PRIVATE torch CUDA::cudart CUDA::cuda_driver ${ARG_LIBRARIES})
else()
target_link_libraries(${MOD_NAME} PRIVATE torch ${TORCH_LIBRARIES} ${ARG_LIBRARIES})
# Link against PyTorch's bundled libtorch_hip.so (for DeviceGuard registration)
@mikaylagawarecki (Contributor, Author) commented Feb 24, 2026:

On a ROCm machine, I found that I needed these changes to make sure vLLM was linking against these two .so files from torch (in particular the libamd.so HIP libraries packaged by torch); otherwise the stable DeviceGuard would not work correctly.

It seemed that two separate hipContexts were created: one by the raw HIP calls that vLLM made (which came from a HIP header from elsewhere), and one by the libtorch shims that called raw HIP APIs using the HIP headers packaged by torch.

Collaborator commented:

@mikaylagawarecki I'm not sure what's going on here. Do all extensions need this code? Or is vLLM doing something weird?

at::Tensor& y_s, // (E, T, H//group_size) [OUT]
bool use_ue8m0);

void mul_and_silu(torch::Tensor& out, torch::Tensor& input);
@mikaylagawarecki (Contributor, Author) commented Feb 24, 2026:

Since it might be confusing why this is deleted: I kept the other declarations because the CPU torch_bindings.cpp includes ops.h. All the other ops are defined for CPU as well, but this op isn't defined for CPU, so its declaration is deleted.


mergify bot commented Feb 28, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mikaylagawarecki.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 28, 2026
Comment on lines +988 to +991
# Set TORCH_TARGET_VERSION for stable ABI compatibility.
# This ensures we only use C-shim APIs available in PyTorch 2.10+.
target_compile_definitions(_C_stable_libtorch PRIVATE
TORCH_TARGET_VERSION=0x020A000000000000ULL)
Collaborator commented:

nit: comment that "_C_stable_libtorch is ABI compatible with PyTorch >= TORCH_TARGET_VERSION, which is currently set to 2.10" (explicitly state what we get).
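For reference, the hex constant packs the version into the high bytes of a 64-bit value; a quick decode (the bit layout below is inferred from the value in this diff, so treat it as an assumption rather than PyTorch's documented encoding):

```python
TORCH_TARGET_VERSION = 0x020A000000000000

# Major and minor appear to occupy the top two bytes of the 64-bit value:
# 0x02 -> major 2, 0x0A -> minor 10, i.e. PyTorch 2.10.
major = (TORCH_TARGET_VERSION >> 56) & 0xFF
minor = (TORCH_TARGET_VERSION >> 48) & 0xFF
print(f"{major}.{minor}")
```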

@@ -0,0 +1,25 @@
/*
* Stable ABI compatible dispatch utilities for vLLM.
Collaborator commented:

nit: Whenever you add "stable ABI" in a comment, you probably want to call it "libtorch stable ABI" for clarity.

STABLE_TORCH_LIBRARY_FRAGMENT(_C, m) {
// Activation ops
// Activation function used in SwiGLU.
m.def("silu_and_mul(Tensor! result, Tensor input) -> ()");
Collaborator commented:

Just to check... you are able to add tags to operators in the stable abi?

@mikaylagawarecki (Contributor, Author) replied:

Do you mean adding at::Tag to the registration? That is currently not ABI stable (but I don't believe at::Tag is used within the repo at the moment).

#include <mutex>
#include <string>
#include <vector>

Collaborator commented:

there's no namespacing at all, is that intentional?

Collaborator replied:

looks like vllm/csrc/ops.h has no namespacing in the first place lol


Labels

ci/build · cpu (Related to CPU backends) · documentation (Improvements or additions to documentation) · needs-rebase · nvidia · rocm (Related to AMD ROCm)
