
Conversation

@rainj-me (Collaborator)

Motivation

#9490
PR 9495

  • Support CUDA 13.0 with a custom FlashInfer build and the TRT-LLM kernels from Aug 25, 2025
  • Support sm_110 and sm_121 on CUDA 13.0 (see the illustrative compile flags below)
  • Support --compress-mode=size on CUDA 13.0
  • Keep sm_101 support on CUDA 12.8/12.9
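
For reference, on a CUDA 13.0 build the flags implied by the bullets above would look roughly like the following on the nvcc command line (an illustrative sketch; the actual flags are set in sgl-kernel's build configuration and may differ):

-gencode arch=compute_110,code=sm_110 -gencode arch=compute_121,code=sm_121 --compress-mode=size

while CUDA 12.8/12.9 builds keep -gencode arch=compute_101,code=sm_101.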

Test

  • Step 1, pull the NVIDIA PyTorch 25.08 image
docker pull nvcr.io/nvidia/pytorch:25.08-py3
  • Step 2, run the container with a bash shell, for example:
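# one possible invocation; flags here are illustrative, adjust mounts and options to your setup
docker run --gpus all -it --ipc=host nvcr.io/nvidia/pytorch:25.08-py3 bash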
  • Step 3, clone the repository and check out this change, for example:
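# repository taken from the merge target (sgl-project/sglang); the branch name is a placeholder
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout <branch-with-this-change>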
  • Step 4, comment out the torch build dependency in sgl-kernel/pyproject.toml with a patch like:
diff --git a/sgl-kernel/pyproject.toml b/sgl-kernel/pyproject.toml
index 52ee620e4..177e49e57 100644
--- a/sgl-kernel/pyproject.toml
+++ b/sgl-kernel/pyproject.toml
@@ -1,7 +1,7 @@
 [build-system]
 requires = [
   "scikit-build-core>=0.10",
-  "torch>=2.8.0",
+  # "torch>=2.8.0",
   "wheel",
 ]
  • Step 5, patch python/pyproject.toml to match the torch version shipped in the container; in my container the patch looks like:
diff --git a/python/pyproject.toml b/python/pyproject.toml
index c23efbc2e..b29789d45 100644
--- a/python/pyproject.toml
+++ b/python/pyproject.toml
@@ -49,7 +49,7 @@ runtime_common = [
     "scipy",
     "timm==1.0.16",
     "tiktoken",
-    "torchao==0.9.0",
+    "torchao==0.12.0+git",
     "transformers==4.55.2",
     "uvicorn",
     "uvloop",
@@ -59,21 +59,19 @@ runtime_common = [
 srt = [
     "sglang[runtime_common]",
     "sgl-kernel==0.3.5",
-    "torch==2.8.0",
-    "torchaudio==2.8.0",
+    "torch==2.8.0a0+34c6371d24.nv25.8",
     "torchvision",
     "cuda-python",
-    "flashinfer_python==0.2.11.post3",
+    "flashinfer_python==0.2.14.post1",
 ]
 
 blackwell = [
     "sglang[runtime_common]",
     "sgl-kernel",
-    "torch==2.8.0",
-    "torchaudio==2.8.0",
+    "torch==2.8.0a0+34c6371d24.nv25.8",
     "torchvision",
     "cuda-python",
-    "flashinfer_python==0.2.11.post3",
+    "flashinfer_python==0.2.14.post1",
 ]
  • Step 6, install sgl-kernel
CUDA_VERSION=13.0 CMAKE_BUILD_PARALLEL_LEVEL="$(nproc)" SKBUILD_BUILD_DIR=./build CMAKE_ARGS="-DCMAKE_POLICY_VERSION_MINIMUM=3.5"  pip install -v .
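# optional sanity check (assumes the wheel is importable as sgl_kernel)
python -c "import torch, sgl_kernel; print(torch.version.cuda)"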

Modifications

  • Use a custom FlashInfer build to support CUDA 13.0 and load the TRT-LLM kernels from Aug 25, 2025
  • Fix the cub::Sum/cub::Max deprecation issue so the kernels build on both CUDA 12.x and CUDA 13.0 (see the sketch below)
  • Use torch 2.8.x with CUDA 13.0
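
A minimal sketch of the CUB compatibility shim from the second bullet, assuming the version check keys off CUDART_VERSION with the 12090 threshold mentioned in review (the PR's actual macro, threshold, and call sites may differ):

#include <cub/cub.cuh>

#if defined(CUDART_VERSION) && CUDART_VERSION >= 12090
// CUDA 13.0 (and 12.9 pre-releases) deprecate cub::Max/cub::Min/cub::Sum
// in favor of the libcu++ functional objects.
#include <cuda/functional>      // cuda::maximum, cuda::minimum
#include <cuda/std/functional>  // cuda::std::plus
using MaxOp = cuda::maximum<>;
using SumOp = cuda::std::plus<>;
#else
using MaxOp = cub::Max;
using SumOp = cub::Sum;
#endif

template <int BLOCK_SIZE>
__global__ void block_max_sum(const float* in, float* out_max, float* out_sum) {
  // Each block reduces BLOCK_SIZE contiguous elements; the call sites are
  // identical on CUDA 12.x and 13.0, only the operator types change.
  using BlockReduce = cub::BlockReduce<float, BLOCK_SIZE>;
  __shared__ typename BlockReduce::TempStorage tmp;
  float v = in[blockIdx.x * BLOCK_SIZE + threadIdx.x];
  float m = BlockReduce(tmp).Reduce(v, MaxOp());
  __syncthreads();  // temp storage is reused for the second reduction
  float s = BlockReduce(tmp).Reduce(v, SumOp());
  if (threadIdx.x == 0) {
    out_max[blockIdx.x] = m;
    out_sum[blockIdx.x] = s;
  }
}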

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist (bot, Contributor) left a comment:

Summary of Changes

Hello @rainj-me, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for CUDA 13.0, enabling compatibility with newer NVIDIA GPU architectures and features. It updates the build system to incorporate specific CUDA 13.0 flags and integrates a custom version of FlashInfer. Additionally, it refactors the Marlin MoE kernel compilation process for improved organization and addresses CUB library compatibility issues with CUDA 13.0 by adopting newer CUDA functional APIs for reduction operations.

Highlights

  • CUDA 13.0 Compatibility: This PR introduces comprehensive support for CUDA 13.0, enabling the project to leverage the latest NVIDIA GPU architectures and features. This includes adding specific sm_110 and sm_121 compute capabilities and the --compress-mode=size flag for CUDA 13.0 builds, while ensuring backward compatibility for older CUDA versions.
  • FlashInfer Dependency Update: The FlashInfer dependency has been updated to a specific commit (018b551825c8e5579206e6eb9d3229fa679202b3), indicating the integration of a custom or updated version of FlashInfer that is likely optimized or required for CUDA 13.0 compatibility.
  • Marlin MoE Kernel Refactoring: The Marlin MoE kernel compilation process has been refactored. Individual .cu kernel files are now renamed to .cuh and consolidated into a single generated header file, kernel_marlin.cuh. This streamlines the build process and improves code organization by centralizing kernel includes (see the illustrative sketch below).
  • CUB API Modernization for CUDA 13.0: To maintain compatibility and address deprecations in CUDA 13.0, the moe_topk_softmax_kernels.cu file has been updated to use cuda::functional APIs (cuda::maximum, cuda::minimum) for reduction operations when building with CUDA 13.0 or newer. This ensures the code remains robust with the latest CUDA toolkit.
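
As a rough illustration of this refactor (the file names below are hypothetical examples, not the actual generated kernels), the consolidated header might look like:

// kernel_marlin.cuh -- generated umbrella header (illustrative sketch)
#pragma once
// Each per-configuration kernel, previously compiled as its own .cu translation
// unit, is now emitted as a .cuh header and pulled into a single translation unit here.
#include "kernel_fp16_ku4b8.cuh"  // hypothetical example name
#include "kernel_bf16_ku4b8.cuh"  // hypothetical example name
// ... one include per generated kernel configuration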

On Aug 27, 2025, @rainj-me changed the title from "Dev/support cuda130" to "Support compile sgl-kernel on cuda 13.0".
@gemini-code-assist (bot, Contributor) left a comment:

Code Review

This pull request adds support for CUDA 13.0. The changes include updating the flashinfer dependency, adding new CUDA architecture flags for compilation, and refactoring the marlin MoE kernels to be included as headers. It also addresses API deprecations in CUB for CUDA 13.0. The changes look good and are well-motivated. My feedback focuses on improving the maintainability and portability of the build scripts and code comments.

Comment on lines 62 to 64

 def remove_old_kernels():
-    for filename in glob.glob(os.path.dirname(__file__) + "/kernel_*.cu"):
+    for filename in glob.glob(os.path.dirname(__file__) + "/kernel_*.cuh"):
         subprocess.call(["rm", "-f", filename])

Severity: medium

Using subprocess.call with rm is not platform-independent. It's better to use Python's os module for file operations to ensure portability. os.path.join should also be used for constructing paths. This change improves the robustness and portability of the script.

Suggested change
-def remove_old_kernels():
-    for filename in glob.glob(os.path.dirname(__file__) + "/kernel_*.cuh"):
-        subprocess.call(["rm", "-f", filename])
+def remove_old_kernels():
+    for filename in glob.glob(os.path.join(os.path.dirname(__file__), "kernel_*.cuh")):
+        try:
+            os.remove(filename)
+        except OSError as e:
+            print(f"Error removing file {filename}: {e}")

Comment on lines +37 to +38
// Define reduction operators based on CUDA version
// CUDA 13 (12.9+) deprecated cub::Max/Min in favor of cuda::maximum/minimum

Severity: medium

The comment explaining the CUDA version check is a bit unclear. The version 12090 is likely for a pre-release of CUDA 13.0. A more explicit comment would improve clarity for future maintainers.

// Define reduction operators based on CUDA version.
// In CUDA 13.0, cub::Max/Min were deprecated in favor of cuda::maximum/minimum.
// The version check for 12090 handles pre-release versions of CUDA 13.0.

@zhyncs self-assigned this on Aug 27, 2025.
@FlamingoPg (Collaborator) left a comment:

LGTM. @zhyncs, any other comments?

@FlamingoPg (Collaborator):

@rainj-me this needs a rebase onto master.

@rainj-me merged commit 6b39f9c into sgl-project:main on Aug 28, 2025 (42 of 56 checks passed).
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request on Sep 8, 2025.