
Conversation

@rainj-me (Collaborator)

Motivation

#9490
PR 9495

  • Support CUDA 13.0 with a custom FlashInfer build and the TRT-LLM kernels from Aug 25, 2025
  • Support sm_110 and sm_121 on CUDA 13.0 (see the illustrative compile flags below)
  • Support --compress-mode=size on CUDA 13.0
  • Keep sm_101 support on CUDA 12.8/12.9
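
For reference, on a CUDA 13.0 build the flags implied by the bullets above would look roughly like the following on the nvcc command line (an illustrative sketch; the actual flags are set in sgl-kernel's build configuration and may differ):

-gencode arch=compute_110,code=sm_110 -gencode arch=compute_121,code=sm_121 --compress-mode=size

while CUDA 12.8/12.9 builds keep -gencode arch=compute_101,code=sm_101.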

Test

  • Step 1, pull the NVIDIA PyTorch 25.08 image
docker pull nvcr.io/nvidia/pytorch:25.08-py3
  • Step 2, run the container with a bash shell, for example:
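# one possible invocation; flags here are illustrative, adjust mounts and options to your setup
docker run --gpus all -it --ipc=host nvcr.io/nvidia/pytorch:25.08-py3 bash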
  • Step 3, clone the repository and check out this change, for example:
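# repository taken from the merge target (sgl-project/sglang); the branch name is a placeholder
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout <branch-with-this-change>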
  • Step 4, comment out the torch build dependency in sgl-kernel/pyproject.toml with a patch like:
diff --git a/sgl-kernel/pyproject.toml b/sgl-kernel/pyproject.toml
index 52ee620e4..177e49e57 100644
--- a/sgl-kernel/pyproject.toml
+++ b/sgl-kernel/pyproject.toml
@@ -1,7 +1,7 @@
 [build-system]
 requires = [
   "scikit-build-core>=0.10",
-  "torch>=2.8.0",
+  # "torch>=2.8.0",
   "wheel",
 ]
  • Step 5, patch python/pyproject.toml to match the torch version shipped in the container; in my container the patch looks like:
diff --git a/python/pyproject.toml b/python/pyproject.toml
index c23efbc2e..b29789d45 100644
--- a/python/pyproject.toml
+++ b/python/pyproject.toml
@@ -49,7 +49,7 @@ runtime_common = [
     "scipy",
     "timm==1.0.16",
     "tiktoken",
-    "torchao==0.9.0",
+    "torchao==0.12.0+git",
     "transformers==4.55.2",
     "uvicorn",
     "uvloop",
@@ -59,21 +59,19 @@ runtime_common = [
 srt = [
     "sglang[runtime_common]",
     "sgl-kernel==0.3.5",
-    "torch==2.8.0",
-    "torchaudio==2.8.0",
+    "torch==2.8.0a0+34c6371d24.nv25.8",
     "torchvision",
     "cuda-python",
-    "flashinfer_python==0.2.11.post3",
+    "flashinfer_python==0.2.14.post1",
 ]
 
 blackwell = [
     "sglang[runtime_common]",
     "sgl-kernel",
-    "torch==2.8.0",
-    "torchaudio==2.8.0",
+    "torch==2.8.0a0+34c6371d24.nv25.8",
     "torchvision",
     "cuda-python",
-    "flashinfer_python==0.2.11.post3",
+    "flashinfer_python==0.2.14.post1",
 ]
  • Step 6, install sgl-kernel
CUDA_VERSION=13.0 CMAKE_BUILD_PARALLEL_LEVEL="$(nproc)" SKBUILD_BUILD_DIR=./build CMAKE_ARGS="-DCMAKE_POLICY_VERSION_MINIMUM=3.5"  pip install -v .
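# optional sanity check (assumes the wheel is importable as sgl_kernel)
python -c "import torch, sgl_kernel; print(torch.version.cuda)"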

Modifications

  • Use a custom FlashInfer build to support CUDA 13.0 and load the TRT-LLM kernels from Aug 25, 2025
  • Fix the cub::Sum/cub::Max deprecation issue so the kernels build on both CUDA 12.x and CUDA 13.0 (see the sketch below)
  • Use torch 2.8.x with CUDA 13.0
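
A minimal sketch of the CUB compatibility shim from the second bullet, assuming the version check keys off CUDART_VERSION with the 12090 threshold mentioned in review (the PR's actual macro, threshold, and call sites may differ):

#include <cub/cub.cuh>

#if defined(CUDART_VERSION) && CUDART_VERSION >= 12090
// CUDA 13.0 (and 12.9 pre-releases) deprecate cub::Max/cub::Min/cub::Sum
// in favor of the libcu++ functional objects.
#include <cuda/functional>      // cuda::maximum, cuda::minimum
#include <cuda/std/functional>  // cuda::std::plus
using MaxOp = cuda::maximum<>;
using SumOp = cuda::std::plus<>;
#else
using MaxOp = cub::Max;
using SumOp = cub::Sum;
#endif

template <int BLOCK_SIZE>
__global__ void block_max_sum(const float* in, float* out_max, float* out_sum) {
  // Each block reduces BLOCK_SIZE contiguous elements; the call sites are
  // identical on CUDA 12.x and 13.0, only the operator types change.
  using BlockReduce = cub::BlockReduce<float, BLOCK_SIZE>;
  __shared__ typename BlockReduce::TempStorage tmp;
  float v = in[blockIdx.x * BLOCK_SIZE + threadIdx.x];
  float m = BlockReduce(tmp).Reduce(v, MaxOp());
  __syncthreads();  // temp storage is reused for the second reduction
  float s = BlockReduce(tmp).Reduce(v, SumOp());
  if (threadIdx.x == 0) {
    out_max[blockIdx.x] = m;
    out_sum[blockIdx.x] = s;
  }
}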

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist (bot, Contributor) left a comment:

Summary of Changes

Hello @rainj-me, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for CUDA 13.0, enabling compatibility with newer NVIDIA GPU architectures and features. It updates the build system to incorporate specific CUDA 13.0 flags and integrates a custom version of FlashInfer. Additionally, it refactors the Marlin MoE kernel compilation process for improved organization and addresses CUB library compatibility issues with CUDA 13.0 by adopting newer CUDA functional APIs for reduction operations.

Highlights

  • CUDA 13.0 Compatibility: This PR introduces comprehensive support for CUDA 13.0, enabling the project to leverage the latest NVIDIA GPU architectures and features. This includes adding specific sm_110 and sm_121 compute capabilities and the --compress-mode=size flag for CUDA 13.0 builds, while ensuring backward compatibility for older CUDA versions.
  • FlashInfer Dependency Update: The FlashInfer dependency has been updated to a specific commit (018b551825c8e5579206e6eb9d3229fa679202b3), indicating the integration of a custom or updated version of FlashInfer that is likely optimized or required for CUDA 13.0 compatibility.
  • Marlin MoE Kernel Refactoring: The Marlin MoE kernel compilation process has been refactored. Individual .cu kernel files are now renamed to .cuh and consolidated into a single generated header file, kernel_marlin.cuh. This streamlines the build process and improves code organization by centralizing kernel includes (see the illustrative sketch below).
  • CUB API Modernization for CUDA 13.0: To maintain compatibility and address deprecations in CUDA 13.0, the moe_topk_softmax_kernels.cu file has been updated to use cuda::functional APIs (cuda::maximum, cuda::minimum) for reduction operations when building with CUDA 13.0 or newer. This ensures the code remains robust with the latest CUDA toolkit.
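
As a rough illustration of this refactor (the file names below are hypothetical examples, not the actual generated kernels), the consolidated header might look like:

// kernel_marlin.cuh -- generated umbrella header (illustrative sketch)
#pragma once
// Each per-configuration kernel, previously compiled as its own .cu translation
// unit, is now emitted as a .cuh header and pulled into a single translation unit here.
#include "kernel_fp16_ku4b8.cuh"  // hypothetical example name
#include "kernel_bf16_ku4b8.cuh"  // hypothetical example name
// ... one include per generated kernel configuration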

On Aug 27, 2025, @rainj-me changed the title from "Dev/support cuda130" to "Support compile sgl-kernel on cuda 13.0".
@gemini-code-assist (bot, Contributor) left a comment:

Code Review

This pull request adds support for CUDA 13.0. The changes include updating the flashinfer dependency, adding new CUDA architecture flags for compilation, and refactoring the marlin MoE kernels to be included as headers. It also addresses API deprecations in CUB for CUDA 13.0. The changes look good and are well-motivated. My feedback focuses on improving the maintainability and portability of the build scripts and code comments.

Comment on lines 62 to 64

 def remove_old_kernels():
-    for filename in glob.glob(os.path.dirname(__file__) + "/kernel_*.cu"):
+    for filename in glob.glob(os.path.dirname(__file__) + "/kernel_*.cuh"):
         subprocess.call(["rm", "-f", filename])

Severity: medium

Using subprocess.call with rm is not platform-independent. It's better to use Python's os module for file operations to ensure portability. os.path.join should also be used for constructing paths. This change improves the robustness and portability of the script.

Suggested change
-def remove_old_kernels():
-    for filename in glob.glob(os.path.dirname(__file__) + "/kernel_*.cuh"):
-        subprocess.call(["rm", "-f", filename])
+def remove_old_kernels():
+    for filename in glob.glob(os.path.join(os.path.dirname(__file__), "kernel_*.cuh")):
+        try:
+            os.remove(filename)
+        except OSError as e:
+            print(f"Error removing file {filename}: {e}")

Comment on lines +37 to +38
// Define reduction operators based on CUDA version
// CUDA 13 (12.9+) deprecated cub::Max/Min in favor of cuda::maximum/minimum

Severity: medium

The comment explaining the CUDA version check is a bit unclear. The version 12090 is likely for a pre-release of CUDA 13.0. A more explicit comment would improve clarity for future maintainers.

// Define reduction operators based on CUDA version.
// In CUDA 13.0, cub::Max/Min were deprecated in favor of cuda::maximum/minimum.
// The version check for 12090 handles pre-release versions of CUDA 13.0.

@zhyncs self-assigned this on Aug 27, 2025.
@FlamingoPg (Collaborator) left a comment:

LGTM. @zhyncs, any other comments?

@FlamingoPg (Collaborator):

@rainj-me this needs a rebase onto master.

@rainj-me merged commit 6b39f9c into sgl-project:main on Aug 28, 2025 (42 of 56 checks passed).
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request on Sep 8, 2025.