[MUSA][8/N] Port CUDA kernels that are compatible with MUSA by yafengio · Pull Request #17946 · sgl-project/sglang

yafengio · 2026-01-29T14:25:20Z

Motivation

This PR continues the ongoing effort (tracked in #16565) to add full support for Moore Threads GPUs in SGLang by leveraging MUSA (Meta-computing Unified System Architecture) for LLM inference.

The primary goal of this submission is to enable core kernel functionality on MUSA by porting CUDA kernels that are compatible with the MUSA programming model, while keeping the codebase unified across CUDA, ROCm, and MUSA backends.

What’s Changed

This PR focuses on the following areas:

1. CUDA Kernel Porting to MUSA

Ported a set of CUDA kernels that are compatible with MUSA to native MUSA implementations.
Covered kernel categories include:
- Custom All Reduce
- Elementwise operations
- GEMM
- MoE and fused MoE
- KV cache and memory utilities
- Speculative decoding
- Quantization (GGUF)
Conditional compilation via USE_MUSA is used to preserve multi-backend compatibility.

2. Custom AllReduce for MUSA

Added a MUSA-specific implementation of custom_all_reduce_2shot.
Uses MUSA-native synchronization and memory semantics instead of reusing CUDA/NCCL paths.
Provides a foundation for multi-GPU inference on Moore Threads hardware.
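
For reviewers unfamiliar with the pattern, here is a rough, CPU-only sketch of what a two-shot all-reduce computes. It assumes that the "2shot" in custom_all_reduce_2shot refers to the common scheme of a reduce-scatter followed by an all-gather; that reading is an assumption, not something taken from the kernel source. The function name `two_shot_all_reduce` and the list-based "ranks" are hypothetical and exist only for illustration. Nothing here touches MUSA synchronization or device memory.

```python
from typing import List


def two_shot_all_reduce(rank_buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a two-shot sum all-reduce across in-process 'ranks'.

    Shot 1 (reduce-scatter): rank r sums chunk r over every rank's buffer.
    Shot 2 (all-gather): every rank copies each reduced chunk from its owner.
    Illustrative only; this is not the MUSA kernel.
    """
    world_size = len(rank_buffers)
    if world_size == 0:
        raise ValueError("at least one rank is required")
    n = len(rank_buffers[0])
    if any(len(buf) != n for buf in rank_buffers):
        raise ValueError("all ranks must hold buffers of equal length")
    if n % world_size != 0:
        raise ValueError("buffer length must be divisible by the number of ranks")
    chunk = n // world_size

    # Shot 1: each rank owns one chunk and reduces it over all peers.
    reduced_chunks = []
    for owner in range(world_size):
        lo, hi = owner * chunk, (owner + 1) * chunk
        reduced_chunks.append(
            [sum(buf[i] for buf in rank_buffers) for i in range(lo, hi)]
        )

    # Shot 2: each rank gathers all reduced chunks into a full result.
    full = [value for part in reduced_chunks for value in part]
    return [list(full) for _ in range(world_size)]


if __name__ == "__main__":
    buffers = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
    print(two_shot_all_reduce(buffers))
```

In this scheme each rank reduces only 1/N of the data in the first shot, which is why it tends to be preferred over a one-shot variant for larger messages.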

3. Build System Integration

Added and updated MUSA-specific build configuration:
- setup_musa.py
- pyproject_musa.toml
Updated dependencies to ensure compatibility (e.g. bumped torchada).
Verified that sgl-kernel can be built and installed in a clean torch_musa container.

Testing Done

Tested in a clean torch_musa container.

Build log:

root@worker:/ws/sgl-kernel# python setup_musa.py install

2026-01-29 08:11:32 | warnings | 139984277657408 | WARNING : /usr/local/lib/python3.10/dist-packages/torch_musa/utils/musa_extension.py:757: UserWarning: TORCH_MUSA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_MUSA_ARCH_LIST'].
  warnings.warn(

2026-01-29 08:11:32 | warnings | 139984277657408 | WARNING : /usr/local/lib/python3.10/dist-packages/setuptools_scm/_integration/setuptools.py:24: RuntimeWarning: 
ERROR: setuptools==59.6.0 is used in combination with setuptools-scm>=8.x

Your build configuration is incomplete and previously worked by accident!
setuptools-scm requires setuptools>=61 (recommended: >=80)

Suggested workaround if applicable:
 - migrating from the deprecated setup_requires mechanism to pep517/518
   and using a pyproject.toml to declare build dependencies
   which are reliably pre-installed before running the build tools

  warnings.warn(

running install
2026-01-29 08:11:32 | warnings | 139984277657408 | WARNING : /usr/lib/python3/dist-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(

2026-01-29 08:11:32 | warnings | 139984277657408 | WARNING : /usr/lib/python3/dist-packages/setuptools/command/easy_install.py:158: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(

running bdist_egg
running egg_info
writing python/sgl_kernel.egg-info/PKG-INFO
writing dependency_links to python/sgl_kernel.egg-info/dependency_links.txt
writing top-level names to python/sgl_kernel.egg-info/top_level.txt
adding license file 'LICENSE'
writing manifest file 'python/sgl_kernel.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib.linux-x86_64-3.10
creating build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/attention.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/spatial.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/gemm.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/__init__.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/sampling.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/sparse_flash_attn.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/cutlass_moe.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/test_utils.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/memory.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/fused_moe.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/elementwise.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/flash_mla.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/_fa4_interface.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/expert_specialization.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/marlin.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/flash_attn.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/utils.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/kvcacheio.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/hadamard.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/version.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/mamba.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/top_k.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/moe.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/allreduce.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/load_utils.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/grammar.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/speculative.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/scalar_type.py -> build/lib.linux-x86_64-3.10/sgl_kernel
creating build/lib.linux-x86_64-3.10/sgl_kernel/testing
copying python/sgl_kernel/testing/rotary_embedding.py -> build/lib.linux-x86_64-3.10/sgl_kernel/testing
copying python/sgl_kernel/testing/__init__.py -> build/lib.linux-x86_64-3.10/sgl_kernel/testing
creating build/lib.linux-x86_64-3.10/sgl_kernel/quantization
copying python/sgl_kernel/quantization/__init__.py -> build/lib.linux-x86_64-3.10/sgl_kernel/quantization
copying python/sgl_kernel/quantization/gguf.py -> build/lib.linux-x86_64-3.10/sgl_kernel/quantization
running build_ext
Cloning third-party repositories...
Third-party repositories ready.
building 'sgl_kernel.common_ops' extension
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/allreduce_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/attention_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/elementwise_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/gemm_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/grammar_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/kvcacheio_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/mamba_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/memory_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/moe_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/quantization
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/quantization/gguf_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/speculative_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/third_party
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/third_party/flashinfer
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/third_party/flashinfer/csrc_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/csrc
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/csrc/elementwise
creating build/bdist.linux-x86_64
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/attention.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/spatial.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/gemm.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/__init__.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/sampling.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/sparse_flash_attn.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/cutlass_moe.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/test_utils.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/memory.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/fused_moe.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/elementwise.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/flash_mla.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/_fa4_interface.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/expert_specialization.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/marlin.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/flash_attn.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/utils.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/kvcacheio.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/hadamard.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/version.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/mamba.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/top_k.py -> build/bdist.linux-x86_64/egg/sgl_kernel
creating build/bdist.linux-x86_64/egg/sgl_kernel/testing
copying build/lib.linux-x86_64-3.10/sgl_kernel/testing/rotary_embedding.py -> build/bdist.linux-x86_64/egg/sgl_kernel/testing
copying build/lib.linux-x86_64-3.10/sgl_kernel/testing/__init__.py -> build/bdist.linux-x86_64/egg/sgl_kernel/testing
copying build/lib.linux-x86_64-3.10/sgl_kernel/moe.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/allreduce.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/common_ops.cpython-310-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/load_utils.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/grammar.py -> build/bdist.linux-x86_64/egg/sgl_kernel
creating build/bdist.linux-x86_64/egg/sgl_kernel/quantization
copying build/lib.linux-x86_64-3.10/sgl_kernel/quantization/__init__.py -> build/bdist.linux-x86_64/egg/sgl_kernel/quantization
copying build/lib.linux-x86_64-3.10/sgl_kernel/quantization/gguf.py -> build/bdist.linux-x86_64/egg/sgl_kernel/quantization
copying build/lib.linux-x86_64-3.10/sgl_kernel/speculative.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/scalar_type.py -> build/bdist.linux-x86_64/egg/sgl_kernel
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/attention.py to attention.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/spatial.py to spatial.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/gemm.py to gemm.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/sampling.py to sampling.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/sparse_flash_attn.py to sparse_flash_attn.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/cutlass_moe.py to cutlass_moe.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/test_utils.py to test_utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/memory.py to memory.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/fused_moe.py to fused_moe.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/elementwise.py to elementwise.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/flash_mla.py to flash_mla.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/_fa4_interface.py to _fa4_interface.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/expert_specialization.py to expert_specialization.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/marlin.py to marlin.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/flash_attn.py to flash_attn.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/utils.py to utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/kvcacheio.py to kvcacheio.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/hadamard.py to hadamard.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/version.py to version.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/mamba.py to mamba.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/top_k.py to top_k.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/testing/rotary_embedding.py to rotary_embedding.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/testing/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/moe.py to moe.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/allreduce.py to allreduce.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/load_utils.py to load_utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/grammar.py to grammar.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/quantization/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/quantization/gguf.py to gguf.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/speculative.py to speculative.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/scalar_type.py to scalar_type.cpython-310.pyc
creating stub loader for sgl_kernel/common_ops.cpython-310-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/common_ops.py to common_ops.cpython-310.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying python/sgl_kernel.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying python/sgl_kernel.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying python/sgl_kernel.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying python/sgl_kernel.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
creating dist
creating 'dist/sgl_kernel-0.3.21-py3.10-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing sgl_kernel-0.3.21-py3.10-linux-x86_64.egg
removing '/usr/local/lib/python3.10/dist-packages/sgl_kernel-0.3.21-py3.10-linux-x86_64.egg' (and everything under it)
creating /usr/local/lib/python3.10/dist-packages/sgl_kernel-0.3.21-py3.10-linux-x86_64.egg
Extracting sgl_kernel-0.3.21-py3.10-linux-x86_64.egg to /usr/local/lib/python3.10/dist-packages
sgl-kernel 0.3.21 is already the active version in easy-install.pth

Installed /usr/local/lib/python3.10/dist-packages/sgl_kernel-0.3.21-py3.10-linux-x86_64.egg
Processing dependencies for sgl-kernel==0.3.21
Finished processing dependencies for sgl-kernel==0.3.21

Launch the server:

export MUSA_VISIBLE_DEVICES=0,1
python3 -m sglang.launch_server \
    --model-path /home/dist/Qwen3-0.6B/ \
    --served-model-name base-model \
    --disable-radix-cache \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 31000 \
    --tp-size 2 \
    --dp-size 2 \
    --enable-dp-attention \
    --cuda-graph-max-bs 10

# Note: keep --sampling-backend flashinfer disabled for now.

Benchmarking and Profiling

Checklist

Format your code according to "Format code with pre-commit".
Add unit tests according to "Run and add unit tests".
Update documentation according to "Write documentations".
Provide accuracy and speed benchmark results according to "Test the accuracy" and "Benchmark the speed".
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-01-29T14:25:51Z

Summary of Changes

Hello @yafengio, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly advances the support for Moore Threads GPUs by porting a wide array of existing CUDA kernels to the MUSA architecture. The changes introduce MUSA-specific implementations for critical operations like all-reduce and fused normalization, ensuring optimal performance and compatibility. This work is a crucial step towards enabling efficient Large Language Model (LLM) inference on Moore Threads hardware, maintaining a unified codebase through conditional compilation.

Highlights

MUSA Kernel Porting: Many core CUDA kernels across various functionalities (allreduce, attention, elementwise, GEMM, MoE, etc.) have been ported to be compatible with the MUSA architecture, enabling execution on Moore Threads GPUs.
Custom AllReduce Implementation: An optimized distributed communication primitive, custom_all_reduce_2shot, has been implemented specifically for MUSA's hardware characteristics, including MUSA-specific synchronization primitives and memory operations.
Multi-Platform Compatibility: Extensive use of USE_MUSA preprocessor guards has been added throughout the codebase to ensure compatibility and conditional compilation for CUDA, ROCm, and MUSA backends.
Build System Integration: The build system (setup_musa.py and pyproject_musa.toml) has been updated to include the new MUSA-specific source files and adjust dependencies, such as torchada.
Fused Add RMS Norm Kernel for MUSA: A new MUSA-specific kernel (fused_add_rms_norm_kernel.mu) has been added to perform fused add RMS normalization, leveraging MUSA's vectorized types and optimized memory access.
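
As a reference for what a fused add RMS norm computes, here is a minimal pure-Python sketch. It assumes the usual semantics of this operation (add the input to the residual, then RMS-normalize the sum and scale it by a weight vector); the actual fused_add_rms_norm_kernel.mu may differ in details such as in-place updates, data types, or the default epsilon. The function name and signature below are hypothetical.

```python
import math
from typing import List, Tuple


def fused_add_rms_norm(
    x: List[float],
    residual: List[float],
    weight: List[float],
    eps: float = 1e-6,
) -> Tuple[List[float], List[float]]:
    """Reference semantics for a fused residual add + RMS norm on one row.

    Returns (normalized_output, updated_residual). Illustrative only.
    """
    if not x or not (len(x) == len(residual) == len(weight)):
        raise ValueError("x, residual and weight must be non-empty and equal length")

    # Step 1: the residual add that the fused kernel folds into the same pass.
    new_residual = [a + b for a, b in zip(x, residual)]

    # Step 2: RMS normalization of the summed row, scaled by weight.
    mean_sq = sum(v * v for v in new_residual) / len(new_residual)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    out = [v * inv_rms * w for v, w in zip(new_residual, weight)]
    return out, new_residual


if __name__ == "__main__":
    out, res = fused_add_rms_norm([1.0, 2.0], [2.0, 2.0], [1.0, 1.0])
    print(out, res)
```

Fusing the two steps avoids writing the summed residual to memory and reading it back before normalization, which is the usual motivation for this kind of kernel.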

gemini-code-assist

Code Review

This pull request ports a significant number of CUDA kernels to the MUSA architecture, enabling GPU acceleration on Moore Threads hardware. The changes include adding MUSA-specific implementations using preprocessor guards, updating build configurations, and introducing a custom AllReduce implementation for MUSA. The porting effort is extensive and seems well-executed. I've found one critical issue related to a missing kernel dispatch path for the float data type and a few minor maintainability issues. Overall, great work on this large porting task.

yeahdongcn · 2026-01-30T02:48:54Z

/tag-and-rerun-ci

yeahdongcn · 2026-02-01T02:20:03Z

Please also rebase onto upstream/main, as #18035 has been merged. Thanks!

yafengio · 2026-04-14T10:55:56Z

/rerun-failed-ci

yafengio · 2026-04-14T12:50:05Z

/rerun-failed-ci

yafengio · 2026-04-14T15:22:06Z

/rerun-failed-ci

yafengio · 2026-04-15T02:14:29Z

/rerun-failed-ci

yafengio · 2026-04-15T05:54:09Z

/rerun-failed-ci

yafengio · 2026-04-16T00:40:44Z

/rerun-failed-ci

yafengio · 2026-04-16T10:41:02Z

/rerun-failed-ci

yafengio · 2026-04-17T00:09:25Z

/rerun-failed-ci

yeahdongcn · 2026-04-18T01:47:59Z

/rerun-failed-ci

yafengio · 2026-04-20T02:26:20Z

/rerun-failed-ci

yafengio · 2026-04-21T02:52:48Z

/rerun-failed-ci

alexnails · 2026-04-21T23:43:06Z

Is this just CI Flakiness or what are issues for merging this?

yeahdongcn · 2026-04-22T02:23:40Z

> Is this just CI Flakiness or what are issues for merging this?

We've been unable to get all NVIDIA CI checks to pass despite multiple attempts. Could you please help merge this? Thanks!

yeahdongcn · 2026-04-23T07:05:10Z

/rerun-failed-ci

yafengio · 2026-04-23T10:35:11Z

/rerun-failed-ci

yafengio · 2026-04-23T12:38:36Z

/rerun-failed-ci

yafengio requested review from BBuf, FlamingoPg, HaiShaw, ispobock, merrymercy, yizhang2077 and zhyncs as code owners January 29, 2026 14:25

github-actions Bot added the quant, dependencies, and sgl-kernel labels Jan 29, 2026

gemini-code-assist Bot reviewed Jan 29, 2026

yeahdongcn mentioned this pull request Jan 30, 2026

[Roadmap][Feature] Support Moore Threads (MUSA) GPU #16565 (Open, 2 tasks)

yeahdongcn added the mthreads label Jan 30, 2026

yeahdongcn reviewed Jan 30, 2026

yafengio force-pushed the feat/port-cuda-kernels2 branch 4 times, most recently from 0d68ad3 to 607c88f on January 30, 2026 02:11

yeahdongcn reviewed Jan 30, 2026

Comment thread sgl-kernel/csrc/allreduce/custom_all_reduce.cuh Outdated

Comment thread sgl-kernel/csrc/allreduce/custom_all_reduce.cuh Outdated

Comment thread sgl-kernel/csrc/gemm/dsv3_fused_a_gemm.cu Outdated

yafengio force-pushed the feat/port-cuda-kernels2 branch from 607c88f to 3dd4964 on January 30, 2026 02:42

yeahdongcn approved these changes Jan 30, 2026

github-actions Bot added the run-ci label Jan 30, 2026

yafengio marked this pull request as draft January 30, 2026 13:04

yafengio force-pushed the feat/port-cuda-kernels2 branch 3 times, most recently from 30bfc6c to ce7e6a4 on February 2, 2026 08:49

yeahdongcn mentioned this pull request Feb 6, 2026

[MUSA][10/N] Add GGUF support #18357 (Merged, 5 tasks)

yafengio force-pushed the feat/port-cuda-kernels2 branch from 3ccac0f to 4226d4c on April 14, 2026 09:03

yeahdongcn mentioned this pull request Apr 14, 2026

[MUSA][16/N] Add MUSA backend support for layers and DeepSeek models (V2/V3/R1) #22774 (Merged, 5 tasks)

sgl-project deleted a comment from yafengio Apr 14, 2026

yafengio force-pushed the feat/port-cuda-kernels2 branch from 4226d4c to 43fe4c0 on April 14, 2026 13:52

yafengio force-pushed the feat/port-cuda-kernels2 branch from d890804 to 7753f29 on April 15, 2026 02:20

sgl-project deleted a comment from yafengio Apr 15, 2026

yafengio force-pushed the feat/port-cuda-kernels2 branch from 7753f29 to 2470b79 on April 16, 2026 08:44

yeahdongcn force-pushed the feat/port-cuda-kernels2 branch from 2470b79 to 121ebf8 on April 17, 2026 13:42

yeahdongcn force-pushed the feat/port-cuda-kernels2 branch from 121ebf8 to 0d7bd1d on April 20, 2026 05:18

[MUSA] Port CUDA kernels that are compatible with MUSA

8c7dd1e

Signed-off-by: yafeng.li <yafeng.li@mthreads.com>

yeahdongcn force-pushed the feat/port-cuda-kernels2 branch from 0d7bd1d to 8c7dd1e on April 21, 2026 00:52

Merge branch 'main' into feat/port-cuda-kernels2

c9c56f4

Kangyan-Zhou merged commit 74c2e5b into sgl-project:main Apr 24, 2026
334 of 401 checks passed

LucQueen pushed a commit to LucQueen/sglang that referenced this pull request May 12, 2026

[MUSA][8/N] Port CUDA kernels that are compatible with MUSA (sgl-proj…

d30de81

…ect#17946) Signed-off-by: yafeng.li <yafeng.li@mthreads.com> Co-authored-by: Alex Nails <alex.nails@radixark.ai>

6 participants
