[MUSA][8/N] Port CUDA kernels that are compatible with MUSA #17946

Merged: Kangyan-Zhou merged 2 commits into sgl-project:main from yafengio:feat/port-cuda-kernels2 on Apr 24, 2026

@yafengio (Contributor) commented Jan 29, 2026

Motivation

This PR continues the ongoing effort (tracked in #16565) to add full support for Moore Threads GPUs in SGLang by leveraging MUSA (Meta-computing Unified System Architecture) for LLM inference.

The primary goal of this submission is to enable core kernel functionality on MUSA by porting CUDA kernels that are compatible with the MUSA programming model, while keeping the codebase unified across CUDA, ROCm, and MUSA backends.


What’s Changed

This PR focuses on the following areas:

1. CUDA Kernel Porting to MUSA

  • Ported a set of CUDA kernels that are compatible with MUSA to native MUSA implementations.

  • Covered kernel categories include:

    • Custom All Reduce
    • Elementwise operations
    • GEMM
    • MoE and fused MoE
    • KV cache and memory utilities
    • Speculative decoding
    • Quantization (GGUF)
  • Conditional compilation via USE_MUSA is used to preserve multi-backend compatibility.

2. Custom AllReduce for MUSA

  • Added a MUSA-specific implementation of custom_all_reduce_2shot.
  • Uses MUSA-native synchronization and memory semantics instead of reusing CUDA/NCCL paths.
  • Provides a foundation for multi-GPU inference on Moore Threads hardware.
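
To make the "2shot" naming concrete, here is a minimal CPU-only sketch of the general technique usually meant by a two-shot all-reduce: each rank first reduces one shard of the buffer (reduce-scatter), then every rank gathers all reduced shards (all-gather). This is not the MUSA kernel from this PR. The function name `two_shot_all_reduce` and the list-based "ranks" are hypothetical, and the sketch assumes the buffer length divides evenly by the number of ranks.

```python
from typing import List


def two_shot_all_reduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a sum all-reduce across ranks in two phases.

    buffers[r] is the input held by rank r. Assumes every buffer has the
    same length and that the length is divisible by the number of ranks.
    """
    world_size = len(buffers)
    n = len(buffers[0])
    if any(len(b) != n for b in buffers) or n % world_size != 0:
        raise ValueError("buffers must have equal length divisible by world size")
    shard = n // world_size

    # Shot 1 (reduce-scatter): rank r sums shard r across all ranks.
    reduced_shards = []
    for r in range(world_size):
        lo, hi = r * shard, (r + 1) * shard
        reduced_shards.append(
            [sum(buffers[peer][i] for peer in range(world_size)) for i in range(lo, hi)]
        )

    # Shot 2 (all-gather): every rank copies each owner's reduced shard.
    gathered = [x for part in reduced_shards for x in part]
    return [list(gathered) for _ in range(world_size)]


if __name__ == "__main__":
    out = two_shot_all_reduce([[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]])
    print(out[0])
```

On real hardware the two phases would be separated by a cross-device barrier, which is presumably where the MUSA-native synchronization mentioned above comes in; the sketch leaves that out entirely.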

3. Build System Integration

  • Added and updated MUSA-specific build configuration:

    • setup_musa.py
    • pyproject_musa.toml
  • Updated dependencies to ensure compatibility (e.g. bumped torchada).

  • Verified that sgl-kernel can be built and installed in a clean torch_musa container.

Testing Done

Tested in a clean torch_musa container.

<===Click to expand log details===>
root@worker:/ws/sgl-kernel# python setup_musa.py install

2026-01-29 08:11:32 | warnings | 139984277657408 | WARNING : /usr/local/lib/python3.10/dist-packages/torch_musa/utils/musa_extension.py:757: UserWarning: TORCH_MUSA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_MUSA_ARCH_LIST'].
  warnings.warn(

2026-01-29 08:11:32 | warnings | 139984277657408 | WARNING : /usr/local/lib/python3.10/dist-packages/setuptools_scm/_integration/setuptools.py:24: RuntimeWarning: 
ERROR: setuptools==59.6.0 is used in combination with setuptools-scm>=8.x

Your build configuration is incomplete and previously worked by accident!
setuptools-scm requires setuptools>=61 (recommended: >=80)

Suggested workaround if applicable:
 - migrating from the deprecated setup_requires mechanism to pep517/518
   and using a pyproject.toml to declare build dependencies
   which are reliably pre-installed before running the build tools

  warnings.warn(

running install
2026-01-29 08:11:32 | warnings | 139984277657408 | WARNING : /usr/lib/python3/dist-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(

2026-01-29 08:11:32 | warnings | 139984277657408 | WARNING : /usr/lib/python3/dist-packages/setuptools/command/easy_install.py:158: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(

running bdist_egg
running egg_info
writing python/sgl_kernel.egg-info/PKG-INFO
writing dependency_links to python/sgl_kernel.egg-info/dependency_links.txt
writing top-level names to python/sgl_kernel.egg-info/top_level.txt
adding license file 'LICENSE'
writing manifest file 'python/sgl_kernel.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib.linux-x86_64-3.10
creating build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/attention.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/spatial.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/gemm.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/__init__.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/sampling.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/sparse_flash_attn.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/cutlass_moe.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/test_utils.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/memory.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/fused_moe.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/elementwise.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/flash_mla.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/_fa4_interface.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/expert_specialization.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/marlin.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/flash_attn.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/utils.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/kvcacheio.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/hadamard.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/version.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/mamba.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/top_k.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/moe.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/allreduce.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/load_utils.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/grammar.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/speculative.py -> build/lib.linux-x86_64-3.10/sgl_kernel
copying python/sgl_kernel/scalar_type.py -> build/lib.linux-x86_64-3.10/sgl_kernel
creating build/lib.linux-x86_64-3.10/sgl_kernel/testing
copying python/sgl_kernel/testing/rotary_embedding.py -> build/lib.linux-x86_64-3.10/sgl_kernel/testing
copying python/sgl_kernel/testing/__init__.py -> build/lib.linux-x86_64-3.10/sgl_kernel/testing
creating build/lib.linux-x86_64-3.10/sgl_kernel/quantization
copying python/sgl_kernel/quantization/__init__.py -> build/lib.linux-x86_64-3.10/sgl_kernel/quantization
copying python/sgl_kernel/quantization/gguf.py -> build/lib.linux-x86_64-3.10/sgl_kernel/quantization
running build_ext
Cloning third-party repositories...
Third-party repositories ready.
building 'sgl_kernel.common_ops' extension
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/allreduce_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/attention_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/elementwise_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/gemm_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/grammar_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/kvcacheio_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/mamba_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/memory_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/moe_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/quantization
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/quantization/gguf_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc/speculative_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/csrc_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/third_party
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/third_party/flashinfer
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/sgl-workspace/sglang/sgl-kernel/third_party/flashinfer/csrc_musa
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/csrc
creating /sgl-workspace/sglang/sgl-kernel/build/temp.linux-x86_64-3.10/csrc/elementwise
creating build/bdist.linux-x86_64
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/attention.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/spatial.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/gemm.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/__init__.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/sampling.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/sparse_flash_attn.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/cutlass_moe.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/test_utils.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/memory.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/fused_moe.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/elementwise.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/flash_mla.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/_fa4_interface.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/expert_specialization.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/marlin.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/flash_attn.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/utils.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/kvcacheio.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/hadamard.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/version.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/mamba.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/top_k.py -> build/bdist.linux-x86_64/egg/sgl_kernel
creating build/bdist.linux-x86_64/egg/sgl_kernel/testing
copying build/lib.linux-x86_64-3.10/sgl_kernel/testing/rotary_embedding.py -> build/bdist.linux-x86_64/egg/sgl_kernel/testing
copying build/lib.linux-x86_64-3.10/sgl_kernel/testing/__init__.py -> build/bdist.linux-x86_64/egg/sgl_kernel/testing
copying build/lib.linux-x86_64-3.10/sgl_kernel/moe.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/allreduce.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/common_ops.cpython-310-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/load_utils.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/grammar.py -> build/bdist.linux-x86_64/egg/sgl_kernel
creating build/bdist.linux-x86_64/egg/sgl_kernel/quantization
copying build/lib.linux-x86_64-3.10/sgl_kernel/quantization/__init__.py -> build/bdist.linux-x86_64/egg/sgl_kernel/quantization
copying build/lib.linux-x86_64-3.10/sgl_kernel/quantization/gguf.py -> build/bdist.linux-x86_64/egg/sgl_kernel/quantization
copying build/lib.linux-x86_64-3.10/sgl_kernel/speculative.py -> build/bdist.linux-x86_64/egg/sgl_kernel
copying build/lib.linux-x86_64-3.10/sgl_kernel/scalar_type.py -> build/bdist.linux-x86_64/egg/sgl_kernel
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/attention.py to attention.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/spatial.py to spatial.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/gemm.py to gemm.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/sampling.py to sampling.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/sparse_flash_attn.py to sparse_flash_attn.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/cutlass_moe.py to cutlass_moe.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/test_utils.py to test_utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/memory.py to memory.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/fused_moe.py to fused_moe.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/elementwise.py to elementwise.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/flash_mla.py to flash_mla.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/_fa4_interface.py to _fa4_interface.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/expert_specialization.py to expert_specialization.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/marlin.py to marlin.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/flash_attn.py to flash_attn.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/utils.py to utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/kvcacheio.py to kvcacheio.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/hadamard.py to hadamard.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/version.py to version.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/mamba.py to mamba.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/top_k.py to top_k.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/testing/rotary_embedding.py to rotary_embedding.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/testing/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/moe.py to moe.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/allreduce.py to allreduce.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/load_utils.py to load_utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/grammar.py to grammar.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/quantization/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/quantization/gguf.py to gguf.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/speculative.py to speculative.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/scalar_type.py to scalar_type.cpython-310.pyc
creating stub loader for sgl_kernel/common_ops.cpython-310-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/sgl_kernel/common_ops.py to common_ops.cpython-310.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying python/sgl_kernel.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying python/sgl_kernel.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying python/sgl_kernel.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying python/sgl_kernel.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
creating dist
creating 'dist/sgl_kernel-0.3.21-py3.10-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing sgl_kernel-0.3.21-py3.10-linux-x86_64.egg
removing '/usr/local/lib/python3.10/dist-packages/sgl_kernel-0.3.21-py3.10-linux-x86_64.egg' (and everything under it)
creating /usr/local/lib/python3.10/dist-packages/sgl_kernel-0.3.21-py3.10-linux-x86_64.egg
Extracting sgl_kernel-0.3.21-py3.10-linux-x86_64.egg to /usr/local/lib/python3.10/dist-packages
sgl-kernel 0.3.21 is already the active version in easy-install.pth

Installed /usr/local/lib/python3.10/dist-packages/sgl_kernel-0.3.21-py3.10-linux-x86_64.egg
Processing dependencies for sgl-kernel==0.3.21
Finished processing dependencies for sgl-kernel==0.3.21

Launch the server:

export MUSA_VISIBLE_DEVICES=0,1
python3 -m sglang.launch_server \
    --model-path /home/dist/Qwen3-0.6B/ \
    --served-model-name base-model \
    --disable-radix-cache \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 31000 \
    --tp-size 2 \
    --dp-size 2 \
    --enable-dp-attention \
    --cuda-graph-max-bs 10

   # --sampling-backend flashinfer   # Note: keep it disabled for now.
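
Once the server is up, a quick smoke test could look like the sketch below. It assumes SGLang's native `/generate` endpoint, which accepts a JSON body with `text` and `sampling_params` and returns JSON containing a `text` field, and it reuses port 31000 from the command above. The helper names `build_generate_request` and `generate`, the prompt, and the sampling values are illustrative, not part of this PR.

```python
import json
import urllib.request


def build_generate_request(prompt: str, host: str = "127.0.0.1", port: int = 31000,
                           max_new_tokens: int = 32) -> urllib.request.Request:
    """Build a POST request for the server's /generate endpoint."""
    payload = {
        "text": prompt,
        "sampling_params": {"temperature": 0, "max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(prompt), timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))["text"]
```

With the server running, calling `generate("The capital of France is")` should return the completion text; a connection error usually means the server has not finished loading the model.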

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions (bot) added the quant, dependencies, and sgl-kernel labels on Jan 29, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @yafengio, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly advances the support for Moore Threads GPUs by porting a wide array of existing CUDA kernels to the MUSA architecture. The changes introduce MUSA-specific implementations for critical operations like all-reduce and fused normalization, ensuring optimal performance and compatibility. This work is a crucial step towards enabling efficient Large Language Model (LLM) inference on Moore Threads hardware, maintaining a unified codebase through conditional compilation.

Highlights

  • MUSA Kernel Porting: Many core CUDA kernels across various functionalities (allreduce, attention, elementwise, GEMM, MoE, etc.) have been ported to be compatible with the MUSA architecture, enabling execution on Moore Threads GPUs.
  • Custom AllReduce Implementation: An optimized distributed communication primitive, custom_all_reduce_2shot, has been implemented specifically for MUSA's hardware characteristics, including MUSA-specific synchronization primitives and memory operations.
  • Multi-Platform Compatibility: Extensive use of USE_MUSA preprocessor guards has been added throughout the codebase to ensure compatibility and conditional compilation for CUDA, ROCm, and MUSA backends.
  • Build System Integration: The build system (setup_musa.py and pyproject_musa.toml) has been updated to include the new MUSA-specific source files and adjust dependencies, such as torchada.
  • Fused Add RMS Norm Kernel for MUSA: A new MUSA-specific kernel (fused_add_rms_norm_kernel.mu) has been added to perform fused add RMS normalization, leveraging MUSA's vectorized types and optimized memory access.
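
For readers unfamiliar with the operation, the following is a plain-Python reference for what a fused add + RMS norm is generally understood to compute on one row: add the residual to the input, then RMS-normalize the sum and scale it by a weight vector. It is a sketch of the math only, not the contents of `fused_add_rms_norm_kernel.mu`; the function name, the `eps` default, and the choice to return the updated residual alongside the output are assumptions.

```python
import math
from typing import List, Tuple


def fused_add_rms_norm_reference(
    x: List[float], residual: List[float], weight: List[float], eps: float = 1e-6
) -> Tuple[List[float], List[float]]:
    """Reference for one row: add the residual, then RMS-normalize the sum.

    Returns (normalized_output, updated_residual). A fused GPU kernel would
    do both steps in one pass to avoid an extra round trip to memory.
    """
    if not (len(x) == len(residual) == len(weight)) or not x:
        raise ValueError("x, residual, and weight must be non-empty and equal length")

    # Step 1: residual add. The sum is also what the next layer's residual sees.
    summed = [a + b for a, b in zip(x, residual)]

    # Step 2: RMSNorm. Scale by 1 / sqrt(mean(summed^2) + eps), then by weight.
    mean_sq = sum(v * v for v in summed) / len(summed)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    out = [v * inv_rms * w for v, w in zip(summed, weight)]
    return out, summed
```

A reference like this is mainly useful as a correctness oracle: a device kernel's output for each supported dtype can be compared against it within a tolerance.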

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request ports a significant number of CUDA kernels to the MUSA architecture, enabling GPU acceleration on Moore Threads hardware. The changes include adding MUSA-specific implementations using preprocessor guards, updating build configurations, and introducing a custom AllReduce implementation for MUSA. The porting effort is extensive and seems well-executed. I've found one critical issue related to a missing kernel dispatch path for the float data type and a few minor maintainability issues. Overall, great work on this large porting task.

Outdated review comment threads:

  • sgl-kernel/csrc/elementwise/fused_add_rms_norm_kernel.mu (5 threads)
  • sgl-kernel/csrc/allreduce/custom_all_reduce.cuh (2 threads)
  • sgl-kernel/csrc/moe/moe_sum_reduce.cu (1 thread)
  • sgl-kernel/pyproject_musa.toml (1 thread)

@yafengio force-pushed the feat/port-cuda-kernels2 branch 4 times, most recently from 0d68ad3 to 607c88f, on January 30, 2026 02:11

Outdated review comment threads:

  • sgl-kernel/csrc/allreduce/custom_all_reduce.cuh (2 threads)
  • sgl-kernel/csrc/gemm/dsv3_fused_a_gemm.cu (1 thread)

@yafengio force-pushed the feat/port-cuda-kernels2 branch from 607c88f to 3dd4964 on January 30, 2026 02:42
@yeahdongcn
Collaborator

/tag-and-rerun-ci

@yafengio yafengio marked this pull request as draft January 30, 2026 13:04
@yeahdongcn
Collaborator

Please also rebase onto upstream/main, as #18035 has been merged. Thanks!

@yafengio force-pushed the feat/port-cuda-kernels2 branch 3 times, most recently from 30bfc6c to ce7e6a4, on February 2, 2026 08:49
@yeahdongcn mentioned this pull request on Feb 6, 2026 (5 tasks)
@yafengio force-pushed the feat/port-cuda-kernels2 branch from 3ccac0f to 4226d4c on April 14, 2026 09:03
@yafengio
Contributor Author

/rerun-failed-ci

@yafengio
Contributor Author

/rerun-failed-ci

@yafengio force-pushed the feat/port-cuda-kernels2 branch from 4226d4c to 43fe4c0 on April 14, 2026 13:52
@yafengio
Contributor Author

/rerun-failed-ci

@yafengio
Contributor Author

/rerun-failed-ci

@yafengio force-pushed the feat/port-cuda-kernels2 branch from d890804 to 7753f29 on April 15, 2026 02:20
@sgl-project sgl-project deleted a comment from yafengio Apr 15, 2026
@yafengio
Contributor Author

/rerun-failed-ci

@yafengio
Contributor Author

/rerun-failed-ci

@yafengio force-pushed the feat/port-cuda-kernels2 branch from 7753f29 to 2470b79 on April 16, 2026 08:44
@yafengio
Contributor Author

/rerun-failed-ci

@yafengio
Contributor Author

/rerun-failed-ci

@yeahdongcn force-pushed the feat/port-cuda-kernels2 branch from 2470b79 to 121ebf8 on April 17, 2026 13:42
@yeahdongcn
Collaborator

/rerun-failed-ci

@yafengio
Contributor Author

/rerun-failed-ci

@yeahdongcn force-pushed the feat/port-cuda-kernels2 branch from 121ebf8 to 0d7bd1d on April 20, 2026 05:18
Signed-off-by: yafeng.li <yafeng.li@mthreads.com>
@yeahdongcn force-pushed the feat/port-cuda-kernels2 branch from 0d7bd1d to 8c7dd1e on April 21, 2026 00:52
@yafengio
Contributor Author

/rerun-failed-ci

@alexnails
Collaborator

Is this just CI Flakiness or what are issues for merging this?

@yeahdongcn
Collaborator

Quoting @alexnails: "Is this just CI Flakiness or what are issues for merging this?"

We've been unable to get all NVIDIA CI checks to pass despite multiple attempts. Could you please help merge this? Thanks!

@yeahdongcn
Collaborator

/rerun-failed-ci

@yafengio
Contributor Author

/rerun-failed-ci

@yafengio
Contributor Author

/rerun-failed-ci

@Kangyan-Zhou Kangyan-Zhou merged commit 74c2e5b into sgl-project:main Apr 24, 2026
334 of 401 checks passed
LucQueen pushed a commit to LucQueen/sglang that referenced this pull request May 12, 2026
…ect#17946)

Signed-off-by: yafeng.li <yafeng.li@mthreads.com>
Co-authored-by: Alex Nails <alex.nails@radixark.ai>

Labels

dependencies, mthreads, quant, run-ci, sgl-kernel

6 participants