
[AMD] Support Qwen3-Coder-Next on AMD platform #18355

Merged
HaiShaw merged 6 commits into sgl-project:main from yichiche:enable-qwen3-coder-next-hip on Feb 25, 2026

Conversation

@yichiche (Collaborator) commented Feb 6, 2026

Motivation

Enable the Qwen3-Coder-Next model on the AMD GPU platform. With this PR, both non-MTP (with fp8 KV cache) and MTP are supported on Qwen3-Coder-Next.

Modifications

  • aiter_backend.py:
    • Handle v_head_dim correctly for MLA and hybrid linear models. Previously, v_head_dim was retrieved directly from token_to_kv_pool.get_value_buffer(0), which fails for models where layer 0 may not be a full attention layer. Now properly handles MLA models (using model config), hybrid linear models (using get_v_head_dim()), and standard models.
    • Enable MTP with the Triton backend; aiter MTP for the non-MLA decode kernel will be supported in the future.
  • qwen3_next.py: Disable dual-stream on AMD platform
  • hybrid_linear_attn_backend.py: Make CuTe DSL GDN import optional and raise an explicit error only when CuTe DSL decode is enabled without required dependency.
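The v_head_dim handling in the first bullet can be sketched as follows. This is a hedged illustration, not the actual patch: the accessor names (`get_v_head_dim`, `get_value_buffer`) follow the PR description, but the exact guard used to detect an MLA model and the attribute layout on `model_runner` are assumptions.

```python
# Hypothetical sketch of the v_head_dim selection described above.
# Attribute names (model_config.attention_arch, token_to_kv_pool) are assumed.
def resolve_v_head_dim(model_runner):
    if getattr(model_runner.model_config, "attention_arch", None) == "MLA":
        # MLA models: layer 0 may not be a full attention layer, so read the
        # value head dimension from the model config instead of a KV buffer.
        return model_runner.model_config.v_head_dim
    pool = model_runner.token_to_kv_pool
    if hasattr(pool, "get_v_head_dim"):
        # Hybrid linear-attention models expose a dedicated accessor that
        # does not depend on layer 0 being a full attention layer.
        return pool.get_v_head_dim()
    # Standard models: fall back to the original buffer-shape lookup.
    return pool.get_value_buffer(0).shape[-1]
```

The key point is ordering: the buffer-shape lookup, which fails when layer 0 is a linear-attention layer, is only reached for standard models.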

Accuracy Tests

MODEL="/data/Qwen/Qwen3-Coder-Next/"
python3 -m sglang.launch_server \
 --model-path "${MODEL}" \
 --tensor-parallel-size 8 \
 --trust-remote-code \
 --chunked-prefill-size 131072 \
 --host 0.0.0.0 \
 --port 9000 \
 --log-requests \
 --disable-radix-cache \
 --mem-fraction-static 0.8 \
 --attention-backend aiter 

Accuracy: 0.944
Invalid: 0.000
Latency: 55.824 s
Output throughput: 3066.797 token/s

Benchmarking and Profiling

Env: MI355 * 8

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 1         
Successful requests:                     8         
Benchmark duration (s):                  23.50     
Total input tokens:                      560000    
Total input text tokens:                 560000    
Total generated tokens:                  1600      
Total generated tokens (retokenized):    1600      
Request throughput (req/s):              0.34      
Input token throughput (tok/s):          23834.71  
Output token throughput (tok/s):         68.10     
Peak output token throughput (tok/s):    95.00     
Peak concurrent requests:                2         
Total token throughput (tok/s):          23902.81  
Concurrency:                             1.00      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2934.06   
Median E2E Latency (ms):                 2929.72   
P90 E2E Latency (ms):                    2972.40   
P99 E2E Latency (ms):                    2974.58   
---------------Time to First Token----------------
Mean TTFT (ms):                          842.12    
Median TTFT (ms):                        838.34    
P99 TTFT (ms):                           882.41    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.51     
Median TPOT (ms):                        10.51     
P99 TPOT (ms):                           10.52     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           10.51     
Median ITL (ms):                         10.51     
P95 ITL (ms):                            10.58     
P99 ITL (ms):                            10.74     
Max ITL (ms):                            11.94     
==================================================

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist bot commented

Summary of Changes

Hello @yichiche, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces critical updates to enable and optimize the Qwen3-Coder-Next model's performance and compatibility on AMD platforms. The changes focus on improving the robustness of attention mechanism configurations by correctly handling model-specific parameters and adapting dual-stream behavior for AMD GPUs. Additionally, it enhances dependency management for optional performance optimizations, leading to a more stable and compatible experience for users deploying Qwen3-Coder-Next on AMD hardware.

Highlights

  • Enhanced v_head_dim Calculation: The logic for determining v_head_dim in aiter_backend.py has been refined to correctly handle MLA (Multi-Layer Attention) and hybrid linear models, preventing issues where layer 0 might not be a full attention layer.
  • AMD Platform Dual-Stream Disablement: Dual-stream functionality has been explicitly disabled for AMD platforms in qwen3_next.py to ensure compatibility and stability.
  • Optional CuTe DSL GDN Import: The import of CuTe DSL GDN in hybrid_linear_attn_backend.py is now optional, with an explicit ImportError raised only when SGLANG_USE_CUTEDSL_GDN_DECODE is enabled but the required cutlass dependency is missing.


Changelog
  • python/sglang/srt/layers/attention/aiter_backend.py
    • Removed the direct initialization of self.v_head_dim from token_to_kv_pool.get_value_buffer(0).shape[-1].
    • Implemented conditional logic to set self.v_head_dim based on the model type: using model_runner.model_config.v_head_dim for MLA models, model_runner.token_to_kv_pool.get_v_head_dim() for hybrid linear models, and falling back to the original method for standard models.
  • python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py
    • Initialized cutedsl_fused_sigmoid_gating_delta_rule_update to None and introduced _cutedsl_import_error to track potential import failures.
    • Wrapped the import of cutedsl_fused_sigmoid_gating_delta_rule_update in a try-except block, making it an optional dependency.
    • Added an ImportError check within the __init__ method to explicitly notify users if SGLANG_USE_CUTEDSL_GDN_DECODE is enabled but the cutlass dependency is not found.
  • python/sglang/srt/models/qwen3_next.py
    • Reordered the conditions within the if statement in the _forward_input_proj method, moving seq_len < DUAL_STREAM_TOKEN_THRESHOLD to the end. This change, as described in the PR, effectively disables dual-stream functionality on AMD platforms.
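A minimal sketch of the condition reordering in the last changelog item, using hypothetical guard names (`is_cuda`, `dual_stream_enabled`) and a made-up threshold value. The point is that the platform guard is evaluated before the token-count check, so on AMD the condition short-circuits and dual-stream never activates.

```python
# Illustrative only: names and threshold are assumptions, not the real code.
DUAL_STREAM_TOKEN_THRESHOLD = 1024  # hypothetical value


def use_dual_stream(is_cuda: bool, dual_stream_enabled: bool, seq_len: int) -> bool:
    # Platform/config guards first; the seq_len comparison is moved to the
    # end, so non-CUDA (AMD) platforms short-circuit before reaching it.
    return is_cuda and dual_stream_enabled and seq_len < DUAL_STREAM_TOKEN_THRESHOLD
```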
Activity
  • The pull request was created by yichiche.
  • No human activity (comments, reviews, or progress updates) has been recorded yet.

@gemini-code-assist bot left a comment
Code Review

This pull request introduces support for Qwen3-Coder-Next on the AMD platform. The changes are well-structured and improve the codebase's robustness and modularity. Key modifications include a more reliable method for determining v_head_dim in aiter_backend.py to accommodate various model architectures, and making the CuTe DSL dependency optional in hybrid_linear_attn_backend.py with clear error handling. I have one suggestion to make the exception handling more specific, which will improve maintainability.

@yichiche yichiche marked this pull request as draft February 6, 2026 08:31
@yichiche yichiche force-pushed the enable-qwen3-coder-next-hip branch from 1cf0da3 to 6646948 on February 10, 2026 03:05
@yichiche yichiche marked this pull request as ready for review February 10, 2026 03:14
@gemini-code-assist bot commented

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@1am9trash (Collaborator) commented

I see the test uses --chunked-prefill-size 32768 (typically use 131072). Does Qwen3-coder-next tend to perform better with chunk prefill?

@yichiche (Collaborator, Author) commented

> I see the test uses --chunked-prefill-size 32768 (typically use 131072). Does Qwen3-coder-next tend to perform better with chunk prefill?

With the change from --chunked-prefill-size 32768 to 131072, mean TTFT improves from 887.24 ms to 838.34 ms (about a 6% improvement).

@HaiShaw HaiShaw self-assigned this Feb 12, 2026
- aiter_backend.py: Handle v_head_dim correctly for MLA and hybrid
  linear models. Previously, v_head_dim was retrieved directly from
  token_to_kv_pool.get_value_buffer(0), which fails for models where
  layer 0 may not be a full attention layer. Now properly handles
  MLA models (using model config), hybrid linear models (using
  get_v_head_dim()), and standard models.

- qwen3_next.py: Use is_cuda_alike() instead of is_cuda() to enable
  CUDA stream creation on both NVIDIA CUDA and AMD ROCm/HIP devices.
@yichiche yichiche force-pushed the enable-qwen3-coder-next-hip branch from 482a922 to 166541f on February 23, 2026 04:56
@yichiche yichiche requested a review from HaiShaw as a code owner February 23, 2026 04:56
@yichiche (Collaborator, Author) commented

Resolved conflicts and rebased.

@HaiShaw HaiShaw merged commit b2c46fc into sgl-project:main Feb 25, 2026
87 of 102 checks passed
@hubertlu-tw (Collaborator) commented

@yichiche do we currently have test coverage for this model or this model arch in our CI?

klhhhhh pushed a commit to klhhhhh/sglang that referenced this pull request Feb 26, 2026
Co-authored-by: yichiche@amd.com <jacky.cheng>
@yichiche (Collaborator, Author) commented Feb 26, 2026

> @yichiche do we currently have test coverage for this model or this model arch in our CI?

@hubertlu-tw yes, this is in another PR: #18608

magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Co-authored-by: yichiche@amd.com <jacky.cheng>
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
Co-authored-by: yichiche@amd.com <jacky.cheng>