
[AMD] Integrate aiter's fused_topk for softmax scoring in topk function #21421

Merged
HaiShaw merged 2 commits into sgl-project:main from zhentaocc:aiter_fused_topk on Mar 26, 2026
Conversation

@zhentaocc
Contributor

@zhentaocc zhentaocc commented Mar 25, 2026

Motivation

Enable an AIter-backed path on ROCm/HIP that fuses softmax and top-k selection for MoE TopK routing.

Modifications

When aiter is enabled, use aiter.fused_topk by default; fall back to the existing topk_softmax if aiter is not available.
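
For reference, the fused operation computes, per token, a softmax over the expert logits followed by a top-k selection (and typically a renormalization of the kept weights). Below is a minimal pure-Python sketch of those semantics; it is illustrative only — the real kernels run fused on device, and the names here are not the actual aiter or sgl-kernel API:

```python
import math

def topk_softmax_ref(gating_logits, topk, renormalize=True):
    """Pure-Python reference for fused softmax+topk MoE routing.

    Illustrative only: `topk_softmax_ref` is a hypothetical name, not
    the aiter/sgl-kernel API. One row of expert logits per token.
    """
    out_weights, out_ids = [], []
    for logits in gating_logits:
        m = max(logits)                                # numerically stable softmax
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        ids = ranked[:topk]                            # top-k expert indices
        weights = [probs[i] for i in ids]
        if renormalize:                                # kept weights sum to 1
            s = sum(weights)
            weights = [w / s for w in weights]
        out_weights.append(weights)
        out_ids.append(ids)
    return out_weights, out_ids

weights, ids = topk_softmax_ref([[1.0, 3.0, 2.0, 0.5]], topk=2)
print(ids[0])  # -> [1, 2]
```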

Accuracy Tests

Before

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.9719 | ± 0.0045 |
|       |         | strict-match     | 5      | exact_match | 0.9727 | ± 0.0045 |

After

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.9712 | ± 0.0046 |
|       |         | strict-match     | 5      | exact_match | 0.9719 | ± 0.0045 |

Comparison of different implementations of topk_softmax.

Baseline: sgl-kernel.

SUMMARY — deepseek-ai/DeepSeek-V3 (E=[256], topk=[8])

bs seq dtype token expert topk torch_us sgl_us sgl_err aiter_us aiter_err asm_us asm_err aiter_vs_sgl asm_vs_sgl
1 1 bfloat16 1 256 8 21.39 7.67 0 5.88 0 2.33 0 1.30x 3.29x
4 1 bfloat16 4 256 8 24.91 7.58 0.375 5.64 0 2.45 0.375 1.34x 3.09x
8 1 bfloat16 8 256 8 28.83 7.52 0.515625 5.69 0 2.97 0.515625 1.32x 2.53x
16 1 bfloat16 16 256 8 29.95 7.86 0.734375 5.72 0 3.43 0.734375 1.37x 2.29x
32 1 bfloat16 32 256 8 30.99 8.31 0.742188 6.6 0 4.06 0.742188 1.26x 2.05x
64 1 bfloat16 64 256 8 30.75 9.09 0.765625 7.04 0 4.09 0.765625 1.29x 2.22x
128 1 bfloat16 128 256 8 30.75 9.16 0.78125 7.39 0 4.08 0.78125 1.24x 2.25x
1 1024 bfloat16 1024 256 8 13.84 9.33 0.816772 8.34 0 4.17 0.816772 1.12x 2.24x
4 1024 bfloat16 4096 256 8 41.99 10.24 0.811737 8.53 0 5.03 0.811737 1.20x 2.04x
8 1024 bfloat16 8192 256 8 55.75 12.41 0.811554 9.31 0 7.14 0.811554 1.33x 1.74x
16 1024 bfloat16 16384 256 8 77.67 17.26 0.812881 13.35 0 11.26 0.812881 1.29x 1.53x
32 1024 bfloat16 32768 256 8 127.07 28.82 0.810478 19.77 0 16.62 0.810478 1.46x 1.73x
64 1024 bfloat16 65536 256 8 228.15 51.36 0.810671 34.13 0 26.63 0.810671 1.50x 1.93x
128 1024 bfloat16 131072 256 8 439.72 94.25 0.810811 61.98 0 48.69 0.810811 1.52x 1.94x
32 8192 bfloat16 262144 256 8 837.24 174.58 0.81068 114.1 0 89.31 0.810679 1.53x 1.95x
64 8192 bfloat16 524288 256 8 1676.04 347.87 0.810576 222.96 0 173.43 0.810576 1.56x 2.01x
128 8192 bfloat16 1048576 256 8 3248.94 667.91 0.8105 431.46 0 335.44 0.8105 1.55x 1.99x

SUMMARY — Qwen/Qwen3.5-397B-A17B (E=[512], topk=[10])

bs seq dtype token expert topk torch_us sgl_us sgl_err aiter_us aiter_err asm_us asm_err aiter_vs_sgl
1 1 bfloat16 1 512 10 33.47 9.66 0 7.38 0 N/A N/A 1.31x
4 1 bfloat16 4 512 10 36.39 10.6 0.35 7.28 0 N/A N/A 1.46x
8 1 bfloat16 8 512 10 50.03 11.7 0.2875 7.62 0 N/A N/A 1.54x
16 1 bfloat16 16 512 10 50.71 11.61 0.6375 8.12 0 N/A N/A 1.43x
32 1 bfloat16 32 512 10 49.91 11.7 0.675 8.85 0 N/A N/A 1.32x
64 1 bfloat16 64 512 10 49.75 11.89 0.684375 8.98 0 N/A N/A 1.32x
128 1 bfloat16 128 512 10 50.27 11.91 0.696875 9.05 0 N/A N/A 1.32x
1 1024 bfloat16 1024 512 10 85.83 13.45 0.739746 9.21 0 N/A N/A 1.46x
4 1024 bfloat16 4096 512 10 99.38 32.04 0.754248 11.02 0 N/A N/A 2.91x
8 1024 bfloat16 8192 512 10 139.55 57.02 0.754077 16.97 0 N/A N/A 3.36x
16 1024 bfloat16 16384 512 10 229.74 107.24 0.753302 24.17 0 N/A N/A 4.44x
32 1024 bfloat16 32768 512 10 409.07 200.6 0.751654 43.4 0 N/A N/A 4.62x
64 1024 bfloat16 65536 512 10 882.47 386.09 0.753418 79.14 0 N/A N/A 4.88x
128 1024 bfloat16 131072 512 10 1730.6 855.39 0.753337 146.01 0 N/A N/A 5.86x
32 8192 bfloat16 262144 512 10 3400.17 1719.66 0.752707 284.17 0 N/A N/A 6.05x
64 8192 bfloat16 524288 512 10 6787.97 3436.18 0.753321 553.39 0 N/A N/A 6.21x
128 8192 bfloat16 1048576 512 10 13496.4 6896.04 0.753394 1096.23 0 N/A N/A 6.29x
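
A microbenchmark loop of the general shape used for tables like the above can be sketched as follows. This is a hedged host-side illustration only: the per-kernel microsecond figures above were presumably collected with device-side (HIP event) timing, which a wall-clock timer does not replicate, and `bench_us` is a hypothetical helper name:

```python
import time

def bench_us(fn, *args, warmup=10, iters=200):
    """Mean wall-clock microseconds per call of fn(*args).

    Illustrative only: real GPU kernel timing should use device events
    and synchronization, not host-side perf_counter.
    """
    for _ in range(warmup):           # warm up caches before timing
        fn(*args)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) / iters * 1e6

# Example with a stand-in CPU workload instead of a GPU kernel.
us = bench_us(sorted, list(range(4096)))
print(f"{us:.2f} us/call")
```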

Benchmark

bs=64, 1K input / 1K output
Before

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     640       
Benchmark duration (s):                  244.93    
Total input tokens:                      590851    
Total input text tokens:                 590851    
Total generated tokens:                  589052    
Total generated tokens (retokenized):    587513    
Request throughput (req/s):              2.61      
Input token throughput (tok/s):          2412.29   
Output token throughput (tok/s):         2404.94   
Peak output token throughput (tok/s):    3094.00   
Peak concurrent requests:                75        
Total token throughput (tok/s):          4817.23   
Concurrency:                             62.48     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   23912.46  
Median E2E Latency (ms):                 23816.68  
P90 E2E Latency (ms):                    26818.09  
P99 E2E Latency (ms):                    28128.11  
---------------Time to First Token----------------
Mean TTFT (ms):                          186.30    
Median TTFT (ms):                        85.04     
P99 TTFT (ms):                           1469.97   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          25.79     
Median TPOT (ms):                        26.25     
P99 TPOT (ms):                           27.79     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           25.81     
Median ITL (ms):                         21.16     
P95 ITL (ms):                            91.78     
P99 ITL (ms):                            134.56    
Max ITL (ms):                            1359.95   
==================================================

After

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     640       
Benchmark duration (s):                  240.31    
Total input tokens:                      590851    
Total input text tokens:                 590851    
Total generated tokens:                  589052    
Total generated tokens (retokenized):    587251    
Request throughput (req/s):              2.66      
Input token throughput (tok/s):          2458.69   
Output token throughput (tok/s):         2451.20   
Peak output token throughput (tok/s):    3136.00   
Peak concurrent requests:                76        
Total token throughput (tok/s):          4909.89   
Concurrency:                             62.48     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   23460.42  
Median E2E Latency (ms):                 23373.80  
P90 E2E Latency (ms):                    26262.06  
P99 E2E Latency (ms):                    27411.67  
---------------Time to First Token----------------
Mean TTFT (ms):                          173.04    
Median TTFT (ms):                        83.74     
P99 TTFT (ms):                           1305.24   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          25.31     
Median TPOT (ms):                        25.79     
P99 TPOT (ms):                           26.88     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           25.33     
Median ITL (ms):                         20.96     
P95 ITL (ms):                            88.58     
P99 ITL (ms):                            100.73    
Max ITL (ms):                            1203.00   
==================================================

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request aims to significantly improve the performance of Mixture-of-Experts (MoE) TopK operations, especially on ROCm/HIP platforms, by leveraging the aiter library's fused softmax+topk kernels. The changes introduce conditional logic to utilize aiter's optimized functions when available, providing an auto-dispatch mechanism for efficient computation, while maintaining a graceful fallback for environments where aiter is not enabled.

Highlights

  • AIter Integration for Softmax TopK: Integrated aiter's fused_topk for softmax scoring within the fused_topk function, enabling enhanced performance with auto-dispatch capabilities, particularly for MoE TopK on ROCm/HIP.
  • Fallback Mechanism: Implemented a fallback to the existing topk_softmax function if aiter is not available, ensuring compatibility and robustness.




@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request integrates aiter's fused_topk functionality into the fused_topk function for softmax scoring when _use_aiter is enabled. The changes involve adding a new import for topk_softmax and conditionally using aiter.fused_moe.fused_topk. Feedback suggests that the newly added topk_softmax import is unused and should be removed. Additionally, the aiter.fused_moe.fused_topk import, currently located within the fused_topk function, should be moved to the top-level try-except block for better code organization and centralization of aiter imports.

Comment thread python/sglang/srt/layers/moe/topk.py Outdated
Comment thread python/sglang/srt/layers/moe/topk.py Outdated
@zhentaocc zhentaocc changed the title Integrate aiter's fused_topk for softmax scoring in topk function, en… [AMD]Integrate aiter's fused_topk for softmax scoring in topk function Mar 25, 2026
Collaborator

@yichiche yichiche left a comment


LGTM.

Chen, Todd added 2 commits March 26, 2026 02:13
…hancing performance with auto-dispatch capabilities. Fall back to topk_softmax if aiter is not available.
…softmax scoring implementation, ensuring compatibility with aiter's features.
@HaiShaw HaiShaw merged commit fd53594 into sgl-project:main Mar 26, 2026
43 of 68 checks passed
satyamk7054 pushed a commit to satyamk7054/sglang that referenced this pull request Apr 3, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
