
Conversation

@kkHuang-amd (Collaborator) commented on Jun 16, 2025

Motivation

Improve biased_group_topk kernel performance by using the aiter implementation.

Request throughput increased from 1.88 req/s to 1.96 req/s:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 128
Successful requests:                     500
Benchmark duration (s):                  254.53
Total input tokens:                      1600000
Total generated tokens:                  400000
Total generated tokens (retokenized):    398467
Request throughput (req/s):              1.96
Input token throughput (tok/s):          6286.17
Output token throughput (tok/s):         1571.54
Total token throughput (tok/s):          7857.71
Concurrency:                             125.10
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   63684.08
Median E2E Latency (ms):                 64501.79
---------------Time to First Token----------------
Mean TTFT (ms):                          12188.43
Median TTFT (ms):                        12183.59
P99 TTFT (ms):                           22787.10
---------------Inter-Token Latency----------------
Mean ITL (ms):                           64.45
Median ITL (ms):                         52.21
P95 ITL (ms):                            53.72
P99 ITL (ms):                            54.31
Max ITL (ms):                            21970.00
==================================================

Modifications

Checklist

  • [✓] Format your code according to the Code Formatting with Pre-Commit.
  • [✓] Add unit tests as outlined in the Running Unit Tests.
  • [✓] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
  • [✓] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
  • [✓] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
  • [✓] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

@gemini-code-assist (bot, Contributor) left a comment

Summary of Changes

Hello @kkHuang-amd, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request focuses on optimizing the performance of the biased grouped top-k expert selection and fused MoE operations within the DeepseekV2 model, particularly for HIP-based systems. It achieves this by integrating specific kernels from the aiter library and introducing shared buffers for routing results, leading to a reported increase in request throughput from 1.88 req/s to 1.96 req/s.

Highlights

  • Performance Optimization: Introduced support for the aiter.biased_grouped_topk kernel to improve the performance of the biased_group_topk operation, specifically for HIP (AMD) platforms.
  • Aiter Fused MoE Integration: Integrated the aiter.fused_moe kernel for FP8 block quantization within the MoE layer processing when aiter is enabled.
  • Shared Routing Buffers: Implemented a mechanism (AiterTopKRoutingBuffers) to create and share top-k routing result buffers (non_shared_topk_ids, non_shared_topk_weights) across all MoE layers when using the aiter fused MoE implementation, aiming to reduce memory allocation overhead.
  • Token Limit Check: Added an assertion in the model's forward pass to check if the number of tokens exceeds the maximum capacity of the shared aiter routing buffers, providing a clear error message and suggestion to disable SGLANG_USE_AITER if the limit is hit.
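
The shared buffers and the capacity check described in the last two highlights can be pictured with a minimal sketch. The class and attribute names follow the PR description; the tensor shapes, dtypes, and constructor arguments are assumptions, not the actual implementation:

```python
import torch


class AiterTopKRoutingBuffers:
    # Upper bound on tokens the preallocated routing buffers can hold
    # (the PR uses 4096 * 128; see the review discussion below).
    MAX_NUM_TOKENS: int = 4096 * 128

    def __init__(self, top_k: int, device: str = "cuda"):
        # Allocated once and shared by all MoE layers, so each forward pass
        # writes its routing results here instead of allocating new tensors.
        self.non_shared_topk_ids = torch.empty(
            (self.MAX_NUM_TOKENS, top_k), dtype=torch.int32, device=device
        )
        self.non_shared_topk_weights = torch.empty(
            (self.MAX_NUM_TOKENS, top_k), dtype=torch.float32, device=device
        )

    def check_capacity(self, num_tokens: int) -> None:
        # Mirrors the assertion added to the model's forward pass.
        assert num_tokens <= self.MAX_NUM_TOKENS, (
            f"num_tokens {num_tokens} exceeds MAX_NUM_TOKENS {self.MAX_NUM_TOKENS}. "
            "Consider disabling SGLANG_USE_AITER"
        )
```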

@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request introduces an optimization for the biased_group_topk operation by integrating an aiter implementation, primarily targeting HIP environments. This change has demonstrated a performance increase according to the provided benchmarks. The core modifications involve new buffer management (AiterTopKRoutingBuffers) for aiter and updates to the MoE expert selection and fusion logic within fp8.py.

The review identified a few areas for potential improvement:

  • In python/sglang/srt/utils.py, the AiterTopKRoutingBuffers constructor includes unused parameters (n_routed_experts, n_shared_experts) that could be removed.
  • In python/sglang/srt/layers/quantization/fp8.py, a TODO comment regarding activation support seems outdated given the new code, and a large block of commented-out code (related to the previous asm_moe implementation) could be removed for clarity.

Overall, the changes align with the goal of performance enhancement and seem well-structured.

@HaiShaw self-assigned this on Jun 16, 2025
@HaiShaw mentioned this pull request on Jun 19, 2025
@HaiShaw (Collaborator) commented on Jun 19, 2025

@kkHuang-amd please cross-check #7279 for the 2nd enhancement.
cc @valarLip



```python
class AiterTopKRoutingBuffers:
    MAX_NUM_TOKENS: int = 4096 * 128
```
Review comment (Collaborator): Comment on these magic numbers.

```python
assert (
    num_tokens <= AiterTopKRoutingBuffers.MAX_NUM_TOKENS
), f"num_tokens {num_tokens} exceeds MAX_NUM_TOKENS {AiterTopKRoutingBuffers.MAX_NUM_TOKENS}. Consider disabling SGLANG_USE_AITER"
```

Review comment (Collaborator): Should this be a fail-safe rather than an assert when the number of tokens exceeds the size limit?
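
One possible reading of this suggestion is to fall back to the non-aiter top-k path instead of aborting when the batch is too large. This is only a sketch; `aiter_path` and `fallback_path` are hypothetical stand-ins for the existing code paths, not functions in the PR:

```python
import logging

logger = logging.getLogger(__name__)

MAX_NUM_TOKENS = 4096 * 128  # same limit as AiterTopKRoutingBuffers


def topk_with_fallback(num_tokens: int, aiter_path, fallback_path):
    # Fail-safe variant of the assert: if the shared buffers are too small
    # for this batch, warn and use the non-aiter top-k path instead of
    # crashing the whole server.
    if num_tokens > MAX_NUM_TOKENS:
        logger.warning(
            "num_tokens %d exceeds MAX_NUM_TOKENS %d; falling back to the "
            "non-aiter top-k path for this batch",
            num_tokens,
            MAX_NUM_TOKENS,
        )
        return fallback_path()
    return aiter_path()
```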

@HaiShaw marked this pull request as draft on June 23, 2025 at 22:55
```python
    routed_scaling_factor: Optional[float] = None,
) -> torch.Tensor:
    from sglang.srt.layers.moe.fused_moe_triton.fused_moe import fused_experts
    from sglang.srt.layers.moe.topk import select_experts
```

Review comment: Can we move the import of select_experts closer to its usage?
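
A small illustration of the suggestion, with the import placed inside the branch that actually calls select_experts. The surrounding function and its arguments are placeholders for the real fp8.py code, and the keyword arguments passed to select_experts are assumptions about its signature:

```python
def apply_moe(layer, hidden_states, router_logits, use_aiter_path: bool):
    if use_aiter_path:
        # aiter fused path: does not need select_experts at all,
        # so it should not pay for the import.
        ...
    else:
        # Import next to the only call site, as suggested in the review.
        from sglang.srt.layers.moe.topk import select_experts

        topk_weights, topk_ids = select_experts(
            hidden_states=hidden_states,
            router_logits=router_logits,
            top_k=layer.top_k,
            renormalize=True,
        )
        ...
```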


```diff
 if _is_hip:
-    from aiter import ActivationType, QuantType
+    from aiter import ActivationType, QuantType, biased_grouped_topk
```

Review comment: Can we move the import of biased_grouped_topk closer to its usage?

```python
if (
    _use_aiter
    and correction_bias is not None
    and hasattr(layer, "non_shared_topk_weights")
```
@amdosoldea commented on Aug 12, 2025:

Nit: if the hasattr check incurs a time penalty, a caching mechanism could help here; repeating the check on every iteration can cost some performance.
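
A sketch of the caching idea: evaluate the condition once when the layer is set up and store the result as a flag, so the hot path does a single attribute read instead of repeating hasattr on every forward call. The flag and helper names are hypothetical:

```python
def init_aiter_biased_topk_flag(layer, use_aiter: bool, correction_bias) -> None:
    # Computed once (e.g. after weight loading) and cached on the layer.
    layer.use_aiter_biased_topk = (
        use_aiter
        and correction_bias is not None
        and hasattr(layer, "non_shared_topk_weights")
    )


# Hot path: the per-iteration check becomes a single attribute read.
# if layer.use_aiter_biased_topk:
#     ...  # aiter biased_grouped_topk path
# else:
#     ...  # existing fallback path
```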

```python
from sglang.srt.layers.moe.topk import select_experts

# Expert selection
topk_weights, topk_ids = select_experts(
```

Review comment: select_experts already calls biased_grouped_topk internally. Could you try changing the parameters so that a separate, external call to the function is not needed?
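
One way to act on this is to pass the grouped-top-k inputs through select_experts itself, so that no separate biased_grouped_topk call is needed at this call site. The keyword arguments below are assumptions about the select_experts signature at the time of this PR, not a verified API:

```python
from sglang.srt.layers.moe.topk import select_experts


def route_tokens(hidden_states, router_logits, layer):
    # Sketch only: letting select_experts handle the biased grouped top-k
    # (via correction_bias and the group parameters) removes the extra
    # external call the reviewer points out.
    return select_experts(
        hidden_states=hidden_states,
        router_logits=router_logits,
        top_k=layer.top_k,
        use_grouped_topk=True,
        renormalize=layer.renormalize,
        topk_group=layer.topk_group,
        num_expert_group=layer.num_expert_group,
        correction_bias=layer.correction_bias,
    )
```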
