
Conversation

@kkHuang-amd (Collaborator) commented on Jun 16, 2025

Motivation

Improve biased_group_topk kernel performance by using the aiter implementation.

Request throughput increased from 1.88 req/s to 1.96 req/s:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 128
Successful requests:                     500
Benchmark duration (s):                  254.53
Total input tokens:                      1600000
Total generated tokens:                  400000
Total generated tokens (retokenized):    398467
Request throughput (req/s):              1.96
Input token throughput (tok/s):          6286.17
Output token throughput (tok/s):         1571.54
Total token throughput (tok/s):          7857.71
Concurrency:                             125.10
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   63684.08
Median E2E Latency (ms):                 64501.79
---------------Time to First Token----------------
Mean TTFT (ms):                          12188.43
Median TTFT (ms):                        12183.59
P99 TTFT (ms):                           22787.10
---------------Inter-Token Latency----------------
Mean ITL (ms):                           64.45
Median ITL (ms):                         52.21
P95 ITL (ms):                            53.72
P99 ITL (ms):                            54.31
Max ITL (ms):                            21970.00
==================================================

Modifications

Checklist

  • [✓] Format your code according to the Code Formatting with Pre-Commit.
  • [✓] Add unit tests as outlined in the Running Unit Tests.
  • [✓] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
  • [✓] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
  • [✓] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
  • [✓] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

@gemini-code-assist (bot, Contributor) left a comment

Summary of Changes

Hello @kkHuang-amd, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request focuses on optimizing the performance of the biased grouped top-k expert selection and fused MoE operations within the DeepseekV2 model, particularly for HIP-based systems. It achieves this by integrating specific kernels from the aiter library and introducing shared buffers for routing results, leading to a reported increase in request throughput from 1.88 req/s to 1.96 req/s.

Highlights

  • Performance Optimization: Introduced support for the aiter.biased_grouped_topk kernel to improve the performance of the biased_group_topk operation, specifically for HIP (AMD) platforms.
  • Aiter Fused MoE Integration: Integrated the aiter.fused_moe kernel for FP8 block quantization within the MoE layer processing when aiter is enabled.
  • Shared Routing Buffers: Implemented a mechanism (AiterTopKRoutingBuffers) to create and share top-k routing result buffers (non_shared_topk_ids, non_shared_topk_weights) across all MoE layers when using the aiter fused MoE implementation, aiming to reduce memory allocation overhead.
  • Token Limit Check: Added an assertion in the model's forward pass to check if the number of tokens exceeds the maximum capacity of the shared aiter routing buffers, providing a clear error message and suggestion to disable SGLANG_USE_AITER if the limit is hit.
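
The shared buffers and the capacity check described in the last two highlights can be pictured with a minimal sketch. The class and attribute names follow the PR description; the tensor shapes, dtypes, and constructor arguments are assumptions, not the actual implementation:

```python
import torch


class AiterTopKRoutingBuffers:
    # Upper bound on tokens the preallocated routing buffers can hold
    # (the PR uses 4096 * 128; see the review discussion below).
    MAX_NUM_TOKENS: int = 4096 * 128

    def __init__(self, top_k: int, device: str = "cuda"):
        # Allocated once and shared by all MoE layers, so each forward pass
        # writes its routing results here instead of allocating new tensors.
        self.non_shared_topk_ids = torch.empty(
            (self.MAX_NUM_TOKENS, top_k), dtype=torch.int32, device=device
        )
        self.non_shared_topk_weights = torch.empty(
            (self.MAX_NUM_TOKENS, top_k), dtype=torch.float32, device=device
        )

    def check_capacity(self, num_tokens: int) -> None:
        # Mirrors the assertion added to the model's forward pass.
        assert num_tokens <= self.MAX_NUM_TOKENS, (
            f"num_tokens {num_tokens} exceeds MAX_NUM_TOKENS {self.MAX_NUM_TOKENS}. "
            "Consider disabling SGLANG_USE_AITER"
        )
```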

@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request introduces an optimization for the biased_group_topk operation by integrating an aiter implementation, primarily targeting HIP environments. This change has demonstrated a performance increase according to the provided benchmarks. The core modifications involve new buffer management (AiterTopKRoutingBuffers) for aiter and updates to the MoE expert selection and fusion logic within fp8.py.

The review identified a few areas for potential improvement:

  • In python/sglang/srt/utils.py, the AiterTopKRoutingBuffers constructor includes unused parameters (n_routed_experts, n_shared_experts) that could be removed.
  • In python/sglang/srt/layers/quantization/fp8.py, a TODO comment regarding activation support seems outdated given the new code, and a large block of commented-out code (related to the previous asm_moe implementation) could be removed for clarity.

Overall, the changes align with the goal of performance enhancement and seem well-structured.

@HaiShaw self-assigned this on Jun 16, 2025
@HaiShaw mentioned this pull request on Jun 19, 2025
@HaiShaw (Collaborator) commented on Jun 19, 2025

@kkHuang-amd please cross-check #7279 for the 2nd enhancement.
cc @valarLip



```python
class AiterTopKRoutingBuffers:
    MAX_NUM_TOKENS: int = 4096 * 128
```
Review comment (Collaborator): Comment on these magic numbers.

```python
assert (
    num_tokens <= AiterTopKRoutingBuffers.MAX_NUM_TOKENS
), f"num_tokens {num_tokens} exceeds MAX_NUM_TOKENS {AiterTopKRoutingBuffers.MAX_NUM_TOKENS}. Consider disabling SGLANG_USE_AITER"
```

Review comment (Collaborator): Should this be a fail-safe rather than an assert when the number of tokens exceeds the size limit?
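
One possible reading of this suggestion is to fall back to the non-aiter top-k path instead of aborting when the batch is too large. This is only a sketch; `aiter_path` and `fallback_path` are hypothetical stand-ins for the existing code paths, not functions in the PR:

```python
import logging

logger = logging.getLogger(__name__)

MAX_NUM_TOKENS = 4096 * 128  # same limit as AiterTopKRoutingBuffers


def topk_with_fallback(num_tokens: int, aiter_path, fallback_path):
    # Fail-safe variant of the assert: if the shared buffers are too small
    # for this batch, warn and use the non-aiter top-k path instead of
    # crashing the whole server.
    if num_tokens > MAX_NUM_TOKENS:
        logger.warning(
            "num_tokens %d exceeds MAX_NUM_TOKENS %d; falling back to the "
            "non-aiter top-k path for this batch",
            num_tokens,
            MAX_NUM_TOKENS,
        )
        return fallback_path()
    return aiter_path()
```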

@HaiShaw marked this pull request as draft on June 23, 2025 at 22:55
```python
    routed_scaling_factor: Optional[float] = None,
) -> torch.Tensor:
    from sglang.srt.layers.moe.fused_moe_triton.fused_moe import fused_experts
    from sglang.srt.layers.moe.topk import select_experts
```

Review comment: Can we move the import of select_experts closer to its usage?
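
A small illustration of the suggestion, with the import placed inside the branch that actually calls select_experts. The surrounding function and its arguments are placeholders for the real fp8.py code, and the keyword arguments passed to select_experts are assumptions about its signature:

```python
def apply_moe(layer, hidden_states, router_logits, use_aiter_path: bool):
    if use_aiter_path:
        # aiter fused path: does not need select_experts at all,
        # so it should not pay for the import.
        ...
    else:
        # Import next to the only call site, as suggested in the review.
        from sglang.srt.layers.moe.topk import select_experts

        topk_weights, topk_ids = select_experts(
            hidden_states=hidden_states,
            router_logits=router_logits,
            top_k=layer.top_k,
            renormalize=True,
        )
        ...
```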


```diff
 if _is_hip:
-    from aiter import ActivationType, QuantType
+    from aiter import ActivationType, QuantType, biased_grouped_topk
```

Review comment: Can we move the import of biased_grouped_topk closer to its usage?

```python
if (
    _use_aiter
    and correction_bias is not None
    and hasattr(layer, "non_shared_topk_weights")
```
@amdosoldea commented on Aug 12, 2025:

Nit: if the hasattr check incurs a time penalty, a caching mechanism could help here; repeating the check on every iteration can cost some performance.
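
A sketch of the caching idea: evaluate the condition once when the layer is set up and store the result as a flag, so the hot path does a single attribute read instead of repeating hasattr on every forward call. The flag and helper names are hypothetical:

```python
def init_aiter_biased_topk_flag(layer, use_aiter: bool, correction_bias) -> None:
    # Computed once (e.g. after weight loading) and cached on the layer.
    layer.use_aiter_biased_topk = (
        use_aiter
        and correction_bias is not None
        and hasattr(layer, "non_shared_topk_weights")
    )


# Hot path: the per-iteration check becomes a single attribute read.
# if layer.use_aiter_biased_topk:
#     ...  # aiter biased_grouped_topk path
# else:
#     ...  # existing fallback path
```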

```python
from sglang.srt.layers.moe.topk import select_experts

# Expert selection
topk_weights, topk_ids = select_experts(
```

Review comment: select_experts already calls biased_grouped_topk internally. Could you try changing the parameters so that a separate, external call to the function is not needed?
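
One way to act on this is to pass the grouped-top-k inputs through select_experts itself, so that no separate biased_grouped_topk call is needed at this call site. The keyword arguments below are assumptions about the select_experts signature at the time of this PR, not a verified API:

```python
from sglang.srt.layers.moe.topk import select_experts


def route_tokens(hidden_states, router_logits, layer):
    # Sketch only: letting select_experts handle the biased grouped top-k
    # (via correction_bias and the group parameters) removes the extra
    # external call the reviewer points out.
    return select_experts(
        hidden_states=hidden_states,
        router_logits=router_logits,
        top_k=layer.top_k,
        use_grouped_topk=True,
        renormalize=layer.renormalize,
        topk_group=layer.topk_group,
        num_expert_group=layer.num_expert_group,
        correction_bias=layer.correction_bias,
    )
```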
