[Opt] add biased_group_topk aiter support #7240
base: main
Conversation
Summary of Changes
Hello @kkHuang-amd, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request focuses on optimizing the performance of the biased grouped top-k expert selection and fused MoE operations within the DeepseekV2 model, particularly for HIP-based systems. It achieves this by integrating specific kernels from the aiter library and introducing shared buffers for routing results, leading to a reported increase in request throughput from 1.88 req/s to 1.96 req/s.
Highlights
- Performance Optimization: Introduced support for the `aiter.biased_grouped_topk` kernel to improve the performance of the `biased_group_topk` operation, specifically for HIP (AMD) platforms.
- Aiter Fused MoE Integration: Integrated the `aiter.fused_moe` kernel for FP8 block quantization within the MoE layer processing when `aiter` is enabled.
- Shared Routing Buffers: Implemented a mechanism (`AiterTopKRoutingBuffers`) to create and share top-k routing result buffers (`non_shared_topk_ids`, `non_shared_topk_weights`) across all MoE layers when using the `aiter` fused MoE implementation, aiming to reduce memory allocation overhead (see the sketch after this list).
- Token Limit Check: Added an assertion in the model's forward pass to check whether the number of tokens exceeds the maximum capacity of the shared `aiter` routing buffers, providing a clear error message and a suggestion to disable `SGLANG_USE_AITER` if the limit is hit.
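The buffer-sharing mechanism described in the Shared Routing Buffers item could look roughly like the minimal sketch below. This is not the PR's actual code: the constructor arguments, shapes, and dtypes are assumptions for illustration; only the class name, the buffer names, and the `MAX_NUM_TOKENS` value come from the change itself.

```python
import torch


class AiterTopKRoutingBuffers:
    """Sketch of a shared top-k routing buffer holder (illustrative only).

    The buffers are allocated once and reused by every MoE layer, avoiding a
    per-layer, per-step allocation of the top-k ids/weights tensors.
    """

    # Value taken from the PR; how the cap was derived is not stated there.
    MAX_NUM_TOKENS: int = 4096 * 128

    def __init__(self, top_k: int, device: str = "cuda"):
        # Pre-allocated once; each MoE layer writes its routing results here
        # and slices the buffers to the current number of tokens.
        self.non_shared_topk_ids = torch.empty(
            (self.MAX_NUM_TOKENS, top_k), dtype=torch.int32, device=device
        )
        self.non_shared_topk_weights = torch.empty(
            (self.MAX_NUM_TOKENS, top_k), dtype=torch.float32, device=device
        )
```

A forward pass would then work on views such as `buffers.non_shared_topk_ids[:num_tokens]` instead of allocating fresh tensors every step.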
Code Review
This pull request introduces an optimization for the biased_group_topk operation by integrating an aiter implementation, primarily targeting HIP environments. This change has demonstrated a performance increase according to the provided benchmarks. The core modifications involve new buffer management (AiterTopKRoutingBuffers) for aiter and updates to the MoE expert selection and fusion logic within fp8.py.
The review identified a few areas for potential improvement:
- In `python/sglang/srt/utils.py`, the `AiterTopKRoutingBuffers` constructor includes unused parameters (`n_routed_experts`, `n_shared_experts`) that could be removed.
- In `python/sglang/srt/layers/quantization/fp8.py`, a TODO comment regarding activation support seems outdated given the new code, and a large block of commented-out code (related to the previous `asm_moe` implementation) could be removed for clarity.
Overall, the changes align with the goal of performance enhancement and seem well-structured.
@kkHuang-amd please cross-check #7279 for the 2nd enhancement.
```python
class AiterTopKRoutingBuffers:
    MAX_NUM_TOKENS: int = 4096 * 128
```
Please add a comment explaining this magic number.
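For illustration only, one way to address this would be a short comment documenting how the bound was chosen; the derivation below is a guess, not taken from the PR:

```python
# Assumed derivation (to be confirmed): a generous cap on tokens per forward
# pass, e.g. max running requests * max tokens contributed per request.
MAX_NUM_TOKENS: int = 4096 * 128
```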
```python
assert (
    num_tokens <= AiterTopKRoutingBuffers.MAX_NUM_TOKENS
), f"num_tokens {num_tokens} exceeds MAX_NUM_TOKENS {AiterTopKRoutingBuffers.MAX_NUM_TOKENS}. Consider disabling SGLANG_USE_AITER"
```
Should this be a fail-safe fallback rather than an assert when the number of tokens exceeds the size limit?
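A possible fail-safe shape for this, sketched under the assumption that a non-aiter top-k path remains available as a fallback (the helper name is hypothetical):

```python
import logging

logger = logging.getLogger(__name__)


def can_use_aiter_routing_buffers(num_tokens: int, max_num_tokens: int) -> bool:
    """Return True when the shared aiter buffers can hold this batch.

    Hypothetical helper: instead of asserting, the caller falls back to the
    regular (non-aiter) expert-selection path when the limit is exceeded.
    """
    if num_tokens > max_num_tokens:
        logger.warning(
            "num_tokens %d exceeds MAX_NUM_TOKENS %d; falling back to the "
            "non-aiter top-k path for this batch.",
            num_tokens,
            max_num_tokens,
        )
        return False
    return True
```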
```python
    routed_scaling_factor: Optional[float] = None,
) -> torch.Tensor:
    from sglang.srt.layers.moe.fused_moe_triton.fused_moe import fused_experts
    from sglang.srt.layers.moe.topk import select_experts
```
Can we move the import of select_experts closer to its usage?
```diff
 if _is_hip:
-    from aiter import ActivationType, QuantType
+    from aiter import ActivationType, QuantType, biased_grouped_topk
```
Can we move the import of biased_grouped_topk closer to its usage?
```python
if (
    _use_aiter
    and correction_bias is not None
    and hasattr(layer, "non_shared_topk_weights")
```
Nit: if the hasattr check incurs a measurable time penalty, a caching mechanism could help here; repeating the check on every iteration can cost some performance.
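A minimal sketch of such caching, assuming the buffers are attached to the layer once at startup so the result cannot change afterwards (the helper and cache-attribute names are hypothetical):

```python
def _layer_has_aiter_buffers(layer) -> bool:
    """Compute the hasattr check once per layer and cache the result."""
    cached = getattr(layer, "_has_aiter_routing_buffers", None)
    if cached is None:
        cached = hasattr(layer, "non_shared_topk_weights") and hasattr(
            layer, "non_shared_topk_ids"
        )
        # Plain Python attribute on the module; safe because the buffers are
        # only attached during model construction, not per forward pass.
        layer._has_aiter_routing_buffers = cached
    return cached
```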
```python
from sglang.srt.layers.moe.topk import select_experts

# Expert selection
topk_weights, topk_ids = select_experts(
```
select_experts already calls biased_grouped_topk internally.
Could you please try adjusting the parameters so that a separate external call to the function is not needed?
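For illustration, the intent is roughly the following; the argument names are assumptions and would need to be checked against the actual select_experts signature rather than taken from this PR:

```python
# Sketch only: if select_experts accepts the bias directly, the routing can
# stay inside it and the separate biased_grouped_topk call becomes unnecessary.
topk_weights, topk_ids = select_experts(
    hidden_states=x,
    router_logits=router_logits,
    top_k=top_k,
    use_grouped_topk=True,
    renormalize=renormalize,
    topk_group=topk_group,
    num_expert_group=num_expert_group,
    correction_bias=correction_bias,  # assumed parameter; verify before use
)
```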
Motivation
Improve biased_group_topk kernel performance by using the aiter implementation.
Request throughput increased from 1.88 req/s to 1.96 req/s.
Modifications
Checklist