
adapt ant moving to A2 single machine #362

Merged
Yael-X merged 2 commits into sgl-project:main from luanyundu:ant_a2 on Feb 5, 2026

Conversation

@luanyundu
Contributor

Use DEEPEP_NORMAL_LONG_SEQ_ROUND, DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS, and DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ to control whether ant moving is enabled. The code has been verified with 32k tokens in test_intranode.py and test_normal_and_low_latency.py.
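For reviewers unfamiliar with the three switches, the intended gating can be sketched as follows. This is an illustrative Python sketch only: the environment-variable names come from this PR, but the helper function, its defaults, and the parsing semantics are assumptions, not the actual implementation.

```python
import os

def long_seq_config(env=None):
    """Hypothetical helper: read the three switches named in this PR.

    Defaults (1 round, 0 per-round tokens, combine long-seq off) are
    assumptions for illustration, not the real defaults.
    """
    if env is None:
        env = os.environ
    rounds = int(env.get("DEEPEP_NORMAL_LONG_SEQ_ROUND", "1"))
    per_round_tokens = int(env.get("DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS", "0"))
    combine_long_seq = env.get("DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ", "0") == "1"
    return rounds, per_round_tokens, combine_long_seq
```

With no variables set, the sketch falls back to single-round behavior, which matches the PR's framing of multi-round processing as an opt-in mode.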

@gemini-code-assist
Contributor

Summary of Changes

Hello @luanyundu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances the Mixture-of-Experts (MoE) 'ant moving' functionality by implementing a multi-round processing strategy. This adaptation allows the system to efficiently handle substantially larger batch sizes and sequence lengths on a single A2 machine, which was previously a limitation. The changes involve a comprehensive overhaul of the underlying data flow, memory management, and kernel dispatching to seamlessly manage token processing across multiple rounds.

Highlights

  • Multi-Round Processing for MoE Operations: Introduced a multi-round processing mechanism for Mixture-of-Experts (MoE) operations, specifically for cam_moe_combine_normal and cam_moe_dispatch_normal, to handle larger sequence lengths and batch sizes on a single A2 machine. This is controlled by new environment variables and tiling keys.
  • Increased Batch Size Support: The upper bound for batch size (BS_UPPER_BOUND) has been significantly increased from 8000/32768 to 65536 in both combine and dispatch tiling functions, enabling the processing of larger data volumes.
  • Refactored Data Handling and Memory Management: Extensive changes were made to data slicing, buffer allocation, and communication logic to accommodate multi-round processing. This includes new offset calculations, dynamic buffer sizing based on rounds, and increased memory offsets (COMBINE_STATE_WIN_OFFSET, NOTIFY_DISPATCH_WIN_OFFSET). A new kernel, CamMoeCombineNormalMultiRound, was added to implement the multi-round combine logic.
  • Semantic Renaming and API Updates: The input parameter topk_idx has been semantically renamed to token_idx across relevant files and API definitions to better reflect its role in token indexing. New output parameters like sendTokenIdxSmall were added to the dispatch_layout API.

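The multi-round idea in the first highlight can be illustrated with a small host-side sketch. The helper below is hypothetical: the real round splitting happens inside the device kernels, and this sketch only shows the fixed-size-round arithmetic the highlights describe.

```python
def split_into_rounds(num_tokens, per_round_tokens):
    """Hypothetical sketch of multi-round token processing.

    Splits num_tokens into consecutive (start, count) slices of at most
    per_round_tokens each, mirroring the idea of processing a long
    sequence round by round instead of all at once.
    """
    assert per_round_tokens > 0, "per-round token count must be positive"
    rounds = (num_tokens + per_round_tokens - 1) // per_round_tokens  # ceil division
    return [
        (r * per_round_tokens,
         min(per_round_tokens, num_tokens - r * per_round_tokens))
        for r in range(rounds)
    ]
```

For example, 10 tokens with 4 tokens per round yields three rounds of sizes 4, 4, and 2; when the sequence fits in one round, the list degenerates to a single slice.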

Changelog
  • csrc/deepep/ops2/op_host/cam_moe_combine_normal.cpp
    • Renamed input parameter topk_idx to token_idx in the kernel signature.
    • Implemented conditional dispatch to CamMoeCombineNormalMultiRound or CamMoeCombineNormal based on a new tiling key for multi-round processing.
  • csrc/deepep/ops2/op_host/cam_moe_combine_normal_tiling.cc
    • Removed internal Mc2TilingUtils class definition, now using a global utility.
    • Renamed TOPK_IDX_INDEX to TOKEN_IDX_INDEX.
    • Increased BS_UPPER_BOUND from 8000 to 65536.
    • Added INIT_TILINGKEY and logic to set tilingKey based on maxRound for multi-round support.
    • Adjusted HCCL_BUFFSIZE calculation to account for realBs and maxRound.
  • csrc/deepep/ops2/op_host/cam_moe_dispatch_normal_tiling.cc
    • Removed internal Mc2TilingUtils class definition.
    • Increased BS_UPPER_BOUND from 32768 to 65536.
    • Adjusted HCCL_BUFFSIZE calculation for multi-round combine data.
  • csrc/deepep/ops2/op_host/dispatch_layout_tiling.cc
    • Added OUTPUT_SEND_TOKEN_IDX_SMALL_INDEX to output parameters.
    • Refactored Init and Process methods to support multi-round token distribution, including new members for round-specific offsets and counts.
  • csrc/deepep/ops2/op_host/mc2_tiling_utils.h
    • Refactored Mc2TilingUtils class and AICPU_BLOCK_DIM_A2 constant to be globally accessible by removing namespace and license header.
  • csrc/deepep/ops2/op_host/moe_distribute_combine_v2_tiling.cc
    • Updated usage of Mc2TilingUtils::GetMaxWindowSize() to reflect its global scope.
  • csrc/deepep/ops2/op_host/moe_distribute_dispatch_v2_tiling.cc
    • Updated usage of Mc2TilingUtils::GetMaxWindowSize() and AICPU_BLOCK_DIM_A2 to reflect their global scope.
  • csrc/deepep/ops2/op_host/notify_dispatch_tiling.cc
    • Removed internal Mc2TilingUtils class definition.
    • Added totalWinSize to NotifyDispatchInfo struct.
  • csrc/deepep/ops2/op_host/op_api/aclnn_cam_moe_combine_normal.cpp
    • Updated aclnnCamMoeCombineNormalGetWorkspaceSize function signature to use tokenIdx instead of topkIdx.
  • csrc/deepep/ops2/op_host/op_api/aclnn_cam_moe_combine_normal.h
    • Updated parameter description and function signature to use tokenIdx.
  • csrc/deepep/ops2/op_host/op_api/aclnn_dispatch_layout.h
    • Added sendTokenIdxSmall to the list of output parameters.
  • csrc/deepep/ops2/op_host/tiling_args.h
    • Increased COMBINE_STATE_WIN_OFFSET from 3MB to 8MB.
    • Increased NOTIFY_DISPATCH_WIN_OFFSET from 204MB to 404MB.
    • Added STATE_SIZE constant (2MB).
  • csrc/deepep/ops2/op_kernel/cam_moe_combine_normal.cpp
    • Included cam_moe_combine_normal_multi_round.h.
    • Defined TILINGKEY_MULTI_ROUND and TILINGKEY_SINGLE_ROUND.
    • Modified kernel to conditionally dispatch to multi-round or single-round implementation based on tiling key.
    • Renamed topkIdx to tokenIdx in kernel arguments.
  • csrc/deepep/ops2/op_kernel/cam_moe_combine_normal.h
    • Increased COMBINE_STATE_WIN_OFFSET from 3MB to 8MB.
    • Renamed topkIdx to tokenIdx in method signatures and member variables.
    • Added totalWinSize_ member and updated InitMagic to use it.
    • Replaced topkIdxBuf_ with tokenIdxBuf_.
    • Adjusted stateBuf_ tensor allocation and winDataSizeOffset_ calculation.
    • Added SyncFunc<AscendC::HardEvent::S_MTE3>(); for synchronization.
  • csrc/deepep/ops2/op_kernel/cam_moe_combine_normal_multi_round.h
    • New file: Implements the CamMoeCombineNormalMultiRound class, providing the core logic for multi-round token combining, including initialization, buffer management, and status synchronization across rounds.
  • csrc/deepep/ops2/op_kernel/cam_moe_combine_normal_tiling.h
    • Removed a redundant comment.
  • csrc/deepep/ops2/op_kernel/cam_moe_dispatch_normal.cpp
    • Modified Init calls for CamMoeDispatchNormal to pass new offset arguments (expert_global_offset, srcrank_in_expert_offset, r_in_srcrank_offset).
  • csrc/deepep/ops2/op_kernel/cam_moe_dispatch_normal.h
    • Increased COMBINE_STATE_WIN_OFFSET from 3MB to 8MB.
    • Added ROUND_STATE_OFFSET and FLOAT_NUM_PER_ALIGN constants.
    • Introduced new members (realMaxBatchSize, round, perRoundTokens, totalWinSize_, roundIndex, expertGlobalOffsetGT, srcrankInExpertOffsetGT, rInSrcrankOffsetGT, dstRoundStatusGT) for multi-round processing.
    • Added new methods: GetRoundStateAddrByRankId, SetRoundStatus, WaitRoundStatus, and ShareToOutputLongSeq.
    • Refactored InputToShare to adjust expertIdsCnt based on current round.
    • Refactored ShareToOutput into ShareToOutputLongSeq with updated logic for writing data using new offset tensors.
    • Implemented a multi-round processing loop in the Process method.
  • csrc/deepep/ops2/op_kernel/check_winsize.h
    • Removed license header.
    • Corrected array index from exceptionLocal[0] to exceptionLocal[1] in DataCopy.
  • csrc/deepep/ops2/op_kernel/comm_args.h
    • Decreased NOTIFY_DISPATCH_BUFF_OFFSET from 404MB to 202MB.
    • Added ROUND_STATE_MAX_SIZE (4KB) and BASE_ROUND_STATE_OFFSET (450KB).
  • csrc/deepep/ops2/op_kernel/dispatch_layout.h
    • Added ONE_PIECE constant.
    • Modified Init and Process to handle multi-round token distribution, including new members for round-specific offsets and counts, and iterating through rounds.
  • csrc/deepep/ops2/op_kernel/notify_dispatch.cpp
    • Removed specific tiling keys for float16, bfloat16, and float types, retaining only TILING_KEY_INT.
    • Added round, perRoundTokens, and totalWinSize as local variables from tiling data.
  • csrc/deepep/ops2/op_kernel/notify_dispatch.h
    • Updated KERNELS_ARGS_FUN_ALL2ALL and KERNELS_ARGS_CALL_ALL2ALL macros to include multi-round parameters (round, perRoundTokens, totalWinSize) and new offset arguments (expertGlobalOffset, srcrankInExpertOffset, rInSrcrankOffset).
    • Adjusted MAX_RANK_PER_CORE to 8 and MULTI_RANK_SIZE to 48.
    • Added new constexpr static int32_t for core indices and BATCH_ROUND.
    • Introduced new GlobalTensor and TBuf members for multi-round offsets and data.
    • Refactored AssembleSendData to handle multi-round data assembly using batch processing.
    • Modified InputToShareSlice and ShareToShareSlice to use int32_t for copyOffset.
    • Refactored output reordering functions (ReorderOutput, ReorderSendCountOutput, ReorderSendOffsetOutput) to accept rStart and currentBatchRounds parameters.
    • Added ReorderSendTokensPerRankOutput function.
    • Implemented BuildExpGlobalOffset, BuildsrcRankInExpOffset, and BuildRInSrcrankOffset for calculating new offset types.
    • Updated BuildTotalRecvTokens, BuildRecvCount, BuildRecvOffset, BuildMaxBs, BuildRecvTokenPerExp to iterate over rounds and use batch processing.
    • Modified InitSmallFullMesh to initialize new multi-round parameters and use totalWinSize_ for shared memory addresses.
  • csrc/deepep/ops2/op_kernel/notify_dispatch_tiling.h
    • Added totalWinSize member to NotifyDispatchInfo struct.
Activity
  • No specific activity (comments, reviews, progress) has been recorded for this pull request yet.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant changes to support multi-round processing for long sequences, referred to as "ant moving", on A2 single machines. The changes are extensive, touching operator definitions, tiling logic, and kernel implementations across numerous files. Key modifications include refactoring utility code into a shared header, updating operator APIs, and implementing new multi-round logic in the MoE kernels. A new kernel implementation for multi-round combine operation has been added. Overall, the changes appear to correctly implement the new feature. However, I've identified a critical issue related to potential memory corruption and a medium-severity issue regarding a logic mismatch in a check. Please see the detailed comments for suggestions.

…n 256

Co-authored-by: WSEmma <wusemma@163.com>
@Yael-X Yael-X merged commit 98bc6f6 into sgl-project:main Feb 5, 2026
5 checks passed
zhuyutong332 added a commit to zhuyutong332/sgl-kernel-npu that referenced this pull request Feb 6, 2026
* upstream/main:
  CI execution requirements for separating a2 and a3 (sgl-project#367)
  Fix the bug that total expert num greater than 256 or local expert num is less than 8 (sgl-project#364)
  adapt ant moving to A2 single machine (sgl-project#362)
1329009851 added a commit to 1329009851/sgl-kernel-npu that referenced this pull request Feb 11, 2026
…-npu into sgl-cmake2

* 'sgl-cmake2' of https://github.com/1329009851/sgl-kernel-npu:
  CI execution requirements for separating a2 and a3 (sgl-project#367)
  Fix the bug that total expert num greater than 256 or local expert num is less than 8 (sgl-project#364)
  adapt ant moving to A2 single machine (sgl-project#362)
  reset ci -- run test mixed running for experts on a2. (sgl-project#365)
  Revert "Build the deepep package with the chip model included. (sgl-project#274)" (sgl-project#363)
  fix:buffer control (sgl-project#361)
  Build the deepep package with the chip model included. (sgl-project#274)
  bugfix wrong packages build dir (sgl-project#360)
  bump version to 2026.02.01 (sgl-project#359)
  Cover the workflows cases on a3 (sgl-project#321)
  release follows naming convention (sgl-project#356)
  Modify notifydispatch to support DEEPEP_NORMAL_LONG_SEQ_ROUND up to 128. (sgl-project#352)
  fix the hanging bug (sgl-project#355)
  [Bugfix] Fix build script working with cann 8.5.0 (sgl-project#354)
  Modify the description of DeepEP in the README file. (sgl-project#348)
  Revert "Add scripts for building CMake files (sgl-project#344)" (sgl-project#353)
  Add scripts for building CMake files (sgl-project#344)
  Support x86_64 and aarch64 binary release (sgl-project#325)
  add function for deep-ep tests (sgl-project#301)
  [Doc] Improved README.md content and English grammar and integrated the DeepWiki badge for Ask AI (sgl-project#345)
zzx-study pushed a commit to zzx-study/sgl-kernel-npu that referenced this pull request Feb 28, 2026
* adapt ant moving to A2 single machine

* fix CI bug that misalign when localExpertsNum less than 8 or more than 256

Co-authored-by: WSEmma <wusemma@163.com>

---------

Co-authored-by: WSEmma <wusemma@163.com>
AndyKong2020 pushed a commit to AndyKong2020/sgl-kernel-npu that referenced this pull request Mar 24, 2026
* adapt ant moving to A2 single machine

* fix CI bug that misalign when localExpertsNum less than 8 or more than 256

Co-authored-by: WSEmma <wusemma@163.com>

---------

Co-authored-by: WSEmma <wusemma@163.com>


2 participants