Return expert routing info to support MoE routing replay #9499
KawaiiNotHawaii wants to merge 7 commits into sgl-project:main
Conversation
Summary of Changes
Hello @KawaiiNotHawaii, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request enhances SGLang's capability to provide detailed per-token expert routing information for Mixture-of-Experts (MoE) models. By enabling the return of this metadata alongside generated responses, it facilitates precise attribution of routing decisions to individual user requests, which is crucial for training, debugging, and in-depth analysis of MoE model behavior, especially in batched inference scenarios.
Highlights
- Expert Routing Data Collection: Introduced new logic to collect and process per-token expert routing decisions, allowing them to be mapped back to individual request IDs (RIDs).
- API Extension: The `async_generate()` function now accepts a `return_expert_routing` argument, enabling users to request this detailed metadata.
- Memory Optimization: The data type for storing expert routing information has been optimized from `int32` to `uint8` to reduce memory footprint.
- In-Memory Data Access: Added functionality to retrieve expert distribution records directly as Python objects, bypassing file I/O for more efficient data handling.
- Concurrency Control: Implemented an `asyncio.Lock` to ensure safe and synchronized access during expert routing data collection.
Code Review
This pull request introduces a valuable feature for Mixture-of-Experts (MoE) models by enabling the return of expert routing information. The implementation is well-structured, correctly handling asynchronous operations, data parallelism, and different generation stages. The changes are consistent across various modules, from the entrypoint engine to the scheduler and data structures. I have a few suggestions for code cleanup and improving robustness, but overall, this is a solid contribution.
```python
new = torch.full((L, need_tokens, K), -1,
                 dtype=self._topk_ids_of_layer.dtype,
                 device=self._topk_ids_of_layer.device)
```

Using `torch.full(..., -1, ...)` with `dtype=torch.uint8` will cause the fill value to wrap around to 255. If 255 is a valid expert ID, this could lead to subtle bugs. Additionally, `uint8` limits the number of experts to 255. It would be safer to add an assertion in `_DetailSinglePassGatherer.__init__` to ensure the number of experts is within the valid range for `uint8`, and to use a fill value that is guaranteed not to be a valid expert ID, such as `torch.iinfo(torch.uint8).max`.

Suggested change:

```python
new = torch.full((L, need_tokens, K), torch.iinfo(self._topk_ids_of_layer.dtype).max,
                 dtype=self._topk_ids_of_layer.dtype,
                 device=self._topk_ids_of_layer.device)
```
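For reference, a minimal sketch of the guard this comment suggests, assuming the number of experts is available when the gatherer is constructed (the helper name and where it is called from are hypothetical; the review only asks for an assertion in `_DetailSinglePassGatherer.__init__`):

```python
import torch

# Reserve 255 as the "no expert recorded" sentinel for uint8 storage.
_UINT8_SENTINEL = torch.iinfo(torch.uint8).max

def _check_uint8_expert_capacity(num_experts: int) -> None:
    # Expert IDs must fit in 0..254 so the sentinel never collides with a real ID.
    assert num_experts < _UINT8_SENTINEL, (
        f"uint8 routing storage supports at most {_UINT8_SENTINEL - 1} experts, "
        f"got {num_experts}"
    )
```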
```python
# per_req = _split_routing_per_request(records_obj)
# Attach under meta_info; consumers can ignore if unused
# ret.setdefault("meta_info", {})["moe_routing_per_request"] = {
#     rid: {
#         "topk_ids_of_layer": v["topk_ids_of_layer"].tolist(),
#         "positions": v["positions"],
#         "physical_to_logical_map": v["physical_to_logical_map"].cpu().tolist(),
#     }
#     for rid, v in per_req.items()
# }

# Determine which RIDs we actually need to attach
if isinstance(ret, list):
    wanted = {item.get("meta_info", {}).get("id") for item in ret if item.get("meta_info")}
else:
    wanted = {ret.get("meta_info", {}).get("id")} if ret.get("meta_info") else set()
wanted = {rid for rid in wanted if rid}  # drop Nones

# breakpoint()
# Fast single-pass subset aggregation
per_rid = _records_to_per_rids_subset(
    records_obj,
    wanted_rids=wanted,
    allow_multi_active_sequences=False,  # TODO flip to True only if you explicitly support fan-out
)

_attach_routing_to_ret(ret, per_rid)
# breakpoint()

# # TODO DEBUGGING ONLY
# breakpoint()  # TODO NOTE what if one prompt reaches EOS. records-1 or simply padding token id in the input_ids, where we can try using the rwo_idx to match back
# # TODO MOVE per req all to cpu
# ret.setdefault("meta_info", {})["moe_routing_per_request"] = per_req

try:
    await self.tokenizer_manager.stop_expert_distribution_record()
except Exception:
    pass
```
Swallowing exceptions with a pass statement can hide important errors and make debugging difficult. It's better to at least log the exception to be aware of potential issues during the execution of stop_expert_distribution_record.
```python
try:
    await self.tokenizer_manager.stop_expert_distribution_record()
except Exception:
    pass
```

Suggested change:

```python
try:
    await self.tokenizer_manager.stop_expert_distribution_record()
except Exception as e:
    logger.warning(f"Error stopping expert distribution record: {e}")
```
```python
# def dump(self, output_mode: _OutputMode):
#     assert output_mode == "file"
#     output = dict(
#         records=self._records,
#         # NOTE: This may change during recording, so here we say it is the "last" one
#         last_physical_to_logical_map=self._expert_location_metadata.physical_to_logical_map,
#     )
#     _dump_to_file(
#         f"expert_distribution_recorder_{time.time()}_{self._rank}.pt", output
#     )
```
```python
# success/failure for the op
success: bool
# optional details
message: str = ""  # message: Optional[str] = None
```
Hmm, I think I have written something like that (for the EPLB simulator). I will check later.
Hi, is this PR still able to be merged?

@KawaiiNotHawaii @fzyzcjy

The PR is still open and waiting to be merged, but I am already using it in production myself. Any help on refining this PR would be welcome, though!

Have you tried multi-node training with RL? Does it work?

Yes, and it works. One caveat: when the batch size is larger than 32, the expert recorder can randomly drop some tokens' expert distributions.
It is weird. I tried adding your commits to my local SGLang. When return_expert_routing is true, topk_ids_of_layer gives the following result:

It has only been tested on Qwen models. But DeepSeek V2 has shared experts, which I guess may cause the expert recorder to not function as expected. Can you try it on Qwen 30B A3B first?

OK, I will try.

It seems you always have 2 slots of 255. 255 is the default expert index prefilled. Also, DeepSeek V2 has 2 shared experts if I remember correctly. So I guess those two shared experts are not recorded correctly by the SGLang expert distribution recorder.
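As a side note, a minimal post-processing sketch for masking out the 255 sentinel when inspecting the returned routing; the `topk_ids_of_layer` field comes from this PR's diff, while the helper name and the commented access path are only illustrative assumptions:

```python
import numpy as np

SENTINEL = 255  # default prefilled expert index in the uint8 routing matrix

def valid_routing_mask(topk_ids_of_layer: np.ndarray) -> np.ndarray:
    # True where a real expert ID was recorded, False for prefilled sentinel slots.
    return topk_ids_of_layer != SENTINEL

# Example: count how many top-k slots were never filled for one request.
# topk_ids = np.array(routing["topk_ids_of_layer"])
# missing = (~valid_routing_mask(topk_ids)).sum()
```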
Motivation
Today SGLang can record MoE expert utilization, but it’s hard to attribute per-token routing decisions back to individual user requests when batching, prefill, and decode are interleaved. For training (e.g., GRPO), debugging, and routing analysis, we need a precise 1:1 mapping from recorded tokens to the originating request id (RID), and we want that routing info returned with the response. In this way, training MoE models with routing replay can be implemented easily with SGLang.
This PR enables SGLang to return per-token expert routing in meta_info along with the responses generated.
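A minimal usage sketch, assuming the offline `Engine` entrypoint; `return_expert_routing` is the argument added by this PR, while the `moe_routing_per_request` key and its fields follow the structure sketched in the diff, so treat the exact names and the model path as assumptions:

```python
import asyncio
import sglang as sgl

async def main():
    engine = sgl.Engine(model_path="Qwen/Qwen1.5-MoE-A2.7B")  # any MoE checkpoint
    out = await engine.async_generate(
        "Explain mixture-of-experts routing in one sentence.",
        sampling_params={"max_new_tokens": 32},
        return_expert_routing=True,  # new argument added by this PR
    )
    rid = out["meta_info"]["id"]
    routing = out["meta_info"]["moe_routing_per_request"][rid]
    # Per-token, per-layer top-k expert IDs (255 = sentinel / not recorded).
    print(routing["topk_ids_of_layer"])
    engine.shutdown()

asyncio.run(main())
```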
Modifications
- Added a `return_expert_routing` arg to the `async_generate()` function signature.
- Return per-token expert routing info in `meta_info` when `return_expert_routing=True` is passed to `async_generate()`.

Accuracy Tests
No numerical kernel changes; model outputs are unaffected unless return_expert_routing=True is requested, in which case only metadata is added.
Benchmarking and Profiling
We reuse the existing recorder and change the dtype of the expert routing info matrix from int32 to uint8 to further reduce memory usage, so nothing should slow down the recorder in any manner.
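To illustrate the dtype change, a rough back-of-the-envelope calculation; the layer count, token count, and top-k below are made-up example numbers, not measurements from this PR:

```python
# Routing matrix shape: (num_layers, num_tokens, top_k)
num_layers, num_tokens, top_k = 48, 8192, 8  # illustrative values only

int32_bytes = num_layers * num_tokens * top_k * 4
uint8_bytes = num_layers * num_tokens * top_k * 1
print(int32_bytes / 2**20, "MiB with int32")  # ~12.0 MiB
print(uint8_bytes / 2**20, "MiB with uint8")  # ~3.0 MiB, a 4x reduction
```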
Checklist