
Return expert routing info to support MoE routing replay #9499

Open
KawaiiNotHawaii wants to merge 7 commits into sgl-project:main from KawaiiNotHawaii:cyrus/routing-replay

Conversation

@KawaiiNotHawaii

Motivation

Today SGLang can record MoE expert utilization, but it’s hard to attribute per-token routing decisions back to individual user requests when batching, prefill, and decode are interleaved. For training (e.g., GRPO), debugging, and routing analysis, we need a precise 1:1 mapping from recorded tokens to the originating request id (RID), and we want that routing info returned with the response. This makes it straightforward to implement routing-replay training of MoE models on top of SGLang.
This PR enables SGLang to return per-token expert routing in meta_info alongside the generated responses.

Modifications

  • Add a new return_expert_routing argument to the async_generate() signature.
  • Add rids to ForwardBatch so that recorders can access them and later re-map recorded tokens to each request accurately.
  • The mechanism can easily be extended to entry points other than async_generate().
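The RID re-mapping described in the second bullet can be sketched in plain Python. The function and argument names below are illustrative only, not the PR's actual internals; the sketch assumes the recorder saw tokens in batch order and that the forward batch carried one RID per request plus that request's token count:

```python
# Hypothetical sketch: re-map recorded per-token routing rows back to request ids.
def split_routing_per_request(topk_ids, rids, tokens_per_rid):
    """topk_ids: per-token top-k expert-id lists, concatenated over the batch.
    rids / tokens_per_rid: one entry per request, in batch order."""
    assert len(topk_ids) == sum(tokens_per_rid)
    out, start = {}, 0
    for rid, n in zip(rids, tokens_per_rid):
        out[rid] = topk_ids[start:start + n]  # this request's slice of the buffer
        start += n
    return out

routing = split_routing_per_request(
    topk_ids=[[0, 3], [1, 2], [5, 7], [0, 6], [2, 4]],
    rids=["req-a", "req-b"],
    tokens_per_rid=[3, 2],
)
print(routing["req-a"])  # [[0, 3], [1, 2], [5, 7]]
```

The same slicing idea generalizes to the (layers, tokens, top_k) tensor the recorder actually holds: the token axis is partitioned by request.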

Accuracy Tests

No numerical kernel changes; model outputs are unaffected unless return_expert_routing=True is requested, in which case only metadata is added.

Benchmarking and Profiling

We reuse the existing recorder and change the dtype of the expert routing matrix from int32 to uint8 to further reduce memory usage, so the recorder should not slow down in any way.
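As a rough, illustrative sanity check of the dtype change (the buffer shape here is assumed, not measured):

```python
# Back-of-envelope: recorder buffer of shape (num_layers, num_tokens, top_k).
layers, tokens, top_k = 27, 4096, 8   # illustrative sizes, not from the PR
int32_bytes = layers * tokens * top_k * 4  # 4 bytes per int32 element
uint8_bytes = layers * tokens * top_k * 1  # 1 byte per uint8 element
print(int32_bytes // uint8_bytes)  # 4: the uint8 buffer is 4x smaller
```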

Checklist

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Summary of Changes

Hello @KawaiiNotHawaii, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances SGLang's capability to provide detailed per-token expert routing information for Mixture-of-Experts (MoE) models. By enabling the return of this metadata alongside generated responses, it facilitates precise attribution of routing decisions to individual user requests, which is crucial for training, debugging, and in-depth analysis of MoE model behavior, especially in batched inference scenarios.

Highlights

  • Expert Routing Data Collection: Introduced new logic to collect and process per-token expert routing decisions, allowing them to be mapped back to individual request IDs (RIDs).
  • API Extension: The async_generate() function now accepts a return_expert_routing argument, enabling users to request this detailed metadata.
  • Memory Optimization: The data type for storing expert routing information has been optimized from int32 to uint8 to reduce memory footprint.
  • In-Memory Data Access: Added functionality to retrieve expert distribution records directly as Python objects, bypassing file I/O for more efficient data handling.
  • Concurrency Control: Implemented an asyncio.Lock to ensure safe and synchronized access during expert routing data collection.
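The concurrency-control point can be illustrated with a minimal self-contained sketch; `RoutingRecorder` below is a toy stand-in, not SGLang's actual recorder class:

```python
import asyncio

class RoutingRecorder:
    """Toy stand-in for the expert-distribution recorder."""
    def __init__(self):
        self._lock = asyncio.Lock()
        self.records = []

    async def record(self, rid, topk_ids):
        # Serialize access so concurrent generate() calls cannot
        # interleave partial recordings of different requests.
        async with self._lock:
            self.records.append((rid, topk_ids))

async def main():
    rec = RoutingRecorder()
    await asyncio.gather(*(rec.record(f"req-{i}", [i, i + 1]) for i in range(4)))
    return len(rec.records)

print(asyncio.run(main()))  # 4: all concurrent recordings landed
```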

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a valuable feature for Mixture-of-Experts (MoE) models by enabling the return of expert routing information. The implementation is well-structured, correctly handling asynchronous operations, data parallelism, and different generation stages. The changes are consistent across various modules, from the entrypoint engine to the scheduler and data structures. I have a few suggestions for code cleanup and improving robustness, but overall, this is a solid contribution.

Comment on lines +372 to +374
new = torch.full((L, need_tokens, K), -1,
dtype=self._topk_ids_of_layer.dtype,
device=self._topk_ids_of_layer.device)
Contributor


high

Using torch.full(..., -1, ...) with dtype=torch.uint8 will cause the fill value to wrap around to 255. If 255 is a valid expert ID, this could lead to subtle bugs. Additionally, uint8 limits the number of experts to 255. It would be safer to add an assertion in _DetailSinglePassGatherer.__init__ to ensure the number of experts is within the valid range for uint8 and use a fill value that is guaranteed not to be a valid expert ID, such as torch.iinfo(torch.uint8).max.

Suggested change
new = torch.full((L, need_tokens, K), -1,
dtype=self._topk_ids_of_layer.dtype,
device=self._topk_ids_of_layer.device)
new = torch.full((L, need_tokens, K), torch.iinfo(self._topk_ids_of_layer.dtype).max,
dtype=self._topk_ids_of_layer.dtype,
device=self._topk_ids_of_layer.device)
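The wraparound this comment warns about is easy to reproduce without torch; the stdlib `ctypes` module shows the same two's-complement reinterpretation:

```python
import ctypes

# Storing -1 into an unsigned 8-bit cell wraps to 255, which is exactly
# what torch.full(..., -1, dtype=torch.uint8) does to the fill value.
fill = ctypes.c_uint8(-1).value
print(fill)  # 255

# So if a model ever has 256 experts, expert id 255 would collide with
# the sentinel; hence the reviewer's suggested assertion on expert count.
```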

Comment on lines +418 to +451
# per_req = _split_routing_per_request(records_obj)
# Attach under meta_info; consumers can ignore if unused
# ret.setdefault("meta_info", {})["moe_routing_per_request"] = {
# rid: {
# "topk_ids_of_layer": v["topk_ids_of_layer"].tolist(),
# "positions": v["positions"],
# "physical_to_logical_map": v["physical_to_logical_map"].cpu().tolist(),
# }
# for rid, v in per_req.items()
# }

# Determine which RIDs we actually need to attach
if isinstance(ret, list):
wanted = {item.get("meta_info", {}).get("id") for item in ret if item.get("meta_info")}
else:
wanted = {ret.get("meta_info", {}).get("id")} if ret.get("meta_info") else set()
wanted = {rid for rid in wanted if rid} # drop Nones

# breakpoint()
# Fast single-pass subset aggregation
per_rid = _records_to_per_rids_subset(
records_obj,
wanted_rids=wanted,
allow_multi_active_sequences=False, # TODO flip to True only if you explicitly support fan-out
)

_attach_routing_to_ret(ret, per_rid)
# breakpoint()

# # TODO DEBUGGING ONLY
# breakpoint() # TODO NOTE what if one prompt reaches EOS. records-1 or simply padding token id in the input_ids, where we can try using the rwo_idx to match back
# # TODO MOVE per req all to cpu
# ret.setdefault("meta_info", {})["moe_routing_per_request"] = per_req

Contributor


medium

There are several commented-out blocks of code, including breakpoint() calls and old implementation logic. These should be removed to improve code clarity and maintainability.

Comment on lines +455 to +458
try:
await self.tokenizer_manager.stop_expert_distribution_record()
except Exception:
pass
Contributor


medium

Swallowing exceptions with a pass statement can hide important errors and make debugging difficult. It's better to at least log the exception to be aware of potential issues during the execution of stop_expert_distribution_record.

Suggested change
try:
await self.tokenizer_manager.stop_expert_distribution_record()
except Exception:
pass
try:
await self.tokenizer_manager.stop_expert_distribution_record()
except Exception as e:
logger.warning(f"Error stopping expert distribution record: {e}")

Comment on lines +749 to +758
# def dump(self, output_mode: _OutputMode):
# assert output_mode == "file"
# output = dict(
# records=self._records,
# # NOTE: This may change during recording, so here we say it is the "last" one
# last_physical_to_logical_map=self._expert_location_metadata.physical_to_logical_map,
# )
# _dump_to_file(
# f"expert_distribution_recorder_{time.time()}_{self._rank}.pt", output
# )
Contributor


medium

This commented-out implementation of the dump method should be removed to keep the codebase clean and avoid confusion.

# success/failure for the op
success: bool
# optional details
message: str = "" # message: Optional[str] = None
Contributor


medium

The commented-out type hint seems to be a leftover from development and should be removed for code cleanliness.

Suggested change
message: str = "" # message: Optional[str] = None
message: str = ""

@fzyzcjy
Collaborator

fzyzcjy commented Aug 22, 2025

Today SGLang can record MoE expert utilization, but it’s hard to attribute per-token routing decisions back to individual user requests when batching, prefill, and decode are interleaved.

Hmm, I think I have written something like that (for the EPLB simulator). I will check later.

@fzyzcjy fzyzcjy self-assigned this Aug 22, 2025
@KawaiiNotHawaii
Author

Today SGLang can record MoE expert utilization, but it’s hard to attribute per-token routing decisions back to individual user requests when batching, prefill, and decode are interleaved.

Hmm, I think I have written something like that (for the EPLB simulator). I will check later.

Hi, is this PR still going to be merged?

@lizipao

lizipao commented Oct 23, 2025

@KawaiiNotHawaii @fzyzcjy
Hi, I have a couple of questions:
What is the current status of this PR? Is it planned to be merged soon?
Implementing routing replay in Verl requires this feature; does SGLang now have an alternative, native solution for returning expert routing through this new implementation?

@KawaiiNotHawaii
Author

@KawaiiNotHawaii @fzyzcjy Hi, I have a couple of questions: What is the current status of this PR? Is it planned to be merged soon? Implementing routing replay in Verl requires this feature; does SGLang now have an alternative, native solution for returning expert routing through this new implementation?

The PR is still open and waiting to be merged, but it's already in production use on my side. Any help refining this PR would be welcome though!

@Cesilina

Cesilina commented Dec 3, 2025

Have you tried multi-node training with RL? Does it work?

@KawaiiNotHawaii
Author

Have you tried multi-node training with RL? Does it work?

Yes, and it works. One caveat: when the batch size is larger than 32, the expert recorder can randomly drop some tokens' expert distributions.

@Cesilina

Cesilina commented Dec 3, 2025

Have you tried multi-node training with RL? Does it work?

Yes, and it works. One caveat: when the batch size is larger than 32, the expert recorder can randomly drop some tokens' expert distributions.

That's weird. I tried adding your commits to my local SGLang. With return_expert_routing=True, topk_ids_of_layer gives the following result:

  1. The model config top_k is 6, yet the dump contains repeated full-range lists [0, 1, 2, ..., 63] and reports 'shape': {'num_layers': 27, 'num_tokens': 414, 'top_k': 8}.
  2. 'positions' runs [0, 1, 2, ..., 413], matching num_tokens.
  3. The per-token rows look like [24, 29, 36, 52, 56, 54, 255, 255], [0, 26, 32, 59, 63, 42, 255, 255], ...: 6 real expert ids padded to 8 slots. This seems normal, except for the 255s.
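Assuming 255 really is the recorder's uint8 padding sentinel for unused top-k slots (top_k padded from 6 to 8 here), a consumer can simply mask it out; this helper is illustrative, not part of the PR:

```python
PAD = 255  # assumed uint8 sentinel for unused top-k slots

def strip_padding(row, pad=PAD):
    """Drop sentinel entries from one token's top-k expert-id row."""
    return [e for e in row if e != pad]

print(strip_padding([24, 29, 36, 52, 56, 54, 255, 255]))
# [24, 29, 36, 52, 56, 54]
```

Note this only works while no real expert id equals 255, which is the uint8 collision risk the review comment above raises.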

@KawaiiNotHawaii
Author

Have you tried multi-node training with RL? Does it work?

Yes, and it works. One caveat: when the batch size is larger than 32, the expert recorder can randomly drop some tokens' expert distributions.

That's weird. I tried adding your commits to my local SGLang. With return_expert_routing=True, topk_ids_of_layer gives rows like [24, 29, 36, 52, 56, 54, 255, 255]; this seems normal, except for the 255s.

It has only been tested on Qwen models so far. DeepSeek V2 has shared experts, which I suspect may keep the expert recorder from functioning as expected. Can you try it on Qwen 30B A3B first?

@Cesilina

Cesilina commented Dec 3, 2025

Have you tried multi-node training with RL? Does it work?

Yes, and it works. One caveat: when the batch size is larger than 32, the expert recorder can randomly drop some tokens' expert distributions.

That's weird. I tried adding your commits to my local SGLang. With return_expert_routing=True, topk_ids_of_layer gives rows like [24, 29, 36, 52, 56, 54, 255, 255]; this seems normal, except for the 255s.

It has only been tested on Qwen models so far. DeepSeek V2 has shared experts, which I suspect may keep the expert recorder from functioning as expected. Can you try it on Qwen 30B A3B first?

OK, I will try.

@KawaiiNotHawaii
Author

have you try multi-nodes training with rl? Is it ok?

Yes and it works. But one caveat is that when the batch size is larger than 32, the expert recorder can randomly drop some token's expert distribution.

It is weired. I try to add your commits in my local sgl. while return_expert_routing is true, topk_ids_of_layer is following result:

  1. model config topk is 6, yet the recorded shape is `{'num_layers': 27, 'num_tokens': 414, 'top_k': 8}` (the tail of this part of the dump is just the full expert list `[0, 1, 2, ..., 63]` repeated row after row);
  2. per-token rows such as `[24, 29, 36, 52, 56, 54, 255, 255]` and `[11, 17, 49, 57, 14, 60, 255, 255]`, with `'positions'` running from 0 to 413;
  3. more rows like `[0, 26, 32, 59, 63, 42, 255, 255]`, `[35, 38, 40, 53, 58, 5, 255, 255]`, `[4, 15, 17, 32, 54, 37, 255, 255]`, and so on. This seems normal, except for the 255 entries.

It has only been tested on Qwen models. DeepSeek V2 has shared experts, which I suspect may keep the expert recorder from functioning as expected. Can you try it on Qwen 30B A3B first?

OK, I will try.

It seems you always have 2 slots of 255. 255 is the default expert index that the recorder prefills. Also, DeepSeek V2 has 2 shared experts if I remember correctly, so I guess those two shared experts are not being recorded correctly by the SGLang expert distribution recorder.
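Given that, anyone consuming the routing dump should treat 255 as a padding sentinel and drop it before replay. A minimal sketch in plain Python; the nested `[num_layers][num_tokens][top_k]` list layout and the 255 sentinel are assumptions taken from the dump in this thread, not a documented API:

```python
PAD = 255  # assumed sentinel prefilled into unused top-k slots (uint8 recorder)

def strip_padding(topk_ids_of_layer):
    """Drop the 255 padding slots from a [num_layers][num_tokens][top_k] dump."""
    return [
        [[expert for expert in row if expert != PAD] for row in layer]
        for layer in topk_ids_of_layer
    ]

# Example rows from the dump above: model top_k is 6, recorder width is 8,
# so the last 2 slots of each row are padding.
dump = [[[24, 29, 36, 52, 56, 54, 255, 255], [11, 17, 49, 57, 14, 60, 255, 255]]]
print(strip_padding(dump)[0][0])  # [24, 29, 36, 52, 56, 54]
```

Note that filtering this way only works if 255 can never be a real expert index, which holds here since the model has 64 experts; shared experts, if recorded at all, would need a different convention.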
