[Fix]Fix one-sided MoE padding sentinel for local expert maps #42034

Open
Kevin-XiongC wants to merge 2 commits into vllm-project:main from Kevin-XiongC:fix-flashinfer-one-sided-padding-sentinel

@Kevin-XiongC Kevin-XiongC commented May 8, 2026

Summary

Fix FlashInfer NVLink one-sided MoE dispatch padding when expert_map is present.

The one-sided all2all path pads each source rank up to runtime_max_tokens_per_rank tokens and previously filled the padded slots with -1 as the invalid expert id. Some downstream kernels index expert_map before filtering padded tokens, so the -1 is used as an index: it can alias the last expert_map entry or read out of bounds, and either outcome corrupts local-expert routing.
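
To make the padding step concrete, here is a minimal pure-Python sketch. It is not the vLLM or FlashInfer code: the helper name pad_topk_ids, its top_k parameter, and the list-of-rows layout are assumptions for illustration. Only runtime_max_tokens_per_rank and the notion of an invalid expert id come from the description above.

```python
def pad_topk_ids(topk_ids, top_k, runtime_max_tokens_per_rank, invalid_token_expert_id):
    """Pad one source rank's topk_ids to a fixed number of token rows.

    topk_ids holds one row per real token, each with top_k global expert ids.
    Hypothetical helper: per the PR description, the real padding happens
    inside the dispatch call, not in Python.
    """
    if len(topk_ids) > runtime_max_tokens_per_rank:
        raise ValueError("more tokens than runtime_max_tokens_per_rank")
    num_pad = runtime_max_tokens_per_rank - len(topk_ids)
    # Every padded row carries the sentinel in all top_k slots.
    padding = [[invalid_token_expert_id] * top_k for _ in range(num_pad)]
    return [list(row) for row in topk_ids] + padding


# Two real tokens padded to four rows with the old sentinel, -1.
padded = pad_topk_ids([[0, 3], [2, 1]], top_k=2,
                      runtime_max_tokens_per_rank=4,
                      invalid_token_expert_id=-1)
```

The padded rows travel with the real ones, so whatever value fills them must be safe for every consumer of the received tensor.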

Without this fix, the Marlin MoE path can see padded -1 expert ids after dispatch and trigger an illegal memory access (IMA) when those ids are used to index the local expert map. When an expert map is present, this PR instead pads with a valid global expert id that is non-local to the current EP rank. Paths without expert_map keep using the out-of-range num_experts sentinel.

Source-level rationale

The original padding value is unsuitable for the expert_map path because it violates the contract that downstream kernels rely on:

  • FlashInferNVLinkOneSidedPrepareAndFinalize.prepare() sends topk_ids as one of the all2all payloads and passes invalid_token_expert_id to moe_alltoall.dispatch(). This means padded tokens can be materialized into the received topk_ids tensor.
  • EP expert_map is a global-expert-id to local-expert-id table with shape [global_num_experts]; non-local experts are represented by values of -1 inside the table. It is not designed to be indexed with a negative global expert id.
  • The CUDA MoE permute preprocess path used by expert-map-aware MoE kernels indexes the map before skipped/non-local tokens are filtered: preprocessTopkIdLauncher(..., expert_map_ptr, n_expert, ...) runs when expert_map is present, and preprocessTopkIdKernel does local_expert_idx = smem_expert_map[topk_id] for each topk_id.
  • Therefore, a padded topk_id == -1 is already invalid before any later filtering can happen. In the kernel it becomes expert_map[-1] rather than "a skipped token", which can read the wrong entry or hit an illegal memory access. This is the path where Marlin MoE can hit an IMA.
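
The indexing hazard described in the bullets above can be reproduced with plain Python lists. This is an illustrative sketch only: the 4-expert, 2-rank layout and the helper lookup_local_expert are hypothetical, and the lookup merely mirrors the shape of local_expert_idx = smem_expert_map[topk_id].

```python
# Hypothetical setup: 4 global experts split across 2 EP ranks.
# This is the view from EP rank 1, which owns global experts 2 and 3
# as local experts 0 and 1. Non-local experts map to -1.
expert_map = [-1, -1, 0, 1]  # global expert id -> local expert id


def lookup_local_expert(expert_map, topk_id):
    # The lookup runs before any filtering of padded or non-local tokens,
    # so topk_id is used as an index no matter what it means.
    return expert_map[topk_id]


# Old sentinel: -1 is treated as an index, not as "skip this token".
old_padding = lookup_local_expert(expert_map, -1)

# New sentinel: global expert 0 is valid but owned by rank 0.
new_padding = lookup_local_expert(expert_map, 0)
```

With a Python list, the -1 index wraps to the last entry, so old_padding resolves to local expert 1 and the padded token looks like real work for this rank. In a CUDA kernel the same index would instead address memory before the start of the array. new_padding resolves to -1, which the existing expert-map logic already treats as non-local.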

The replacement sentinel is a valid global expert id owned by another EP rank. It is safe to index into expert_map, maps to -1 for the local rank, and is then treated as non-local/skipped by the existing expert-map logic.
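
One way to choose such a sentinel is sketched below. The function name pick_invalid_token_expert_id, the first-match search, and the error raised when every expert is local are all assumptions made for illustration. The PR's actual selection logic is not shown in this description.

```python
def pick_invalid_token_expert_id(expert_map, num_experts):
    """Return an expert id to use for padded tokens (illustrative sketch).

    expert_map is a list of length num_experts mapping global expert id to
    local expert id, with -1 for experts owned by another EP rank, or None
    when no expert map is in use.
    """
    if expert_map is None:
        # No map to index: keep the out-of-range num_experts sentinel.
        return num_experts
    for global_id, local_id in enumerate(expert_map):
        if local_id == -1:
            # Valid index into expert_map that resolves to "non-local".
            return global_id
    # Assumption of this sketch: refuse rather than guess when the
    # current rank owns every expert.
    raise ValueError("no non-local expert available for padding")


rank0_sentinel = pick_invalid_token_expert_id([0, 1, -1, -1], 4)
rank1_sentinel = pick_invalid_token_expert_id([-1, -1, 0, 1], 4)
no_map_sentinel = pick_invalid_token_expert_id(None, 4)
```

In this sketch each rank picks the lowest global id it does not own, so rank 0 pads with expert 2 and rank 1 pads with expert 0. Any non-local id would satisfy the same property.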

AI assistance was used to prepare this PR.

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the FlashInferNVLinkOneSidedPrepareAndFinalize.prepare method to calculate a non-negative invalid_token_expert_id instead of using a hardcoded -1. This change ensures that padding tokens use an expert ID that is invalid for the local rank but non-negative, avoiding potential indexing issues with expert_map. Additionally, new unit tests were added to verify the correct calculation of this ID in different scenarios. I have no feedback to provide.

@Kevin-XiongC force-pushed the fix-flashinfer-one-sided-padding-sentinel branch from 3c70b82 to 857b3d8 on May 8, 2026 07:13
@Kevin-XiongC changed the title from "Fix one-sided MoE padding sentinel for local expert maps" to "[Fix]Fix one-sided MoE padding sentinel for local expert maps" on May 8, 2026
…t#64)

Use a non-local valid expert id for one-sided dispatch padding when an expert map is present so downstream kernels that index expert_map before filtering do not see negative expert ids. Keep the out-of-range sentinel for paths without expert_map.

Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Kevin-XiongC <kevin_xiong1997@outlook.com>
@Kevin-XiongC force-pushed the fix-flashinfer-one-sided-padding-sentinel branch from 857b3d8 to 634dca5 on May 8, 2026 07:29