
[Kernels][MoE] Fix legacy_routing to use bitmatrix-based routing path#38504

Merged
tjtanaa merged 10 commits into vllm-project:main from ROCm:akaratza_fix_legacy_routing on Apr 7, 2026

Conversation

@AndreasKaratzas AndreasKaratzas commented Mar 30, 2026

Three issues prevented test_gpt_oss_triton_kernels and GPQA serving from working on gfx950 (CDNA4) after the legacy routing deprecation:

  1. pack_bitmatrix sets bit 31 spuriously on HIP: Triton on HIP uses C-style integer division: -1 // 32 == 0 (not -1 as in Python). The masked-out lanes loaded with other=-1 produce div=0, and 1 << (-1 % 32) is undefined behavior that sets bit 31 on AMD. This causes expert 31 to appear in every bitmatrix row, corrupting the SparseMatrix routing metadata and dropping valid token-expert pairs. On the serving path this results in GPU memory access faults that crash the server. Fixed by adding a valid = indices >= 0 guard. On CUDA this is a no-op since Python-style floor division gives div=-1 which never matches any column offset.

  2. Test padding alignment too small for CDNA4: CDNA4MXScaleLayout.swizzle_data reshapes with SCALE_K // 8, requiring K % 256 == 0. The test used round_up(K, 64) which is only sufficient on Hopper (where HopperMXScaleLayout pads internally). Made alignment platform-conditional: ROCm uses 256/512 matching production values in mxfp4_round_up_hidden_size_and_intermediate_size; CUDA keeps the original 64/128.

  3. gfx950 GPQA eval configs missing tokenizer and TP: The amd/gpt-oss-20b-w-mxfp4-a-bf16 model repo declares tokenizer_class=TokenizersBackend which is not a valid HuggingFace tokenizer, causing server launch failure. Added --tokenizer openai/gpt-oss-20b. Also added --tensor-parallel-size 2 to all ROCm configs to prevent GPU memory faults with enforce-eager.
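To illustrate the padding constraint in issue 2, here is a minimal pure-Python sketch of the alignment arithmetic. The value of `K` is hypothetical; the point is that a size can be 64-aligned without being 256-aligned, which is exactly why the Hopper padding was insufficient for CDNA4:

```python
def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple."""
    return ((x + multiple - 1) // multiple) * multiple

# Hypothetical hidden size: already a multiple of 64, so round_up(K, 64)
# leaves it unchanged -- but CDNA4's swizzle needs K % 256 == 0.
K = 2880
assert round_up(K, 64) == 2880          # 64-alignment already satisfied
assert round_up(K, 64) % 256 != 0       # CDNA4's K % 256 == 0 requirement fails
assert round_up(K, 256) == 3072         # padding to 256 satisfies it
```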

Root cause detail

pack_bitmatrix loads topk_ids with other=-1 for out-of-bounds lanes. On HIP/ROCm:

-1 (int16) // 32 = 0      # C-style truncation toward zero
-1 (int16) %  32 = -1     # C-style remainder
1 << -1          = 1 << 31 # undefined behavior, sets bit 31 on AMD

This silently inserts expert 31 into the bitmatrix for every token, which corrupts col_sum (expert 31 gets count n_rows instead of 0), causing col_sorted_indx to have -1 entries that drop valid token-expert pairs from the computation. During serving, these corrupted indices lead to out-of-bounds memory accesses that crash the GPU.
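The divergence between the two division semantics can be reproduced in pure Python (illustrative only; the real code is a Triton kernel that lowers `//` and `%` differently on HIP vs. CUDA):

```python
# Pure-Python emulation of the integer semantics involved.

def c_div(a: int, b: int) -> int:
    """C-style division: truncates toward zero."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

def c_rem(a: int, b: int) -> int:
    """C-style remainder: satisfies a == c_div(a, b) * b + c_rem(a, b)."""
    return a - c_div(a, b) * b

# Python/CUDA semantics: the sentinel -1 maps to word -1, which never
# matches any column offset, so masked-out lanes are harmless.
assert -1 // 32 == -1 and -1 % 32 == 31

# HIP semantics: the sentinel maps to word 0, and the shift amount -1 is
# masked to 31 by AMD hardware, setting bit 31 -- the phantom "expert 31".
assert c_div(-1, 32) == 0 and c_rem(-1, 32) == -1
assert 1 << (-1 & 31) == 1 << 31
```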

Test plan

  • test_gpt_oss_triton_kernels.py: 5/5 passed (was 1 passed, 4 failed)
  • GPQA accuracy eval (gfx950 MXFP4 configs): 3/3 passed (was 1 passed, 1 GPU crash, 1 not reached)

[LEGACY] Initial Test plan and Motivation

legacy_routing used legacy_routing_from_sparsematrix which extracted col_sorted_indx directly from the topk() SparseMatrix, producing indices that differ from the bitmatrix-reconstructed path (missing -1 padding for empty expert slots). In this PR we align legacy_routing with the same code path used by triton_kernel_moe_forward and make_routing_data: unpack topk() results and route through legacy_routing_from_bitmatrix (tuple/ROCm) or make_routing_data (SparseMatrix/NVIDIA).
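The bitmatrix representation that the fixed path routes through can be sketched in pure Python (the real `pack_bitmatrix` is a Triton kernel; shapes and names here are illustrative). Note the `valid` guard on the sentinel lanes, which is the substance of fix 1:

```python
def pack_bitmatrix(topk_ids, n_experts):
    """Pack per-token expert ids into per-row 32-bit bitmask words.

    topk_ids: list of rows, each a list of expert ids; -1 marks a padded lane.
    Returns one row per token, each a list of (n_experts + 31) // 32 words.
    """
    n_words = (n_experts + 31) // 32
    bitmatrix = []
    for ids in topk_ids:
        words = [0] * n_words
        for e in ids:
            if e >= 0:  # the 'valid' guard added by this fix
                words[e // 32] |= 1 << (e % 32)
        bitmatrix.append(words)
    return bitmatrix

bm = pack_bitmatrix([[0, 33], [31, -1]], n_experts=64)
assert bm[0] == [1, 2]          # experts 0 and 33
assert bm[1] == [1 << 31, 0]    # expert 31 only; the -1 lane is skipped
```

Without the guard, every padded `-1` lane would OR bit 31 into word 0, which is exactly the corruption described above.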

Test plan

  • pytest tests/kernels/moe/test_gpt_oss_triton_kernels.py::test_legacy_routing -v

Motivation: https://buildkite.com/vllm/amd-ci/builds/7062/steps/canvas?jid=019d382d-cbac-41c4-a53f-f4f56244488e&tab=output

cc @kenroche

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the MoE routing logic to handle bitmatrix-based routing and introduces a mechanism in the scheduler to defer block freeing during asynchronous KV transfers. Specifically, it adds a _deferred_block_req_ids set to track these requests and ensures the engine continues stepping while transfers are pending. I have no feedback to provide.

@AndreasKaratzas AndreasKaratzas changed the title from Akaratza fix legacy routing to [Kernels][MoE] Fix legacy_routing to use bitmatrix-based routing path Mar 30, 2026
@AndreasKaratzas AndreasKaratzas added the rocm Related to AMD ROCm label Mar 30, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Mar 30, 2026
@AndreasKaratzas AndreasKaratzas marked this pull request as ready for review March 30, 2026 17:11

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@AndreasKaratzas AndreasKaratzas added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 30, 2026
Comment thread vllm/v1/core/sched/scheduler.py Outdated

 def has_finished_requests(self) -> bool:
-    return len(self.finished_req_ids) > 0
+    return len(self.finished_req_ids) > 0 or len(self._deferred_block_req_ids) > 0
Collaborator


Are these changes from the other PR, #38503?

Collaborator Author


Yep, you're right, I reverted that.

    if sm_first:
        logits = torch.softmax(logits, dim=-1)
    sparse_logits = topk(logits, n_expts_act, apply_softmax=not sm_first)
    return legacy_routing_from_sparsematrix(
Collaborator


After you remove this, legacy_routing_from_sparsematrix is no longer used anywhere. Let's remove it.

Moreover, please disclose the accuracy of the models.

I am not familiar with this part of the code.

CC @robertgshaw2-redhat @bnellnm

Collaborator Author


I removed that legacy routing logic. I also enabled the GPQA eval tests for GPT-OSS accuracy.


tjtanaa commented Apr 4, 2026

Does it affect performance?

@AndreasKaratzas

@tjtanaa Actually, I was about to cite the accuracy test results, but I think I need to rebase first to get clean logs: https://buildkite.com/vllm/amd-ci/builds/7267/steps/canvas?sid=019d553e-778b-452a-bde4-8882ae4cdcf2&tab=output

@AndreasKaratzas

GPQA accuracy (amd/gpt-oss-20b-w-mxfp4-a-bf16, TP=2, gfx950, threshold=0.568, tol=0.05)

| Config | main | PR branch |
|---|---|---|
| mxfp4-bf16-aiter | PASSED (0.5739) | PASSED (0.5600) |
| mxfp4-bf16-triton | PASSED (0.5701) | PASSED (0.5726) |
| mxfp4-fp8-triton | PASSED (0.5562) | PASSED (0.5593) |
| baseline | PASSED (0.5707) | PASSED (0.5530) |

Serving benchmark (amd/gpt-oss-20b-w-mxfp4-a-bf16, TP=2, gfx950, 512 prompts, rate=10)

1024 in / 1024 out

| Metric | main triton | main aiter | PR triton | PR aiter |
|---|---|---|---|---|
| Successful / Failed requests | 512 / 0 | 512 / 0 | 512 / 0 | 512 / 0 |
| Output token throughput (tok/s) | 392.62 | 540.17 | 467.22 | 527.19 |
| Mean TTFT (ms) | 69.90 | 38.70 | 80.01 | 37.85 |
| Mean TPOT (ms) | 109.03 | 54.68 | 123.46 | 54.23 |
| Request throughput (req/s) | 6.86 | 9.46 | 8.15 | 9.53 |

8192 in / 8192 out

| Metric | main triton | main aiter | PR triton | PR aiter |
|---|---|---|---|---|
| Successful / Failed requests | 512 / 0 | 512 / 0 | 512 / 0 | 512 / 0 |
| Output token throughput (tok/s) | 309.76 | 337.94 | 252.00 | 472.75 |
| Mean TTFT (ms) | 628.07 | 347.65 | 699.16 | 343.38 |
| Mean TPOT (ms) | 329.76 | 162.47 | 389.91 | 164.78 |
| Request throughput (req/s) | 4.05 | 4.63 | 4.04 | 7.26 |


@tjtanaa tjtanaa left a comment


LGTM

@github-project-automation github-project-automation Bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements Apr 7, 2026
@tjtanaa tjtanaa merged commit 2df2c85 into vllm-project:main Apr 7, 2026
66 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD Apr 7, 2026
@AndreasKaratzas AndreasKaratzas deleted the akaratza_fix_legacy_routing branch April 7, 2026 02:58
jacob-lou pushed a commit to jacob-lou/vllm that referenced this pull request Apr 7, 2026
…vllm-project#38504)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Jacob Lou <jacoblou0924@gmail.com>
puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request Apr 7, 2026
…vllm-project#38504)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026

Labels

gpt-oss Related to GPT-OSS models ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm v1

Projects

Status: Done
