[Bugfix][RoCM] GPT-OSS + Expert Parallel #35791
varun-sundar-rabindranath wants to merge 2 commits into vllm-project:main
Conversation
@zyongye @tjtanaa @mgoin @elizabetht PTAL! Thanks 🙌
Code Review
This pull request addresses a critical bug causing segmentation faults on RoCM when using expert parallelism with GPT-OSS models. The fix correctly handles invalid expert IDs by adding an explicit check, which makes the implementation more robust and platform-independent. The pull request also includes a performance optimization by transposing the bitmatrix for more efficient memory access, along with a compatibility update for the Bitmatrix constructor. The changes are well-justified and correctly implemented.
Hi @varun-sundar-rabindranath, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
```diff
-        (n_rows, bm_cols), dtype=torch.uint32, device=topk_ids.device
+        (bm_cols, triton.cdiv(n_rows, 32) * 32),
+        dtype=torch.uint32,
+        device=topk_ids.device,
     )
```
Defensively use 32 directly here. Aliasing it with BLOCK_SIZE_K will lead to a wrong definition of the bitmatrix if BLOCK_SIZE_K ever changes.
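To make this concrete, here is a hedged sketch of the allocation (the sizes and the meaning of `bm_cols` are assumptions based on the diff above, not taken from the PR): the 32 is the bit width of a `uint32` word of the bitmatrix, a fixed property of the storage format, whereas `BLOCK_SIZE_K` is a tunable tile size.

```python
import torch
import triton

# Illustrative sizes, not values from the PR.
n_rows, n_experts = 1024, 128

bm_cols = triton.cdiv(n_experts, 32)  # assumed: one uint32 word per 32 experts

# Hard-coding 32 keeps the layout tied to the uint32 word width. Reusing a
# tuning constant like BLOCK_SIZE_K here would silently corrupt the layout
# the moment that tile size is retuned.
bitmatrix = torch.empty(
    (bm_cols, triton.cdiv(n_rows, 32) * 32),  # rows padded to a multiple of 32
    dtype=torch.uint32,
    device="cuda",
)
```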
Nice work tracking this down.
I'd like to understand if this also affects CUDA, or if we could be causing performance regressions in some cases.
Tracking this down further:
The current code assumes that -1 // 32 will evaluate to -1 and the comparison will fail. However, this is not always the case: on RoCM, I see -1 // 32 evaluate to 0, which results in setting the -1 % 32 = 31st expert. This is incorrect.
This appears to be caused by a difference in integer-division behavior between Triton and Python/NumPy, so it should affect both NVIDIA and AMD GPUs:
https://triton-lang.org/main/python-api/triton-semantics.html#differences-with-numpy
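For reference, a small pure-Python illustration of the gap (per the link above, Triton's `//` truncates toward zero like C, while Python's `//` floors):

```python
# Python floor division rounds toward negative infinity, so an invalid
# id stays negative and a comparison against offs >= 0 fails as intended.
assert -1 // 32 == -1

# C-style / Triton-style integer division truncates toward zero.
def trunc_div(a: int, b: int) -> int:
    return int(a / b)  # emulate truncation for small ints

# Under truncation the invalid id aliases word 0 of the bitmatrix...
assert trunc_div(-1, 32) == 0
# ...and -1 % 32 == 31 under Python semantics, i.e. the 31st expert's bit.
assert -1 % 32 == 31
```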
Yes. I can check on CUDA 👍
Checked on CUDA. cc @tlrmchlsmth
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
This pull request has merge conflicts that must be resolved before it can be merged.
fixed in #38504 |
Purpose
On a local source-build using `python setup.py develop` on RoCM, gpt-oss + expert-parallel segfaults.

Repro command:

```
vllm serve openai/gpt-oss-120b --data-parallel-size 2 --enable-expert-parallel
```

Why: When using expert-parallel, the invalid topk-ids are marked as `-1`. The bitmatrix construction kernel is supposed to ignore these. This is achieved by comparing `topk_id // 32` with `offs` (which are all >= 0) before writing into the bitmatrix at vllm/vllm/model_executor/layers/fused_moe/gpt_oss_triton_kernels_moe.py (line 100 at 2a9e334).

The current code assumes that `-1 // 32` will evaluate to `-1` and the comparison will fail. However, this is not always the case: on RoCM, I see `-1 // 32` evaluate to `0`, which results in setting the `-1 % 32 = 31`st expert. This is incorrect.

Fix: Robustify the `invalid_expert` check by including the topk-ids directly in the masking operation.

Note:
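For intuition, a minimal NumPy sketch of the intended bitmatrix semantics with the explicit invalid-expert check; the function name, loop structure, and sample inputs are illustrative only — the real implementation is the Triton kernel in gpt_oss_triton_kernels_moe.py:

```python
import numpy as np

def build_bitmatrix(topk_ids: np.ndarray, n_experts: int) -> np.ndarray:
    """Illustrative bitmatrix build: row = token, bit e % 32 of word e // 32
    is set when the token routes to expert e. Invalid slots are marked -1
    and must be skipped explicitly rather than via floor-division semantics."""
    n_words = (n_experts + 31) // 32
    bitmatrix = np.zeros((topk_ids.shape[0], n_words), dtype=np.uint32)
    for row, ids in enumerate(topk_ids):
        for e in ids:
            if e < 0:  # the explicit invalid-expert check this PR adds
                continue
            bitmatrix[row, e // 32] |= np.uint32(1) << np.uint32(e % 32)
    return bitmatrix

# Tokens with an invalid (-1) slot, as expert parallel produces for
# experts owned by another rank.
tk = np.array([[0, 5, -1], [31, 32, 33]], dtype=np.int64)
print(build_bitmatrix(tk, n_experts=64))
# Without the check, truncating division would map -1 to word 0 and could
# spuriously set the 31st expert's bit.
```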
Test Plan
server:

```
vllm serve openai/gpt-oss-120b --data-parallel-size 2 --enable-expert-parallel
```

eval:

```
OPENAI_API_KEY=empty python -m gpt_oss.evals --model openai/gpt-oss-120b --eval gpqa --n-threads 512 --reasoning-effort low --base-url http://localhost:8000/v1
```

Test Result

`vllm serve openai/gpt-oss-120b --data-parallel-size 2 --enable-expert-parallel` starts up fine, i.e. no segfault.

eval:

```
[{'eval_name': 'gpqa', 'model_name': 'openai__gpt-oss-120b-low_temp1.0_20260302_192828', 'metric': 0.6527777777777778}]
```