
[Bugfix][RoCM] GPT-OSS + Expert Parallel #35791

Closed

varun-sundar-rabindranath wants to merge 2 commits into vllm-project:main from neuralmagic:varun/bitmatrix-fix

Conversation

varun-sundar-rabindranath (Contributor) commented Mar 2, 2026

Purpose

On a local source build (`python setup.py develop`) on RoCM, gpt-oss + expert-parallel segfaults.

Repro command: vllm serve openai/gpt-oss-120b --data-parallel-size 2 --enable-expert-parallel

Why: When using expert-parallel, invalid topk-ids are marked as -1, and the bitmatrix construction kernel is supposed to ignore them. It does so by comparing `topk_id // 32` against `offs` (which are all >= 0) when writing into the bitmatrix:

```python
div[:, :, None] == offs[None, None, :], (one << rem)[:, :, None], 0
```

The current code assumes that `-1 // 32` evaluates to -1, so the comparison fails for invalid ids. However, this is not always the case: on RoCM, `-1 // 32` evaluates to 0, which then sets bit `-1 % 32 = 31`, i.e. incorrectly marks the 31st expert as active.

Fix: Robustify the invalid-expert check by including the topk-ids directly in the masking operation, as sketched below.
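A minimal PyTorch sketch of the fixed masking (not the actual Triton kernel; names like `topk_ids`, `div`, `rem`, and `offs` follow the snippet above, and the shapes are illustrative):

```python
import torch

def build_bitmatrix(topk_ids: torch.Tensor, n_experts: int) -> torch.Tensor:
    """Pack per-token expert assignments into 32-bit words, one bit per expert."""
    bm_cols = (n_experts + 31) // 32              # words needed to cover all experts
    div = topk_ids // 32                          # word index of each topk id
    rem = topk_ids % 32                           # bit index within that word
    offs = torch.arange(bm_cols, device=topk_ids.device)
    one = torch.tensor(1, dtype=torch.int64, device=topk_ids.device)
    # The fix: mask on the topk-ids directly, so invalid ids (-1) are dropped
    # regardless of how the backend rounds -1 // 32.
    valid = (topk_ids >= 0)[:, :, None]
    bits = torch.where(
        valid & (div[:, :, None] == offs[None, None, :]),
        (one << rem)[:, :, None],
        torch.zeros((), dtype=torch.int64, device=topk_ids.device),
    )
    # Topk ids within a row are distinct, so summing over the topk dimension
    # is equivalent to OR-ing the per-id bit patterns together.
    return bits.sum(dim=1)
```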

Note:

  1. I haven't tested this on Nvidia GPUs, but I don't remember seeing this behaviour there.
  2. This PR also transposes the bitmatrix tensor for efficient access by the downstream OpenAI Triton kernels (based on references in OpenAI triton_kernels).

Test Plan

vllm serve openai/gpt-oss-120b --data-parallel-size 2 --enable-expert-parallel

Eval: OPENAI_API_KEY=empty python -m gpt_oss.evals --model openai/gpt-oss-120b --eval gpqa --n-threads 512 --reasoning-effort low --base-url http://localhost:8000/v1

Test Result

vllm serve openai/gpt-oss-120b --data-parallel-size 2 --enable-expert-parallel starts up fine, i.e. no segfault.

Eval: [{'eval_name': 'gpqa', 'model_name': 'openai__gpt-oss-120b-low_temp1.0_20260302_192828', 'metric': 0.6527777777777778}]

mergify bot added the gpt-oss, rocm, and bug labels — Mar 2, 2026
github-project-automation bot moved this to Todo in AMD — Mar 2, 2026
varun-sundar-rabindranath (Contributor, Author) commented:

@zyongye @tjtanaa @mgoin @elizabetht PTAL! Thanks 🙌

gemini-code-assist bot left a comment:

Code Review

This pull request addresses a critical bug causing segmentation faults on RoCM when using expert parallelism with GPT-OSS models. The fix correctly handles invalid expert IDs by adding an explicit check, which makes the implementation more robust and platform-independent. The pull request also includes a performance optimization by transposing the bitmatrix for more efficient memory access, along with a compatibility update for the Bitmatrix constructor. The changes are well-justified and correctly implemented.

mergify bot commented Mar 2, 2026:

Hi @varun-sundar-rabindranath, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing? mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

```
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint
```

Review thread on the bitmatrix allocation (diff excerpt; the single-line shape appears to be the old allocation, replaced by the transposed one):

```diff
-    (n_rows, bm_cols), dtype=torch.uint32, device=topk_ids.device
+    (bm_cols, triton.cdiv(n_rows, 32) * 32),
+    dtype=torch.uint32,
+    device=topk_ids.device,
+)
```
varun-sundar-rabindranath (Contributor, Author) commented on the diff:

Defensively use 32 directly: aliasing with BLOCK_SIZE_K would lead to a wrong definition of the bitmatrix if BLOCK_SIZE_K changes.
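For illustration, a sketch of the defensive allocation (assuming `bm_cols`, `n_rows`, and `topk_ids` as in the excerpt above; here given illustrative values so the snippet runs standalone):

```python
import torch
import triton

n_rows, n_experts = 1024, 128           # illustrative sizes
bm_cols = triton.cdiv(n_experts, 32)    # uint32 words needed to cover all experts
topk_ids = torch.empty((n_rows, 4), dtype=torch.int64, device="cuda")

# Transposed layout: one row per 32-expert word, with the row count padded to
# a multiple of 32 (presumably for aligned access by the downstream Triton
# kernels). The literal 32 is intentional: tying it to BLOCK_SIZE_K would
# silently change the bitmatrix definition if the block size were ever tuned.
bitmatrix_storage = torch.empty(
    (bm_cols, triton.cdiv(n_rows, 32) * 32),
    dtype=torch.uint32,
    device=topk_ids.device,
)
```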

tlrmchlsmth (Member) left a comment:

Nice work tracking this down.

I'd like to understand if this also affects CUDA, or if we could be causing performance regressions in some cases.

Tracking this down further:

> The current code has the assumption that -1 // 32 will evaluate to -1 and the comparison will fail. However, I see that this may not be the case always. On RoCM, I see -1 // 32 evaluates to 0 and results in setting the expert -1 % 32 = 31st expert. This is incorrect.

This confusion appears to be caused by a difference in behavior between Triton and Python/NumPy, so it should affect both NVIDIA and AMD GPUs:
https://triton-lang.org/main/python-api/triton-semantics.html#differences-with-numpy
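For reference, a minimal illustration of the two integer-division conventions in play (Python floors toward negative infinity; C-style semantics truncate toward zero; which convention a backend applies to `//` is exactly the portability hazard here):

```python
import math

# Python's // floors toward negative infinity:
assert -1 // 32 == -1
assert -1 % 32 == 31

# Truncation toward zero gives a different word index for negative ids:
def trunc_div(a: int, b: int) -> int:
    return math.trunc(a / b)

assert trunc_div(-1, 32) == 0
```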

varun-sundar-rabindranath (Contributor, Author) replied:

> I'd like to understand if this also affects CUDA, or if we could be causing performance regressions in some cases.

Yes. I can check on CUDA 👍

varun-sundar-rabindranath (Contributor, Author) followed up:
Checked on CUDA: there, -1 // 32 => -1 and 1 // 32 => 0, as assumed by the kernel. It is only on RoCM that the behaviour is different.

cc @tlrmchlsmth

github-project-automation bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements — Mar 12, 2026
tlrmchlsmth added the ready label — Mar 12, 2026
tlrmchlsmth enabled auto-merge (squash) — March 12, 2026 20:51
auto-merge was automatically disabled — March 16, 2026 20:48 (head branch was pushed to by a user without write access)

Varun Sundar Rabindranath added 2 commits — April 2, 2026 14:17
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
mergify bot commented Apr 7, 2026:

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @varun-sundar-rabindranath.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label — Apr 7, 2026
varun-sundar-rabindranath (Contributor, Author) commented:

fixed in #38504

github-project-automation bot moved this from Todo to Done in AMD — Apr 9, 2026

Labels

bug · gpt-oss · needs-rebase · ready · rocm

Projects

Status: Done
