[Bugfix][RoCM] GPT-OSS + Expert Parallel #35791
varun-sundar-rabindranath wants to merge 2 commits into vllm-project:main
Conversation
@zyongye @tjtanaa @mgoin @elizabetht PTAL! Thanks 🙌
Code Review
This pull request addresses a critical bug causing segmentation faults on RoCM when using expert parallelism with GPT-OSS models. The fix correctly handles invalid expert IDs by adding an explicit check, which makes the implementation more robust and platform-independent. The pull request also includes a performance optimization by transposing the bitmatrix for more efficient memory access, along with a compatibility update for the Bitmatrix constructor. The changes are well-justified and correctly implemented.
Hi @varun-sundar-rabindranath, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
```diff
-        (n_rows, bm_cols), dtype=torch.uint32, device=topk_ids.device
+        (bm_cols, triton.cdiv(n_rows, 32) * 32),
+        dtype=torch.uint32,
+        device=topk_ids.device,
     )
```
Defensively use 32 directly here. Aliasing it with BLOCK_SIZE_K will lead to a wrong definition of the bitmatrix if BLOCK_SIZE_K ever changes.
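To make this concrete, here is a hedged sketch of the allocation (the sizes and the meaning of `bm_cols` are assumptions based on the diff above, not taken from the PR): the 32 is the bit width of a `uint32` word of the bitmatrix, a fixed property of the storage format, whereas `BLOCK_SIZE_K` is a tunable tile size.

```python
import torch
import triton

# Illustrative sizes, not values from the PR.
n_rows, n_experts = 1024, 128

bm_cols = triton.cdiv(n_experts, 32)  # assumed: one uint32 word per 32 experts

# Hard-coding 32 keeps the layout tied to the uint32 word width. Reusing a
# tuning constant like BLOCK_SIZE_K here would silently corrupt the layout
# the moment that tile size is retuned.
bitmatrix = torch.empty(
    (bm_cols, triton.cdiv(n_rows, 32) * 32),  # rows padded to a multiple of 32
    dtype=torch.uint32,
    device="cuda",
)
```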
Nice work tracking this down.
I'd like to understand if this also affects CUDA, or if we could be causing performance regressions in some cases.
Tracking this down further:
The current code assumes that -1 // 32 will evaluate to -1 and the comparison will fail. However, this is not always the case: on RoCM, I see -1 // 32 evaluate to 0, which results in setting the -1 % 32 = 31st expert. This is incorrect.
This appears to be caused by a difference in integer-division behavior between Triton and Python/NumPy, so it should affect both NVIDIA and AMD GPUs:
https://triton-lang.org/main/python-api/triton-semantics.html#differences-with-numpy
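For reference, a small pure-Python illustration of the gap (per the link above, Triton's `//` truncates toward zero like C, while Python's `//` floors):

```python
# Python floor division rounds toward negative infinity, so an invalid
# id stays negative and a comparison against offs >= 0 fails as intended.
assert -1 // 32 == -1

# C-style / Triton-style integer division truncates toward zero.
def trunc_div(a: int, b: int) -> int:
    return int(a / b)  # emulate truncation for small ints

# Under truncation the invalid id aliases word 0 of the bitmatrix...
assert trunc_div(-1, 32) == 0
# ...and -1 % 32 == 31 under Python semantics, i.e. the 31st expert's bit.
assert -1 % 32 == 31
```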
Yes. I can check on CUDA 👍
Checked on CUDA. cc @tlrmchlsmth
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
This pull request has merge conflicts that must be resolved before it can be merged.
fixed in #38504 |
Purpose
On a local source-build using `python setup.py develop` on RoCM, gpt-oss + expert-parallel segfaults.

Repro command:

```
vllm serve openai/gpt-oss-120b --data-parallel-size 2 --enable-expert-parallel
```

Why: When using expert-parallel, the invalid topk-ids are marked as `-1`. The bitmatrix construction kernel is supposed to ignore these. This is achieved by comparing `topk_id // 32` with `offs` (which are all >= 0) before writing into the bitmatrix at vllm/vllm/model_executor/layers/fused_moe/gpt_oss_triton_kernels_moe.py (line 100 at 2a9e334).

The current code assumes that `-1 // 32` will evaluate to `-1` and the comparison will fail. However, this is not always the case: on RoCM, I see `-1 // 32` evaluate to `0`, which results in setting the `-1 % 32 = 31`st expert. This is incorrect.

Fix: Robustify the `invalid_expert` check by including the topk-ids directly in the masking operation.

Note:
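For intuition, a minimal NumPy sketch of the intended bitmatrix semantics with the explicit invalid-expert check; the function name, loop structure, and sample inputs are illustrative only — the real implementation is the Triton kernel in gpt_oss_triton_kernels_moe.py:

```python
import numpy as np

def build_bitmatrix(topk_ids: np.ndarray, n_experts: int) -> np.ndarray:
    """Illustrative bitmatrix build: row = token, bit e % 32 of word e // 32
    is set when the token routes to expert e. Invalid slots are marked -1
    and must be skipped explicitly rather than via floor-division semantics."""
    n_words = (n_experts + 31) // 32
    bitmatrix = np.zeros((topk_ids.shape[0], n_words), dtype=np.uint32)
    for row, ids in enumerate(topk_ids):
        for e in ids:
            if e < 0:  # the explicit invalid-expert check this PR adds
                continue
            bitmatrix[row, e // 32] |= np.uint32(1) << np.uint32(e % 32)
    return bitmatrix

# Tokens with an invalid (-1) slot, as expert parallel produces for
# experts owned by another rank.
tk = np.array([[0, 5, -1], [31, 32, 33]], dtype=np.int64)
print(build_bitmatrix(tk, n_experts=64))
# Without the check, truncating division would map -1 to word 0 and could
# spuriously set the 31st expert's bit.
```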
Test Plan
server:

```
vllm serve openai/gpt-oss-120b --data-parallel-size 2 --enable-expert-parallel
```

eval:

```
OPENAI_API_KEY=empty python -m gpt_oss.evals --model openai/gpt-oss-120b --eval gpqa --n-threads 512 --reasoning-effort low --base-url http://localhost:8000/v1
```

Test Result

`vllm serve openai/gpt-oss-120b --data-parallel-size 2 --enable-expert-parallel` starts up fine, i.e. no segfault.

eval:

```
[{'eval_name': 'gpqa', 'model_name': 'openai__gpt-oss-120b-low_temp1.0_20260302_192828', 'metric': 0.6527777777777778}]
```