[Bugfix] Gracefully disable AllReduceFusionPass on GPUs without multicast support by haosdent · Pull Request #35085 · vllm-project/vllm

haosdent · 2026-02-23T08:03:51Z

Purpose

Fixes #34891: On GPUs without NVSwitch (e.g., H200/H100 with NVLink bridge-only or PCIe topologies), AllReduceFusionPass.__init__() crashes with RuntimeError: [SymmDeviceMemory] Device does not support multicasting when flashinfer_comm.create_allreduce_fusion_workspace() tries to create a SymmDeviceMemory that requires multicast support.

This PR wraps the workspace creation call in a try/except RuntimeError that logs a warning and returns early, leaving the pass in its default disabled=True state. This follows the identical graceful degradation pattern used by SymmMemCommunicator in symm_mem.py. The model falls back to non-fused allreduce and starts normally.
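The degradation pattern described above can be sketched in isolation. This is a minimal illustration, not the code in allreduce_rms_fusion.py: `FusionPassSketch`, `no_multicast_workspace`, and `working_workspace` are hypothetical stand-ins, and the case-insensitive substring check on the message is an assumption. Filtering on the multicast message follows the later commit message on this PR, which notes that other RuntimeErrors are re-raised.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("allreduce_fusion_sketch")


def no_multicast_workspace() -> dict:
    """Hypothetical stand-in that fails like the call in the bug report."""
    raise RuntimeError("[SymmDeviceMemory] Device does not support multicasting.")


def working_workspace() -> dict:
    """Hypothetical stand-in for a successful workspace creation."""
    return {"ready": True}


class FusionPassSketch:
    """Toy pass that stays disabled unless its workspace can be created."""

    def __init__(self, create_workspace: Callable[[], dict]) -> None:
        self.disabled = True  # default state; only flipped on full success
        self.workspace: Optional[dict] = None
        try:
            self.workspace = create_workspace()
        except RuntimeError as e:
            # Assumption: a case-insensitive substring check on the message.
            if "multicast" not in str(e).lower():
                raise  # unrelated failures still surface
            logger.warning("Fused allreduce unavailable, falling back: %s", e)
            return  # leave the pass disabled
        self.disabled = False
```

Constructing `FusionPassSketch(no_multicast_workspace)` leaves `disabled` set to True after logging a warning, while any RuntimeError without "multicast" in its message propagates to the caller.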

Test Plan

Test Result

gemini-code-assist

Code Review

This pull request effectively addresses the RuntimeError that occurs on GPUs without NVSwitch support by wrapping the flashinfer_comm.create_allreduce_fusion_workspace call in a try-except block. The implementation correctly logs a warning and gracefully disables the AllReduceFusionPass, ensuring that the model falls back to non-fused allreduce without crashing. This approach significantly improves the robustness of the system on diverse hardware configurations.

vllm/compilation/passes/fusion/allreduce_rms_fusion.py

…cast support Fixes vllm-project#34891: Wrap flashinfer workspace creation in try/except to prevent crash on GPUs without NVSwitch (e.g., NVLink bridge-only or PCIe topologies). The pass is gracefully disabled with a warning instead of crashing the process. Signed-off-by: haosdent <haosdent@gmail.com>

…cast support Fixes vllm-project#34891: Wrap flashinfer workspace creation in try/except to prevent crash on GPUs without NVSwitch (e.g., NVLink bridge-only or PCIe topologies). Only the specific multicast-related RuntimeError is caught; other RuntimeErrors are re-raised. The pass is gracefully disabled with a warning instead of crashing the process. Signed-off-by: haosdent <haosdent@gmail.com>

…cast support (vllm-project#35085) Signed-off-by: haosdent <haosdent@gmail.com>

…cast support (vllm-project#35085) Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: Andrii Skliar <askliar@nvidia.com>

## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All fixes are from upstream vLLM `main` and address critical bugs affecting RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:** [INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|------------|------|---------|
| 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models |
| 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend |
| 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) |
| 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) |
| 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models |
| 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support |
| 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) |
| 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint |

## Conflict resolutions

<details>
<summary>#1 - llama-nemotron-embed / score-template support (vllm-project#30550): Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.

</details>

<details>
<summary>#2 - Triton Attention (vllm-project#31406): Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.

</details>

<details>
<summary>#3 - Llama-4 attn quant (vllm-project#34243): Clean cherry-pick, no conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but the fix targets a self-contained block.

</details>

<details>
<summary>#4 - GPT-OSS multi-turn (vllm-project#34454): Clean cherry-pick, no conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored imports in `gptoss_reasoning_parser.py`. The fix logic (adding `eom_token_id` early-exit check in `is_reasoning_end`) was independent of the import changes.

</details>

<details>
<summary>#5 - Fused MoE int32 overflow (vllm-project#34507): Conflicts in 2 files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30 intermediate upstream commits refactored `fused_moe_kernel` with conditional `naive_block_assignment` logic that doesn't exist in `rhai/0.13.0`. Resolved by keeping our simpler code and applying only the int64 cast fix:

- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()` result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)` before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on `make_dummy_moe_config()` from intermediate refactors. Resolved by keeping our existing test code (no test changes).

</details>

<details>
<summary>#6 - AllReduceFusionPass multicast (vllm-project#35085): Conflict due to file rename + API change</summary>

Upstream moved `collective_fusion.py` → `compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to `create_allreduce_fusion_workspace()`. Resolved by applying the try/except wrapper around our existing `trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in `collective_fusion.py`. The error handling logic (catching RuntimeError with "multicast" in message, logging warning, returning early) is identical to upstream.

</details>

<details>
<summary>#7 - response_format validation for completions (vllm-project#35456): Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and `chat_completion/protocol.py`. Our branch still has the monolithic `protocol.py`. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/completion/protocol.py`
- Manually adding `validate_response_format` model_validator to `CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which doesn't exist in our branch; `ValueError` is already handled as 400 Bad Request in `serving_engine.py`)
- Test additions from upstream applied cleanly to `test_completion_error.py`

</details>

<details>
<summary>#8 - response_format validation for chat completions (vllm-project#35510): Conflict due to file restructuring</summary>

Same file restructuring issue as #7. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding `validate_response_format` model_validator to `ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema` test from the conflict (discarding ~140 lines of intermediate upstream tests that reference non-existent paths in our branch)

</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the backported score-template / bidirectional model support
- [ ] Verify Llama-4 quantized model loads correctly with int8/fp8 attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32 overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast) gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without `json_schema` field returns 400 (not 500) for both `/v1/completions` and `/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with Triton attention backend on ROCm

[INFERENG-4743]: https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4800]: https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4746]: https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5032]: https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5038]: https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5106]: https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

…cast support (vllm-project#35085) Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: EricccYang <yangyang4991@gmail.com>

haosdent requested review from ProExpertProg, youkaichao and zou3519 as code owners February 23, 2026 08:03

mergify bot added the bug Something isn't working label Feb 23, 2026

gemini-code-assist bot reviewed Feb 23, 2026

haosdent mentioned this pull request Feb 23, 2026

[Bug]: RuntimeError: [SymmDeviceMemory] Device does not support multicasting. #34891

Closed

haosdent changed the title ~~[WIP][Bugfix] Gracefully disable AllReduceFusionPass on GPUs without multicast support~~ [Bugfix] Gracefully disable AllReduceFusionPass on GPUs without multicast support Feb 23, 2026

ProExpertProg reviewed Feb 23, 2026

vllm/compilation/passes/fusion/allreduce_rms_fusion.py (review thread resolved)

ProExpertProg approved these changes Feb 25, 2026

ProExpertProg enabled auto-merge (squash) February 25, 2026 02:00

github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 25, 2026

zou3519 reviewed Feb 25, 2026

vllm/compilation/passes/fusion/allreduce_rms_fusion.py (review thread resolved)

zou3519 approved these changes Feb 25, 2026

haosdent added 2 commits February 25, 2026 11:33

auto-merge was automatically disabled February 25, 2026 03:33
Head branch was pushed to by a user without write access

haosdent force-pushed the fix-34891 branch from 6692fcd to 74d8949 Compare February 25, 2026 03:33

ProExpertProg enabled auto-merge (squash) February 25, 2026 14:24

vllm-bot merged commit 0788ff0 into vllm-project:main Feb 25, 2026
57 of 59 checks passed

wzhao18 mentioned this pull request Feb 25, 2026

[Kernel] Refactor FlashInfer allreduce for mnnvl backend #34109

Merged

5 tasks

haanjack pushed a commit to haanjack/vllm that referenced this pull request Feb 26, 2026

[Bugfix] Gracefully disable AllReduceFusionPass on GPUs without multi…

e2342ed

…cast support (vllm-project#35085) Signed-off-by: haosdent <haosdent@gmail.com>

tom-zju pushed a commit to tom-zju/vllm that referenced this pull request Feb 26, 2026

[Bugfix] Gracefully disable AllReduceFusionPass on GPUs without multi…

02592fc

…cast support (vllm-project#35085) Signed-off-by: haosdent <haosdent@gmail.com>

ehfd mentioned this pull request Feb 26, 2026

[Bug]: Qwen3.5 (NVIDIA H200) Pointer argument (at 0) cannot be accessed from Triton #35390

Closed

1 task

vitush93 mentioned this pull request Feb 27, 2026

[Bug]: qwen3-coder-next inference randomly hangs, accuracy degradation in 0.16.0+ with TP > 1 and fuse_allreduce_rms=False (H100s on PCIe) #35504

Closed

1 task

llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026

[Bugfix] Gracefully disable AllReduceFusionPass on GPUs without multi…

b6cdb35

…cast support (vllm-project#35085) Signed-off-by: haosdent <haosdent@gmail.com>

benchislett mentioned this pull request Mar 2, 2026

[Bug] H100 PCIe: RuntimeError '[SymmDeviceMemory] Device does not support multicasting' when running Qwen3.5-122B with TP=2 #35724

Closed

tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026

[Bugfix] Gracefully disable AllReduceFusionPass on GPUs without multi…

c020cc1

…cast support (vllm-project#35085) Signed-off-by: haosdent <haosdent@gmail.com>

Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026

[Bugfix] Gracefully disable AllReduceFusionPass on GPUs without multi…

62d58d1

…cast support (vllm-project#35085) Signed-off-by: haosdent <haosdent@gmail.com>

khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 26, 2026

[Bugfix] Gracefully disable AllReduceFusionPass on GPUs without multi…

ee6830a

…cast support (vllm-project#35085) Signed-off-by: haosdent <haosdent@gmail.com>

vllm-bot merged 2 commits into vllm-project:main from haosdent:fix-34891
