[Bug] Fix FlashInfer allreduce fusion workspace uninitialized error#37461

Merged
ProExpertProg merged 8 commits into vllm-project:main from wzhao18:wzhao/fix-fi-ar-fusion-workspace on Mar 20, 2026

Conversation

Contributor

@wzhao18 wzhao18 commented Mar 18, 2026

Purpose

Fix #37468

Currently, the FlashInfer allreduce fusion workspace is created in AllReduceFusionPass.__init__. However, when torch.compile loads the compiled module directly from cache, it skips running the passes, so the workspace is never initialized; the kernel, which expects the workspace to be in place, then fails when called. This PR fixes that by also initializing the workspace in the kernel code, call_trtllm_fused_allreduce_norm, when it has not already been initialized.

Test Plan

vllm serve MiniMaxAI/MiniMax-M2.5 --trust-remote-code --stream-interval 20 --no-enable-prefix-caching --tensor-parallel-size 2

Error on main:

[multiproc_executor.py:932]   File "/vllm/vllm/compilation/piecewise_backend.py", line 197, in compiled_graph_wrapper
[multiproc_executor.py:932]     graph_output = compiled_graph(*args)
[multiproc_executor.py:932]                    ^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_inductor/standalone_compile.py", line 312, in __call__
[multiproc_executor.py:932]     return self.inner_fn(*args)
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/aot_compile_types.py", line 211, in __call__
[multiproc_executor.py:932]     return self.compiled_fn(*args, **kwargs)
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/aot_autograd_result.py", line 679, in forward
[multiproc_executor.py:932]     return compiled_fn(list(runtime_args))
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2311, in __call__
[multiproc_executor.py:932]     return self.compiled_fn(*args, **kwargs)
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 582, in runtime_wrapper
[multiproc_executor.py:932]     all_outs = call_func_at_runtime_with_args(
[multiproc_executor.py:932]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args
[multiproc_executor.py:932]     out = normalize_as_list(f(args))
[multiproc_executor.py:932]                             ^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 785, in wrapper
[multiproc_executor.py:932]     return compiled_fn(runtime_args)
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 682, in __call__
[multiproc_executor.py:932]     return self.current_callable(inputs)
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3444, in run
[multiproc_executor.py:932]     out = model(new_inputs)
[multiproc_executor.py:932]           ^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/tmp/torchinductor_weizha/2s/c2sm7zcxi4ch574czlqwjjrbozol5vx2qx6wexgv4zxnlp6ysdsb.py", line 1085, in call
[multiproc_executor.py:932]     torch.ops.vllm.flashinfer_trtllm_fused_allreduce_norm.default(allreduce_in=buf0, residual=buf2, norm_out=buf1, quant_out=None, scale_out=None, rms_gamma=arg3_1, rms_eps=1e-06, pattern_code=1, world_size=2, launch_with_pdl=True, fp32_acc=True, max_token_num=8192)
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_ops.py", line 871, in __call__
[multiproc_executor.py:932]     return self._op(*args, **kwargs)
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_compile.py", line 54, in inner
[multiproc_executor.py:932]     return disable_fn(*args, **kwargs)
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1272, in _fn
[multiproc_executor.py:932]     return fn(*args, **kwargs)
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 411, in __torch_dispatch__
[multiproc_executor.py:932]     res = func(*args, **kwargs)
[multiproc_executor.py:932]           ^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_ops.py", line 871, in __call__
[multiproc_executor.py:932]     return self._op(*args, **kwargs)
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/vllm/compilation/passes/fusion/allreduce_rms_fusion.py", line 143, in call_trtllm_fused_allreduce_norm
[multiproc_executor.py:932]     assert workspace is not None, (
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932] AssertionError: Flashinfer workspace must be initialized when using flashinfer

Test Result

The error goes away with the fix.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the bug Something isn't working label Mar 18, 2026
@wzhao18 wzhao18 force-pushed the wzhao/fix-fi-ar-fusion-workspace branch from 75c475b to 067c3bf on March 18, 2026 16:56
@wzhao18
Contributor Author

wzhao18 commented Mar 18, 2026

cc: @hjjq

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a bug where the FlashInfer allreduce fusion workspace was not initialized when the compiled module was loaded from cache, since loading from cache skips the fusion passes. The changes add a new function, _initialize_fi_ar_workspaces, to handle workspace initialization, modify call_trtllm_fused_allreduce_norm to lazily initialize the workspace if it is not already set up, and update the initialization logic in AllReduceFusionPass.__init__ to use the new function. I have identified a critical issue where workspace initialization might fail silently, leading to unexpected behavior, and have provided a code suggestion to address it.

"Failed to initialize FlashInfer All Reduce workspace: %s. "
"AllReduce fusion pass will be disabled.",
e,
)
Contributor

critical

The return False statement within the _initialize_fi_ar_workspaces function can lead to silent failures if the workspace initialization fails. It's crucial to propagate this failure to prevent the code from proceeding with an uninitialized workspace, which could lead to incorrect results or crashes. Raising a RuntimeError will ensure that the failure is explicitly handled.

                return False
            except Exception as e:
                if "multicast" in str(e).lower():
                    logger.warning_once(
                        "AllReduce fusion pass is disabled: flashinfer workspace "
                        "creation failed: %s. This is expected on GPUs without "
                        "NVSwitch (e.g., NVLink bridge-only or PCIe topologies). "
                        "Falling back to non-fused allreduce.",
                        str(e),
                    )
                else:
                    logger.warning_once(
                        "Failed to initialize FlashInfer All Reduce workspace: %s. "
                        "AllReduce fusion pass will be disabled.",
                        e,
                    )
                raise RuntimeError("Failed to initialize FlashInfer All Reduce workspace") from e # Raise RuntimeError to prevent silent failure
        return True

Contributor Author

I double-checked, and I think the failure is properly propagated.

@mergify

mergify bot commented Mar 18, 2026

Hi @wzhao18, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Contributor

@hjjq hjjq left a comment


Thanks! LGTM

root and others added 2 commits March 18, 2026 11:24
Signed-off-by: root <root@prenyx0169.a51.clusters.nvidia.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by:  <>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
@wzhao18 wzhao18 force-pushed the wzhao/fix-fi-ar-fusion-workspace branch from b707224 to 4930908 on March 18, 2026 18:24
@robertgshaw2-redhat robertgshaw2-redhat added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 19, 2026
@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) March 19, 2026 20:37
Collaborator

@ProExpertProg ProExpertProg left a comment


Thanks for fixing this; it will also help vllm IR, so I'm looking forward to it.

e,
)
return
if not _initialize_fi_ar_workspaces(
Collaborator

Nice, can we refactor this so that the get_workspace functions allocate the proper workspaces, instead of this elaborate double-get approach?

Contributor Author

Hi @ProExpertProg, can you elaborate on how you want to refactor this? There is a double get now because we may need to allocate separate workspaces for the trtllm and mnnvl backends respectively, as mnnvl does not support quant fusion.

Contributor Author

@wzhao18 wzhao18 Mar 19, 2026


Do you mean have a unified get_fi_ar_workspace function for both backends, and remove _initialize_fi_ar_workspaces? I'll see how this can be cleaned up a bit.
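The unified-get idea discussed here could be sketched roughly as follows. The function name get_fi_ar_workspace comes from the discussion, but the backend keys, sizes, and list-based "allocation" are illustrative assumptions, not vLLM's real implementation:

```python
# Hypothetical sketch: one get function that allocates lazily per backend,
# replacing a separate initialize step plus a second lookup.
_workspaces: dict = {}


def get_fi_ar_workspace(backend: str, size: int):
    """Return the workspace for `backend`, allocating it on first request.

    One cache entry per backend lets e.g. a 'trtllm' workspace (which
    supports quant fusion) coexist with an 'mnnvl' one (which does not).
    """
    if backend not in _workspaces:
        _workspaces[backend] = [0] * size  # stand-in for real allocation
    return _workspaces[backend]
```

With this shape, both the fusion pass and the kernel wrapper can call the same getter, and whichever runs first performs the allocation.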

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
auto-merge was automatically disabled March 20, 2026 01:51

Head branch was pushed to by a user without write access

@mergify mergify bot added the nvidia label Mar 20, 2026
@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 20, 2026
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Collaborator

@ProExpertProg ProExpertProg left a comment


Yep, this is what I had in mind, thanks for the fix & refactor! Two more nits

)
return

self.supports_quant_fusion = (
Collaborator

We should warn if this fails as well; lack of quant fusion means lower performance.

Contributor Author

Sounds good, will add.

Contributor Author

Updated. Let me know if anything is missing.

wzhao18 and others added 4 commits March 19, 2026 19:26
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
@ProExpertProg ProExpertProg merged commit 0140eaf into vllm-project:main Mar 20, 2026
70 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Mar 20, 2026
chooper26 pushed a commit to intellistream/vllm-hust that referenced this pull request Mar 21, 2026
…llm-project#37461)

Signed-off-by: root <root@prenyx0169.a51.clusters.nvidia.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: <>
Co-authored-by: root <root@prenyx0169.a51.clusters.nvidia.com>
Co-authored-by: root <root@prenyx0042.a51.clusters.nvidia.com>

Labels

bug Something isn't working nvidia ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: FlashInfer allreduce fusion workspace uninitialized error

5 participants