[Bug] Fix FlashInfer allreduce fusion workspace uninitialized error#37461

Merged
ProExpertProg merged 8 commits into vllm-project:main from wzhao18:wzhao/fix-fi-ar-fusion-workspace on Mar 20, 2026

Conversation

Contributor

@wzhao18 wzhao18 commented Mar 18, 2026

Purpose

Fix #37468

Currently, the FlashInfer allreduce fusion workspace is created in AllReduceFusionPass.__init__. However, when torch.compile loads the compiled module directly from cache, it skips running the passes, so the workspace is never initialized; the kernel, which expects the workspace to be in place, then fails when called. This PR fixes that by also initializing the workspace in the kernel code, call_trtllm_fused_allreduce_norm, when it has not already been initialized.

Test Plan

vllm serve MiniMaxAI/MiniMax-M2.5 --trust-remote-code --stream-interval 20 --no-enable-prefix-caching --tensor-parallel-size 2

Error on main:

[multiproc_executor.py:932]   File "/vllm/vllm/compilation/piecewise_backend.py", line 197, in compiled_graph_wrapper
[multiproc_executor.py:932]     graph_output = compiled_graph(*args)
[multiproc_executor.py:932]                    ^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_inductor/standalone_compile.py", line 312, in __call__
[multiproc_executor.py:932]     return self.inner_fn(*args)
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/aot_compile_types.py", line 211, in __call__
[multiproc_executor.py:932]     return self.compiled_fn(*args, **kwargs)
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/aot_autograd_result.py", line 679, in forward
[multiproc_executor.py:932]     return compiled_fn(list(runtime_args))
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2311, in __call__
[multiproc_executor.py:932]     return self.compiled_fn(*args, **kwargs)
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 582, in runtime_wrapper
[multiproc_executor.py:932]     all_outs = call_func_at_runtime_with_args(
[multiproc_executor.py:932]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args
[multiproc_executor.py:932]     out = normalize_as_list(f(args))
[multiproc_executor.py:932]                             ^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 785, in wrapper
[multiproc_executor.py:932]     return compiled_fn(runtime_args)
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 682, in __call__
[multiproc_executor.py:932]     return self.current_callable(inputs)
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3444, in run
[multiproc_executor.py:932]     out = model(new_inputs)
[multiproc_executor.py:932]           ^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/tmp/torchinductor_weizha/2s/c2sm7zcxi4ch574czlqwjjrbozol5vx2qx6wexgv4zxnlp6ysdsb.py", line 1085, in call
[multiproc_executor.py:932]     torch.ops.vllm.flashinfer_trtllm_fused_allreduce_norm.default(allreduce_in=buf0, residual=buf2, norm_out=buf1, quant_out=None, scale_out=None, rms_gamma=arg3_1, rms_eps=1e-06, pattern_code=1, world_size=2, launch_with_pdl=True, fp32_acc=True, max_token_num=8192)
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_ops.py", line 871, in __call__
[multiproc_executor.py:932]     return self._op(*args, **kwargs)
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_compile.py", line 54, in inner
[multiproc_executor.py:932]     return disable_fn(*args, **kwargs)
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1272, in _fn
[multiproc_executor.py:932]     return fn(*args, **kwargs)
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 411, in __torch_dispatch__
[multiproc_executor.py:932]     res = func(*args, **kwargs)
[multiproc_executor.py:932]           ^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/.venv/lib/python3.12/site-packages/torch/_ops.py", line 871, in __call__
[multiproc_executor.py:932]     return self._op(*args, **kwargs)
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932]   File "/vllm/vllm/compilation/passes/fusion/allreduce_rms_fusion.py", line 143, in call_trtllm_fused_allreduce_norm
[multiproc_executor.py:932]     assert workspace is not None, (
[multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^
[multiproc_executor.py:932] AssertionError: Flashinfer workspace must be initialized when using flashinfer

Test Result

The error goes away with the fix.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the bug Something isn't working label Mar 18, 2026
@wzhao18 wzhao18 force-pushed the wzhao/fix-fi-ar-fusion-workspace branch from 75c475b to 067c3bf on March 18, 2026 16:56
@wzhao18
Contributor Author

wzhao18 commented Mar 18, 2026

cc: @hjjq

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a bug where the FlashInfer allreduce fusion workspace was not initialized when the compiled module was loaded from cache, since loading from cache skips the fusion passes. The changes add a new function, _initialize_fi_ar_workspaces, to handle workspace initialization, modify call_trtllm_fused_allreduce_norm to lazily initialize the workspace if it is not already set up, and update the initialization logic in AllReduceFusionPass.__init__ to use the new function. I have identified a critical issue where workspace initialization might fail silently, leading to unexpected behavior, and have provided a code suggestion to address it.

"Failed to initialize FlashInfer All Reduce workspace: %s. "
"AllReduce fusion pass will be disabled.",
e,
)
Contributor

critical

The return False statement within the _initialize_fi_ar_workspaces function can lead to silent failures if the workspace initialization fails. It's crucial to propagate this failure to prevent the code from proceeding with an uninitialized workspace, which could lead to incorrect results or crashes. Raising a RuntimeError will ensure that the failure is explicitly handled.

                return False
            except Exception as e:
                if "multicast" in str(e).lower():
                    logger.warning_once(
                        "AllReduce fusion pass is disabled: flashinfer workspace "
                        "creation failed: %s. This is expected on GPUs without "
                        "NVSwitch (e.g., NVLink bridge-only or PCIe topologies). "
                        "Falling back to non-fused allreduce.",
                        str(e),
                    )
                else:
                    logger.warning_once(
                        "Failed to initialize FlashInfer All Reduce workspace: %s. "
                        "AllReduce fusion pass will be disabled.",
                        e,
                    )
                raise RuntimeError("Failed to initialize FlashInfer All Reduce workspace") from e # Raise RuntimeError to prevent silent failure
        return True

Contributor Author

I double-checked, and I think the failure is properly propagated.

@mergify

mergify bot commented Mar 18, 2026

Hi @wzhao18, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Contributor

@hjjq hjjq left a comment


Thanks! LGTM

root and others added 2 commits March 18, 2026 11:24
Signed-off-by: root <root@prenyx0169.a51.clusters.nvidia.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by:  <>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
@wzhao18 wzhao18 force-pushed the wzhao/fix-fi-ar-fusion-workspace branch from b707224 to 4930908 on March 18, 2026 18:24
@robertgshaw2-redhat robertgshaw2-redhat added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 19, 2026
@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) March 19, 2026 20:37
Collaborator

@ProExpertProg ProExpertProg left a comment


Thanks for fixing this; it will also help vllm IR, so I'm looking forward to it.

e,
)
return
if not _initialize_fi_ar_workspaces(
Collaborator

Nice, can we refactor this so that the get_workspace functions allocate the proper workspaces, instead of this elaborate double-get approach?

Contributor Author

Hi @ProExpertProg, can you elaborate on how you want to refactor this? There is a double get now because we may need to allocate separate workspaces for the trtllm and mnnvl backends respectively, as mnnvl does not support quant fusion.

Contributor Author

@wzhao18 wzhao18 Mar 19, 2026


Do you mean have a unified get_fi_ar_workspace function for both backends, and remove _initialize_fi_ar_workspaces? I'll see how this can be cleaned up a bit.
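The unified-get idea discussed here could be sketched roughly as follows. The function name get_fi_ar_workspace comes from the discussion, but the backend keys, sizes, and list-based "allocation" are illustrative assumptions, not vLLM's real implementation:

```python
# Hypothetical sketch: one get function that allocates lazily per backend,
# replacing a separate initialize step plus a second lookup.
_workspaces: dict = {}


def get_fi_ar_workspace(backend: str, size: int):
    """Return the workspace for `backend`, allocating it on first request.

    One cache entry per backend lets e.g. a 'trtllm' workspace (which
    supports quant fusion) coexist with an 'mnnvl' one (which does not).
    """
    if backend not in _workspaces:
        _workspaces[backend] = [0] * size  # stand-in for real allocation
    return _workspaces[backend]
```

With this shape, both the fusion pass and the kernel wrapper can call the same getter, and whichever runs first performs the allocation.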

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
auto-merge was automatically disabled March 20, 2026 01:51

Head branch was pushed to by a user without write access

@mergify mergify bot added the nvidia label Mar 20, 2026
@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 20, 2026
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Collaborator

@ProExpertProg ProExpertProg left a comment


Yep, this is what I had in mind, thanks for the fix & refactor! Two more nits

)
return

self.supports_quant_fusion = (
Collaborator

We should warn if this fails as well; lack of quant fusion means lower performance.

Contributor Author

Sounds good, will add.

Contributor Author

Updated. Let me know if anything is missing.

wzhao18 and others added 4 commits March 19, 2026 19:26
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
@ProExpertProg ProExpertProg merged commit 0140eaf into vllm-project:main Mar 20, 2026
70 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Mar 20, 2026
chooper26 pushed a commit to intellistream/vllm-hust that referenced this pull request Mar 21, 2026
…llm-project#37461)

Signed-off-by: root <root@prenyx0169.a51.clusters.nvidia.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: <>
Co-authored-by: root <root@prenyx0169.a51.clusters.nvidia.com>
Co-authored-by: root <root@prenyx0042.a51.clusters.nvidia.com>

Labels

bug Something isn't working nvidia ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: FlashInfer allreduce fusion workspace uninitialized error

5 participants