[Bug] Fix FlashInfer allreduce fusion workspace uninitialized error #37461
Conversation
Force-pushed 75c475b to 067c3bf
cc: @hjjq
Code Review
This pull request addresses a bug where the FlashInfer allreduce fusion workspace was not being initialized when the compiled module was loaded from cache (which skips running the passes). The fix adds workspace initialization to the call_trtllm_fused_allreduce_norm kernel code when the workspace is not already initialized. The code changes include adding a new function _initialize_fi_ar_workspaces to handle workspace initialization and modifying call_trtllm_fused_allreduce_norm to lazily initialize the workspace if it is not already initialized. Additionally, the initialization logic in AllReduceFusionPass.__init__ is updated to use the new function. I have identified a critical issue where the workspace initialization might fail silently, leading to unexpected behavior. I've provided a code suggestion to address this.
```python
    "Failed to initialize FlashInfer All Reduce workspace: %s. "
    "AllReduce fusion pass will be disabled.",
    e,
)
```
The return False statement within the _initialize_fi_ar_workspaces function can lead to silent failures if the workspace initialization fails. It's crucial to propagate this failure to prevent the code from proceeding with an uninitialized workspace, which could lead to incorrect results or crashes. Raising a RuntimeError will ensure that the failure is explicitly handled.
```python
        return False
except Exception as e:
    if "multicast" in str(e).lower():
        logger.warning_once(
            "AllReduce fusion pass is disabled: flashinfer workspace "
            "creation failed: %s. This is expected on GPUs without "
            "NVSwitch (e.g., NVLink bridge-only or PCIe topologies). "
            "Falling back to non-fused allreduce.",
            str(e),
        )
    else:
        logger.warning_once(
            "Failed to initialize FlashInfer All Reduce workspace: %s. "
            "AllReduce fusion pass will be disabled.",
            e,
        )
    # Raise RuntimeError to prevent silent failure
    raise RuntimeError(
        "Failed to initialize FlashInfer All Reduce workspace"
    ) from e
return True
```
I double checked and I think the failure is properly propagated.
Hi @wzhao18, the pre-commit checks have failed. Please run:

```shell
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: root <root@prenyx0169.a51.clusters.nvidia.com> Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: <> Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Force-pushed b707224 to 4930908
ProExpertProg
left a comment
Thanks for fixing this; it will also help vLLM IR, so looking forward to it.
```python
    e,
)
return
if not _initialize_fi_ar_workspaces(
```
Nice, can we refactor this so that the get_workspace functions allocate the proper workspaces, instead of this elaborate double-get approach?
Hi @ProExpertProg, can you elaborate on how you want to refactor this? There is a double get now because we may need to allocate separate workspaces for the trtllm and mnnvl backends respectively, as mnnvl does not support quant fusion.
Do you mean have a unified get_fi_ar_workspace function for both backends, and remove _initialize_fi_ar_workspaces? I'll see how this can be cleaned up a bit.
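The refactor being discussed could look roughly like this: a single getter that lazily allocates and caches one workspace per backend, replacing the separate initialize-then-get steps. The function name, the dict-based cache, and the bytearray stand-in are assumptions for illustration, not the actual vLLM implementation:

```python
# Cache of lazily allocated workspaces, keyed by backend name
# (e.g. "trtllm" and "mnnvl" need separate workspaces, since mnnvl
# does not support quant fusion).
_FI_AR_WORKSPACES: dict[str, object] = {}


def get_fi_ar_workspace(backend: str, max_size: int = 8 << 20):
    """Return the allreduce workspace for `backend`, allocating on first use.

    Lazy allocation means the workspace exists even when torch.compile
    loads a cached module and the fusion passes never run.
    """
    ws = _FI_AR_WORKSPACES.get(backend)
    if ws is None:
        # Stand-in for the real FlashInfer allocation call.
        ws = bytearray(max_size)
        _FI_AR_WORKSPACES[backend] = ws
    return ws
```

With a getter like this, both the pass's __init__ and the kernel entry point can call the same function, and the first caller pays the allocation cost.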
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Head branch was pushed to by a user without write access
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
ProExpertProg
left a comment
Yep, this is what I had in mind, thanks for the fix & refactor! Two more nits
```python
    )
    return

self.supports_quant_fusion = (
```
We should warn if this failed as well. Lack of quant fusion means lower perf
sounds good. will add.
Updated. let me know if anything is missing.
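The warning requested for the quant-fusion case might look along these lines; the supports_quant_fusion attribute comes from the diff above, but the message wording and surrounding logic are assumptions:

```python
import logging

logger = logging.getLogger(__name__)


def check_quant_fusion(workspace_ok: bool) -> bool:
    """Illustrative: warn when quant fusion is unavailable, since losing
    the fused quantization path means lower performance."""
    supports_quant_fusion = workspace_ok  # stand-in for the real check
    if not supports_quant_fusion:
        logger.warning(
            "FlashInfer allreduce quant-fusion workspace could not be "
            "created; falling back to unfused quantization (lower perf).")
    return supports_quant_fusion
```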
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
…llm-project#37461)
Signed-off-by: root <root@prenyx0169.a51.clusters.nvidia.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: <>
Co-authored-by: root <root@prenyx0169.a51.clusters.nvidia.com>
Co-authored-by: root <root@prenyx0042.a51.clusters.nvidia.com>
Purpose
Fix #37468
Currently, the FlashInfer allreduce fusion workspace is created in AllReduceFusionPass.__init__. However, when torch.compile loads the compiled module directly from cache, it skips running the passes, so the workspace is never initialized, and the kernel, which expects the workspace to be in place, errors when called. This PR fixes this by also adding workspace initialization to the kernel code call_trtllm_fused_allreduce_norm when it is not already initialized.

Test Plan
Error on main:
Test Result
The error goes away with the fix.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.