[Bugfix] Fix benchmark_fused_collective crash on CustomOp init#34665
Conversation
Force-pushed from 2a301af to 28f41f9.
Code Review
This pull request fixes a crash in the fused collective benchmark by wrapping the VllmFusedAllreduce instantiation in a VllmConfig context. The fix is correct in addressing the crash. However, I've identified a critical issue with the benchmark's logic that this change highlights. The CustomOp implementations are bound at initialization, but the benchmark reuses the same object across different configurations, leading to incorrect measurements. I've left a detailed comment explaining the issue and suggesting a path to correct the benchmark's logic.
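The init-time binding described above can be illustrated with a minimal, self-contained sketch. The `Op` class and `_current_config` dict here are hypothetical stand-ins, not vLLM's actual `CustomOp` or config API:

```python
# Hypothetical sketch of why binding a forward implementation at __init__
# breaks a benchmark that toggles the config after construction.

_current_config = {"use_custom_kernel": False}  # stand-in for the vLLM config

class Op:
    def __init__(self):
        # Implementation chosen ONCE, from the config active at init time,
        # mirroring what CustomOp.dispatch_forward() does.
        if _current_config["use_custom_kernel"]:
            self.forward = lambda: "custom"
        else:
            self.forward = lambda: "native"

# Instance created under the "native" config...
op = Op()
_current_config["use_custom_kernel"] = True
# ...still runs the native path, so a native-vs-custom comparison that
# reuses this object measures the same code path twice.
print(op.forward())    # -> "native"

# Re-instantiating under the new config picks up the custom path.
print(Op().forward())  # -> "custom"
```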
Force-pushed from 28f41f9 to aa5fcf1.
Ran into this crash while benchmarking allreduce fusion on H200. Traced it to `VllmFusedAllreduce` being instantiated outside a `VllmConfig` context, and also noticed the `CustomOp` forward dispatch was only bound once regardless of the config switch. Updated to re-instantiate per config block.
Left some comments. I also encountered this.
Thanks for the fix! @mayank-ketkar-sf the suggestions from @mmangkad sound good to me, could you apply them and then we can merge?
@ProExpertProg @mmangkad of course, thank you for the feedback. I am in agreement, so I will update it and respond back today :) Cheers both
Force-pushed from 138cb04 to 78c679c.
@ProExpertProg @mmangkad could you add the CI tag if all looks okay to merge?
Force-pushed from 55716c8 to d0791ed.
@ProExpertProg shall we merge this? |
Sorry about the back and forth; this is my first PR here, it should be better next time. I will rebase on latest and push again.
…dispatch

`VllmFusedAllreduce` creates `RMSNorm` and `QuantFP8` (both `CustomOp` subclasses) in `__init__`. `CustomOp.__init__` calls `dispatch_forward()`, which requires `get_current_vllm_config()`. Without a config context, the benchmark crashes with:

`AssertionError: Current vLLM config is not set`

Additionally, `CustomOp` binds the forward method at init time based on the active config (native vs custom kernel). Creating the object once and reusing it across different config contexts meant the native-vs-custom comparison was incorrect.

Fix: instantiate `VllmFusedAllreduce` inside each `set_current_vllm_config` block so the forward dispatch matches the intended configuration.

Signed-off-by: Mayank Ketkar <mketkar@zoox.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Mayank Ketkar <mketkar@zoox.com>
Signed-off-by: Mayank Ketkar <mayket04@gmail.com>
Address review feedback from mmangkad: use `op_suffix` instead of `suffix +=` to avoid corrupting the `suffix` variable across inner loop iterations.

The bug caused incorrect result keys like:

`"standard_allreduce_custom_rms_norm_native_quant_fp8_custom_quant_fp8"`

Fixed by creating a new `op_suffix` variable that concatenates the base suffix with the quant suffix, leaving the outer loop's `suffix` unchanged.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Mayank Ketkar <mketkar@zoox.com>
Signed-off-by: Mayank Ketkar <mayket04@gmail.com>
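The suffix bug is easy to reproduce in isolation. This sketch uses illustrative key names matching the commit message, not the benchmark's actual loop:

```python
# Minimal reproduction of the review-feedback bug: `suffix +=` inside the
# inner loop mutates the outer loop's variable, so later keys accumulate
# every previous quant suffix; a fresh `op_suffix` per iteration fixes it.

def buggy_keys():
    keys = []
    suffix = "_custom_rms_norm"
    for quant in ("_native_quant_fp8", "_custom_quant_fp8"):
        suffix += quant                 # corrupts the outer-loop variable
        keys.append("standard_allreduce" + suffix)
    return keys

def fixed_keys():
    keys = []
    suffix = "_custom_rms_norm"
    for quant in ("_native_quant_fp8", "_custom_quant_fp8"):
        op_suffix = suffix + quant      # fresh variable per iteration
        keys.append("standard_allreduce" + op_suffix)
    return keys

# Second key wrongly contains BOTH quant suffixes:
print(buggy_keys()[1])  # -> standard_allreduce_custom_rms_norm_native_quant_fp8_custom_quant_fp8
print(fixed_keys()[1])  # -> standard_allreduce_custom_rms_norm_custom_quant_fp8
```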
Head branch was pushed to by a user without write access
Force-pushed from d0791ed to f7e2be3.
@ProExpertProg Auto-merge was disabled because I pushed new commits. Could you re-enable it when you get a chance?
…project#34665) Signed-off-by: Mayank Ketkar <mketkar@zoox.com> Signed-off-by: Mayank Ketkar <mayket04@gmail.com> Co-authored-by: Mayank Ketkar <mketkar@zoox.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Fix two bugs in
benchmarks/kernels/benchmark_fused_collective.py:Crash:
VllmFusedAllreducecreatesRMSNormandQuantFP8(CustomOpsubclasses) which calldispatch_forward()→get_current_vllm_config()during__init__. Without an activeVllmConfigcontext, the benchmark crashes immediately.Incorrect dispatch:
CustomOpbinds its forward method (native PyTorch vs CUDA custom kernel) at init time. The original code createdVllmFusedAllreduceonce and reused it across differentcustom_opsconfigurations, meaning the native-vs-custom comparison was measuring the same code path.Fix
Instantiate `VllmFusedAllreduce` inside each `set_current_vllm_config()` block so the forward dispatch matches the intended configuration.

Steps to Reproduce (crash)
```shell
# On any multi-GPU setup with vLLM installed:
torchrun --nproc_per_node=2 benchmarks/kernels/benchmark_fused_collective.py \
    --num-tokens 128 512 --hidden-dim 4096 --quant-modes none
```

Before fix: crashes with `AssertionError: Current vLLM config is not set`

After fix: the benchmark runs, correctly dispatching native vs custom kernel per config
Test plan
- `torchrun --nproc_per_node=2 benchmarks/kernels/benchmark_fused_collective.py --num-tokens 128 512 --hidden-dim 4096 --quant-modes none` completes without crash
- Results show distinct `_native_rms_norm` vs `_custom_rms_norm` variants