
[Bugfix] Fix benchmark_fused_collective crash on CustomOp init#34665

Merged
ProExpertProg merged 3 commits into vllm-project:main from mayank-ketkar-sf:fix-benchmark-fused-collective
Feb 20, 2026

Conversation


@mayank-ketkar-sf mayank-ketkar-sf commented Feb 17, 2026

Summary

Fix two bugs in benchmarks/kernels/benchmark_fused_collective.py:

  1. Crash: VllmFusedAllreduce creates RMSNorm and QuantFP8 (CustomOp subclasses), which call dispatch_forward() (and therefore get_current_vllm_config()) during __init__. Without an active VllmConfig context, the benchmark crashes immediately.

  2. Incorrect dispatch: CustomOp binds its forward method (native PyTorch vs CUDA custom kernel) at init time. The original code created VllmFusedAllreduce once and reused it across different custom_ops configurations, meaning the native-vs-custom comparison was measuring the same code path.
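The init-time binding pitfall can be illustrated with a toy model (TinyOp and _current_config are stand-ins for illustration, not vLLM's actual CustomOp API):

```python
# Hypothetical minimal model of init-time dispatch: the forward path is
# chosen once, in __init__, from whatever config is active at that moment.
_current_config = {"use_custom_kernel": False}

class TinyOp:
    def __init__(self):
        # Bind forward at construction time, analogous to dispatch_forward().
        if _current_config["use_custom_kernel"]:
            self._forward = lambda x: ("custom", x)
        else:
            self._forward = lambda x: ("native", x)

    def forward(self, x):
        return self._forward(x)

op = TinyOp()                        # created while the "native" config is active
_current_config["use_custom_kernel"] = True
print(op.forward(1)[0])              # still "native": the config switch is ignored
print(TinyOp().forward(1)[0])        # "custom": a fresh instance sees the new config
```

Reusing `op` across config switches, as the original benchmark did, means every measurement exercises the path bound at first construction.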

Fix

Instantiate VllmFusedAllreduce inside each set_current_vllm_config() block so the forward dispatch matches the intended configuration.
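The corrected loop shape can be sketched with stand-in names (set_config and FusedOp are illustrative mocks, not vLLM's set_current_vllm_config or VllmFusedAllreduce):

```python
# Sketch of the fixed benchmark structure: construct the op inside each
# config context so its frozen dispatch matches the configuration under test.
from contextlib import contextmanager

_active = {"custom_ops": ()}

@contextmanager
def set_config(custom_ops):
    # Mock of a config context manager that activates, then restores, a config.
    prev = _active["custom_ops"]
    _active["custom_ops"] = custom_ops
    try:
        yield
    finally:
        _active["custom_ops"] = prev

class FusedOp:
    def __init__(self):
        # Dispatch is frozen here, so construction must happen under the
        # config it is meant to benchmark.
        self.mode = "custom" if "+rms_norm" in _active["custom_ops"] else "native"

results = {}
for custom_ops in [("+rms_norm",), ("-rms_norm",)]:
    with set_config(custom_ops):
        op = FusedOp()  # instantiated per config block, per the fix
        results[custom_ops] = op.mode
```

With the instance created inside the `with` block, the two configurations genuinely exercise different code paths instead of sharing one bound at first construction.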

Steps to Reproduce (crash)

# On any multi-GPU setup with vLLM installed:
torchrun --nproc_per_node=2 benchmarks/kernels/benchmark_fused_collective.py \
  --num-tokens 128 512 --hidden-dim 4096 --quant-modes none

Before fix: Crashes with AssertionError: Current vLLM config is not set
After fix: Benchmark runs, correctly dispatching native vs custom kernel per config

Test plan

  • torchrun --nproc_per_node=2 benchmarks/kernels/benchmark_fused_collective.py --num-tokens 128 512 --hidden-dim 4096 --quant-modes none completes without crash
  • Output includes distinct timings for _native_rms_norm vs _custom_rms_norm variants
  • No changes to production code — benchmark-only fix

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, a small and essential subset of CI tests that quickly catches errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added performance Performance-related issues bug Something isn't working labels Feb 17, 2026
@mayank-ketkar-sf mayank-ketkar-sf force-pushed the fix-benchmark-fused-collective branch from 2a301af to 28f41f9 Compare February 17, 2026 02:01

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request fixes a crash in the fused collective benchmark by wrapping the VllmFusedAllreduce instantiation in a VllmConfig context. The fix is correct in addressing the crash. However, I've identified a critical issue with the benchmark's logic that this change highlights. The CustomOp implementations are bound at initialization, but the benchmark reuses the same object across different configurations, leading to incorrect measurements. I've left a detailed comment explaining the issue and suggesting a path to correct the benchmark's logic.

@mayank-ketkar-sf mayank-ketkar-sf force-pushed the fix-benchmark-fused-collective branch from 28f41f9 to aa5fcf1 Compare February 17, 2026 02:04
@mayank-ketkar-sf

Ran into this crash while benchmarking allreduce fusion on H200. Traced it to VllmFusedAllreduce being instantiated outside a VllmConfig context, and also noticed the CustomOp forward dispatch was only bound once regardless of the config switch. Updated to re-instantiate per config block.

cc @ilmarkov @mmangkad for review

@mmangkad

Left some comments. I also encountered this.
cc @ProExpertProg

@ProExpertProg

Thanks for the fix! @mayank-ketkar-sf the suggestions from @mmangkad sound good to me, could you apply them and then we can merge?

@mayank-ketkar-sf

@ProExpertProg @mmangkad Of course, and thank you for the feedback. I'm in agreement, so I'll update it and respond today :) Cheers, both

@mayank-ketkar-sf mayank-ketkar-sf force-pushed the fix-benchmark-fused-collective branch 3 times, most recently from 138cb04 to 78c679c Compare February 18, 2026 18:03
@mayank-ketkar-sf

@ProExpertProg @mmangkad could you add ci tag if all looks okay to merge?

@mayank-ketkar-sf mayank-ketkar-sf force-pushed the fix-benchmark-fused-collective branch 2 times, most recently from 55716c8 to d0791ed Compare February 18, 2026 23:41
@mayank-ketkar-sf

@ProExpertProg shall we merge this?

@ProExpertProg ProExpertProg enabled auto-merge (squash) February 19, 2026 18:09
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 19, 2026
@mayank-ketkar-sf

Sorry about the back and forth; this is my first PR here, so it should go smoother next time. I'll rebase on latest and push again.

Mayank Ketkar and others added 2 commits February 19, 2026 18:16
…dispatch

VllmFusedAllreduce creates RMSNorm and QuantFP8 (both CustomOp
subclasses) in __init__. CustomOp.__init__ calls dispatch_forward()
which requires get_current_vllm_config(). Without a config context,
the benchmark crashes with:

  AssertionError: Current vLLM config is not set.

Additionally, CustomOp binds the forward method at init time based
on the active config (native vs custom kernel). Creating the object
once and reusing it across different config contexts meant the
native-vs-custom comparison was incorrect.

Fix: instantiate VllmFusedAllreduce inside each set_current_vllm_config
block so the forward dispatch matches the intended configuration.

Signed-off-by: Mayank Ketkar <mketkar@zoox.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Mayank Ketkar <mketkar@zoox.com>
Signed-off-by: Mayank Ketkar <mayket04@gmail.com>
Address review feedback from mmangkad: use op_suffix instead of suffix +=
to avoid corrupting the suffix variable across inner loop iterations.

The bug caused incorrect result keys like:
  "standard_allreduce_custom_rms_norm_native_quant_fp8_custom_quant_fp8"

Fixed by creating a new op_suffix variable that concatenates the base
suffix with the quant suffix, leaving the outer loop's suffix unchanged.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Mayank Ketkar <mketkar@zoox.com>
Signed-off-by: Mayank Ketkar <mayket04@gmail.com>
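The suffix corruption described in that commit can be reproduced with plain strings (build_keys_buggy and build_keys_fixed are illustrative helpers, not the benchmark's actual code):

```python
# Buggy pattern: `suffix += q` mutates the outer-loop variable, so each
# inner-loop iteration inherits every earlier quant suffix.
def build_keys_buggy(base, quants):
    keys = []
    suffix = base
    for q in quants:
        suffix += q          # corrupts suffix for the next iteration
        keys.append(suffix)
    return keys

# Fixed pattern: a fresh op_suffix per iteration leaves `suffix` untouched.
def build_keys_fixed(base, quants):
    keys = []
    suffix = base
    for q in quants:
        op_suffix = suffix + q
        keys.append(op_suffix)
    return keys

quants = ["_native_quant_fp8", "_custom_quant_fp8"]
# Buggy: the last key accumulates both quant suffixes, matching the
# corrupted result key quoted in the commit message.
print(build_keys_buggy("standard_allreduce_custom_rms_norm", quants)[-1])
print(build_keys_fixed("standard_allreduce_custom_rms_norm", quants)[-1])
```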
auto-merge was automatically disabled February 19, 2026 18:16

Head branch was pushed to by a user without write access

@mayank-ketkar-sf mayank-ketkar-sf force-pushed the fix-benchmark-fused-collective branch from d0791ed to f7e2be3 Compare February 19, 2026 18:16
@mayank-ketkar-sf

@ProExpertProg Auto-merge was disabled because I pushed new commits. Could you re-enable it when you get a chance? Thanks! Apologies, I'm still learning the ropes here.

@ProExpertProg ProExpertProg merged commit 648951a into vllm-project:main Feb 20, 2026
13 checks passed
jmamou pushed a commit to jmamou/vllm that referenced this pull request Feb 23, 2026
…project#34665)

Signed-off-by: Mayank Ketkar <mketkar@zoox.com>
Signed-off-by: Mayank Ketkar <mayket04@gmail.com>
Co-authored-by: Mayank Ketkar <mketkar@zoox.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
…project#34665)

Signed-off-by: Mayank Ketkar <mketkar@zoox.com>
Signed-off-by: Mayank Ketkar <mayket04@gmail.com>
Co-authored-by: Mayank Ketkar <mketkar@zoox.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
…project#34665)

Signed-off-by: Mayank Ketkar <mketkar@zoox.com>
Signed-off-by: Mayank Ketkar <mayket04@gmail.com>
Co-authored-by: Mayank Ketkar <mketkar@zoox.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
askliar pushed a commit to askliar/vllm that referenced this pull request Mar 9, 2026
…project#34665)

Signed-off-by: Mayank Ketkar <mketkar@zoox.com>
Signed-off-by: Mayank Ketkar <mayket04@gmail.com>
Co-authored-by: Mayank Ketkar <mketkar@zoox.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026
…project#34665)

Signed-off-by: Mayank Ketkar <mketkar@zoox.com>
Signed-off-by: Mayank Ketkar <mayket04@gmail.com>
Co-authored-by: Mayank Ketkar <mketkar@zoox.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>