[torch.compile] Turn on silu+fp4 quant fusion by default for O1+ (#34718)
Merged
ProExpertProg merged 2 commits into vllm-project:main on Feb 18, 2026
Conversation
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Contributor
Code Review
This pull request enables the SiLU-Mul + NVFP4 quantization fusion by default at optimization levels O1 and higher. The change updates the enable_act_fusion function in vllm/config/vllm.py to also return True for NVFP4-quantized models, which allows the fuse_act_quant pass to run for them; the test results show a corresponding performance improvement. The logic is sound and the change is well contained. I have no major concerns with this pull request.
Collaborator
The test failures are genuine failures.
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Collaborator
Author
Good catch, thanks.
jasonozuzu-cohere pushed a commit to jasonozuzu-cohere/vllm that referenced this pull request on Feb 18, 2026
…m-project#34718) Signed-off-by: Luka Govedič <lgovedic@redhat.com> Signed-off-by: Jason Ozuzu <jasonozuzu@cohere.com>
ZJY0516 pushed a commit to ZJY0516/vllm that referenced this pull request on Feb 23, 2026
…m-project#34718) Signed-off-by: Luka Govedič <lgovedic@redhat.com> Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
llsj14 pushed a commit to llsj14/vllm that referenced this pull request on Mar 1, 2026
…m-project#34718) Signed-off-by: Luka Govedič <lgovedic@redhat.com>
scottgl9 added a commit to scottgl9/vllm that referenced this pull request on Mar 2, 2026
Add comprehensive performance analysis for MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10

Architecture confirmed:
- Attention IS NVFP4 in this model (ignore list = only lm_head + MoE gates)
- 3 MTP modules present (layers 62-64), the biggest performance lever available
- Per-step weight load: ~6.15 GB → 36–44 tok/s theoretical ceiling on GB10

Performance gap analysis:
- Current: 24 tok/s on Strix Halo (AMD); GB10 expected similar baseline
- vLLM is 1.78x slower than SGLang at BS=1 for NVFP4 MoE (documented gap)
- Gap sources: activation quant overhead, kernel launch overhead, no fused shuffle+reduce in MoE, generic CUTLASS configs

Key new PRs to integrate:
- vllm-project#35041 (OPEN): MTP+NVFP4 weight shape mismatch, required for MTP+NVFP4
- vllm-project#35442 (OPEN): Non-blocking MTP token copy, 6ms→200µs CPU-GPU sync
- vllm-project#33303 (OPEN): MiniMax PP+DP for multi-Spark scaling

Already-merged PRs confirmed in HEAD:
- vllm-project#34718 (act_quant_fusion.py): SiLU+FP4 fusion
- vllm-project#34899 (allreduce_rms_fusion.py): NVFP4 AR+Norm fusion
- vllm-project#30885: 8x4 SF tiling (not yet effective on GB10, TRTLLM backend blocked)
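The ceiling quoted in that commit (~6.15 GB per step → 36–44 tok/s) follows from a simple bandwidth-bound decode model: each generated token must stream the active weights from memory once. A hedged arithmetic check, assuming a GB10 unified-memory bandwidth of roughly 273 GB/s (a spec figure that is not stated in the commit itself):

```python
# Hedged arithmetic check of the commit's decode-throughput ceiling.
# Assumption not in the source: GB10 memory bandwidth ~273 GB/s.
bandwidth_gb_s = 273.0
weights_per_step_gb = 6.15  # per-step weight load, from the commit message

ceiling_tok_s = bandwidth_gb_s / weights_per_step_gb
print(round(ceiling_tok_s, 1))        # ideal tok/s if decode is purely weight-bound
print(round(0.8 * ceiling_tok_s, 1))  # at ~80% achieved bandwidth
```

The two numbers bracket the commit's 36–44 tok/s range, consistent with reading it as "perfect streaming" down to "realistic achieved bandwidth".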
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request on Mar 4, 2026
…m-project#34718) Signed-off-by: Luka Govedič <lgovedic@redhat.com>
scottgl9 added a commit to scottgl9/vllm that referenced this pull request on Mar 4, 2026
askliar pushed a commit to askliar/vllm that referenced this pull request on Mar 9, 2026
…m-project#34718) Signed-off-by: Luka Govedič <lgovedic@redhat.com> Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Purpose
Turn on SiLU-Mul + NVFP4 quant fusion by default for optimization levels O1 and higher. This is a simple pointwise fusion with low impact on startup time and a good perf improvement. Note that this does not help MoE models as long as `fused_moe` is wrapped (#31985) (EDIT: the fused kernel is already called manually in #31832).

Test Plan
No new functionality, just new default behavior. E2E perf sweep, startup sweep, and lm_eval.
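To make the fused op concrete before the results: the pass replaces a SiLU-Mul activation followed by a separate quantization kernel with a single fused kernel, so the high-precision intermediate never round-trips through memory. A toy pure-Python sketch of the equivalence (the clamp-to-integer "fp4" here is a stand-in, not real NVFP4 with block scales):

```python
import math

def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))

def silu_mul(gate, up):
    # SiLU-Mul: silu(gate) * up, the activation in gated MLPs.
    return [silu(g) * u for g, u in zip(gate, up)]

def quantize_fp4(xs, scale):
    # Toy 4-bit quantization: clamp to [-7, 7] integer levels.
    # Real NVFP4 uses block scaling factors; this is a stand-in.
    return [max(-7, min(7, round(x / scale))) for x in xs]

def silu_mul_quant_fused(gate, up, scale):
    # Fused path: one loop, no high-precision intermediate buffer.
    return [max(-7, min(7, round(silu(g) * u / scale)))
            for g, u in zip(gate, up)]

gate, up, scale = [1.0, -2.0, 0.5], [2.0, 1.0, 4.0], 0.25
assert quantize_fp4(silu_mul(gate, up), scale) == silu_mul_quant_fused(gate, up, scale)
print(silu_mul_quant_fused(gate, up, scale))  # [6, -1, 5]
```

Since the fusion is purely pointwise, the fused and unfused paths are bit-identical in result; the win is avoiding one kernel launch and one memory round trip per MLP, which is why it is cheap enough to enable by default at O1.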
Test Result
nvidia/Llama-3.1-8B-Instruct-NVFP4 (tp=1)

Eval

PR
local-completions (pretrained=nvidia/Llama-3.1-8B-Instruct-NVFP4,base_url=http://0.0.0.0:8869/v1/completions,num_concurrent=50,max_retries=3), gen_kwargs: (None), limit: 100.0, num_fewshot: 5, batch_size: auto

Main
local-completions (pretrained=nvidia/Llama-3.1-8B-Instruct-NVFP4,base_url=http://0.0.0.0:8869/v1/completions,num_concurrent=50,max_retries=3), gen_kwargs: (None), limit: 100.0, num_fewshot: 5, batch_size: auto

Perf
Startup
nvidia/Llama-3.3-70B-Instruct-NVFP4 (tp=4)

Perf
Startup
Eval TBD