
[torch.compile] Turn on silu+fp4 quant fusion by default for O1+ (#34718)

Merged
ProExpertProg merged 2 commits into vllm-project:main from neuralmagic:luka/silu-quant-o1-fp4 on Feb 18, 2026
Conversation


@ProExpertProg ProExpertProg commented Feb 17, 2026

Purpose

Turn on SiLU-Mul + NVFP4 quant fusion by default for optimization levels O1 and higher. This is a simple pointwise fusion with low impact on startup time and a good perf improvement. Note that this does not help MoE models as long as fused_moe is wrapped (#31985) (EDIT: fused kernel is called manually in #31832 already).
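To illustrate the kind of rewrite this pass performs, here is a minimal, self-contained sketch of a pointwise SiLU-Mul followed by quantization, in both unfused and fused form. This is a scalar toy with a hypothetical per-tensor scale, not vLLM's actual NVFP4 block-scaled quant kernel; the function names and the simplified `round(x / scale)` quantization are illustrative assumptions.

```python
import math

def silu(x: float) -> float:
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def silu_mul_then_quant(gate, up, scale):
    # Unfused: two passes over the data -- the full activation
    # tensor is materialized, then quantized in a second kernel.
    act = [silu(g) * u for g, u in zip(gate, up)]
    return [round(a / scale) for a in act]

def fused_silu_mul_quant(gate, up, scale):
    # Fused: one pass, no intermediate tensor. This is the shape
    # of rewrite a pattern-matching fusion pass applies.
    return [round(silu(g) * u / scale) for g, u in zip(gate, up)]

gate, up, scale = [0.5, -1.0, 2.0], [1.0, 2.0, 0.5], 0.25
assert silu_mul_then_quant(gate, up, scale) == fused_silu_mul_quant(gate, up, scale)
```

Because both versions compute the same values, the fusion only changes memory traffic and kernel count, which is why the eval results below are identical between PR and main.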

Test Plan

No new functionality, just new default behavior. E2E perf sweep, startup sweep, and lm_eval.

Test Result

nvidia/Llama-3.1-8B-Instruct-NVFP4 (tp=1)

Eval

PR

local-completions (pretrained=nvidia/Llama-3.1-8B-Instruct-NVFP4,base_url=http://0.0.0.0:8869/v1/completions,num_concurrent=50,max_retries=3), gen_kwargs: (None), limit: 100.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.63 | ± 0.0485 |
| | | strict-match | 5 | exact_match | 0.61 | ± 0.0490 |
Main

local-completions (pretrained=nvidia/Llama-3.1-8B-Instruct-NVFP4,base_url=http://0.0.0.0:8869/v1/completions,num_concurrent=50,max_retries=3), gen_kwargs: (None), limit: 100.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.63 | ± 0.0485 |
| | | strict-match | 5 | exact_match | 0.61 | ± 0.0490 |

Perf

(TPOT p20 and TTFT p20 comparison charts omitted.)

Startup

| | Cold Total (p50) | Cold Compile (p50) | Warm Total (p50) | Warm Compile (p50) |
|---|---|---|---|---|
| fused | 40.529 | 12.949 | 20.585 | 3.451 |
| unfused | 37.967 | 12.71 | 24.659 | 3.213 |
| fused (% vs unfused) | 6.75% | 1.88% | -16.52% | 7.41% |

nvidia/Llama-3.3-70B-Instruct-NVFP4 (tp=4)

Perf

(TPOT p20 and TTFT p20 comparison charts omitted.)

Startup

| | Cold Total (p50) | Cold Compile (p50) | Warm Total (p50) | Warm Compile (p50) |
|---|---|---|---|---|
| fused | 88.502 | 0.0 | 60.664 | 0.0 |
| unfused | 89.093 | 0.0 | 61.029 | 0.0 |
| fused (% vs unfused) | -0.66% | nan% | -0.6% | nan% |

Eval TBD

Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request enables the SiLU-Mul + NVFP4 quantization fusion by default for optimization levels O1 and higher. The change is implemented by updating the enable_act_fusion function in vllm/config/vllm.py to also return `True` for NVFP4-quantized models. This allows the fuse_act_quant pass to run for these models, which can lead to performance improvements as shown in the test results. The logic is sound and the change is well-contained. I have no major concerns with this pull request.

Member

@mgoin mgoin left a comment


LGTM, thanks Luka

@mgoin added the performance, ready, torch.compile, and nvidia labels on Feb 17, 2026
@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Feb 17, 2026
@robertgshaw2-redhat
Collaborator

the test failures are genuine failures

Signed-off-by: Luka Govedič <lgovedic@redhat.com>
@ProExpertProg
Collaborator Author

Good catch, thanks

@ProExpertProg ProExpertProg enabled auto-merge (squash) February 17, 2026 17:56
@ProExpertProg ProExpertProg moved this from To triage to In progress in torch.compile integration Feb 17, 2026
@ProExpertProg ProExpertProg moved this from In progress to In review in torch.compile integration Feb 17, 2026
@ProExpertProg ProExpertProg merged commit 02e8f26 into vllm-project:main Feb 18, 2026
52 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in torch.compile integration Feb 18, 2026
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Feb 18, 2026
jasonozuzu-cohere pushed a commit to jasonozuzu-cohere/vllm that referenced this pull request Feb 18, 2026
…m-project#34718)

Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Jason Ozuzu <jasonozuzu@cohere.com>
ZJY0516 pushed a commit to ZJY0516/vllm that referenced this pull request Feb 23, 2026
…m-project#34718)

Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
scottgl9 added a commit to scottgl9/vllm that referenced this pull request Mar 2, 2026
Add comprehensive performance analysis for MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10:

Architecture confirmed:
- Attention IS NVFP4 in this model (ignore list = only lm_head + MoE gates)
- 3 MTP modules present (layers 62-64) — biggest performance lever available
- Per-step weight load: ~6.15 GB → 36–44 tok/s theoretical ceiling on GB10

Performance gap analysis:
- Current: 24 tok/s on Strix Halo (AMD); GB10 expected similar baseline
- vLLM is 1.78x slower than SGLang at BS=1 for NVFP4 MoE (documented gap)
- Gap sources: activation quant overhead, kernel launch overhead, no fused
  shuffle+reduce in MoE, generic CUTLASS configs

Key new PRs to integrate:
- vllm-project#35041 (OPEN): MTP+NVFP4 weight shape mismatch — required for MTP+NVFP4
- vllm-project#35442 (OPEN): Non-blocking MTP token copy — 6ms→200µs CPU-GPU sync
- vllm-project#33303 (OPEN): MiniMax PP+DP for multi-Spark scaling

Already-merged PRs confirmed in HEAD:
- vllm-project#34718 (act_quant_fusion.py): SiLU+FP4 fusion
- vllm-project#34899 (allreduce_rms_fusion.py): NVFP4 AR+Norm fusion
- vllm-project#30885: 8x4 SF tiling (not yet effective on GB10 — TRTLLM backend blocked)
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
scottgl9 added a commit to scottgl9/vllm that referenced this pull request Mar 4, 2026
(same commit message as above)
askliar pushed a commit to askliar/vllm that referenced this pull request Mar 9, 2026
…m-project#34718)

Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Andrii Skliar <askliar@nvidia.com>
