
[torch.compile] Turn on silu+fp4 quant fusion by default for O1+ (#34718)

Merged
ProExpertProg merged 2 commits into vllm-project:main from neuralmagic:luka/silu-quant-o1-fp4 on Feb 18, 2026
Conversation


@ProExpertProg ProExpertProg commented Feb 17, 2026

Purpose

Turn on SiLU-Mul + NVFP4 quant fusion by default for optimization levels O1 and higher. This is a simple pointwise fusion with low impact on startup time and a good perf improvement. Note that this does not help MoE models as long as fused_moe is wrapped (#31985) (EDIT: fused kernel is called manually in #31832 already).
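To illustrate the kind of rewrite this pass performs, here is a minimal, self-contained sketch of a pointwise SiLU-Mul followed by quantization, in both unfused and fused form. This is a scalar toy with a hypothetical per-tensor scale, not vLLM's actual NVFP4 block-scaled quant kernel; the function names and the simplified `round(x / scale)` quantization are illustrative assumptions.

```python
import math

def silu(x: float) -> float:
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def silu_mul_then_quant(gate, up, scale):
    # Unfused: two passes over the data -- the full activation
    # tensor is materialized, then quantized in a second kernel.
    act = [silu(g) * u for g, u in zip(gate, up)]
    return [round(a / scale) for a in act]

def fused_silu_mul_quant(gate, up, scale):
    # Fused: one pass, no intermediate tensor. This is the shape
    # of rewrite a pattern-matching fusion pass applies.
    return [round(silu(g) * u / scale) for g, u in zip(gate, up)]

gate, up, scale = [0.5, -1.0, 2.0], [1.0, 2.0, 0.5], 0.25
assert silu_mul_then_quant(gate, up, scale) == fused_silu_mul_quant(gate, up, scale)
```

Because both versions compute the same values, the fusion only changes memory traffic and kernel count, which is why the eval results below are identical between PR and main.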

Test Plan

No new functionality, just new default behavior. E2E perf sweep, startup sweep, and lm_eval.

Test Result

nvidia/Llama-3.1-8B-Instruct-NVFP4 (tp=1)

Eval

PR

local-completions (pretrained=nvidia/Llama-3.1-8B-Instruct-NVFP4,base_url=http://0.0.0.0:8869/v1/completions,num_concurrent=50,max_retries=3), gen_kwargs: (None), limit: 100.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.63 | ± 0.0485 |
| | | strict-match | 5 | exact_match | 0.61 | ± 0.0490 |
Main

local-completions (pretrained=nvidia/Llama-3.1-8B-Instruct-NVFP4,base_url=http://0.0.0.0:8869/v1/completions,num_concurrent=50,max_retries=3), gen_kwargs: (None), limit: 100.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.63 | ± 0.0485 |
| | | strict-match | 5 | exact_match | 0.61 | ± 0.0490 |

Perf

(TPOT p20 and TTFT p20 comparison charts omitted.)

Startup

| | Cold Total (p50) | Cold Compile (p50) | Warm Total (p50) | Warm Compile (p50) |
|---|---|---|---|---|
| fused | 40.529 | 12.949 | 20.585 | 3.451 |
| unfused | 37.967 | 12.71 | 24.659 | 3.213 |
| fused (% vs unfused) | 6.75% | 1.88% | -16.52% | 7.41% |

nvidia/Llama-3.3-70B-Instruct-NVFP4 (tp=4)

Perf

(TPOT p20 and TTFT p20 comparison charts omitted.)

Startup

| | Cold Total (p50) | Cold Compile (p50) | Warm Total (p50) | Warm Compile (p50) |
|---|---|---|---|---|
| fused | 88.502 | 0.0 | 60.664 | 0.0 |
| unfused | 89.093 | 0.0 | 61.029 | 0.0 |
| fused (% vs unfused) | -0.66% | nan% | -0.6% | nan% |

Eval TBD

Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request enables the SiLU-Mul + NVFP4 quantization fusion by default for optimization levels O1 and higher. The change is implemented by updating the enable_act_fusion function in vllm/config/vllm.py to also return `True` for NVFP4-quantized models. This allows the fuse_act_quant pass to run for these models, which can lead to performance improvements as shown in the test results. The logic is sound and the change is well-contained. I have no major concerns with this pull request.

Member

@mgoin mgoin left a comment


LGTM, thanks Luka

@mgoin added the performance, ready, torch.compile, and nvidia labels on Feb 17, 2026
@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Feb 17, 2026
@robertgshaw2-redhat
Collaborator

the test failures are genuine failures

Signed-off-by: Luka Govedič <lgovedic@redhat.com>
@ProExpertProg
Collaborator Author

Good catch, thanks

@ProExpertProg ProExpertProg enabled auto-merge (squash) February 17, 2026 17:56
@ProExpertProg ProExpertProg moved this from To triage to In progress in torch.compile integration Feb 17, 2026
@ProExpertProg ProExpertProg moved this from In progress to In review in torch.compile integration Feb 17, 2026
@ProExpertProg ProExpertProg merged commit 02e8f26 into vllm-project:main Feb 18, 2026
52 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in torch.compile integration Feb 18, 2026
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Feb 18, 2026
jasonozuzu-cohere pushed a commit to jasonozuzu-cohere/vllm that referenced this pull request Feb 18, 2026
…m-project#34718)

Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Jason Ozuzu <jasonozuzu@cohere.com>
ZJY0516 pushed a commit to ZJY0516/vllm that referenced this pull request Feb 23, 2026
…m-project#34718)

Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
scottgl9 added a commit to scottgl9/vllm that referenced this pull request Mar 2, 2026
Add comprehensive performance analysis for MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10:

Architecture confirmed:
- Attention IS NVFP4 in this model (ignore list = only lm_head + MoE gates)
- 3 MTP modules present (layers 62-64) — biggest performance lever available
- Per-step weight load: ~6.15 GB → 36–44 tok/s theoretical ceiling on GB10

Performance gap analysis:
- Current: 24 tok/s on Strix Halo (AMD); GB10 expected similar baseline
- vLLM is 1.78x slower than SGLang at BS=1 for NVFP4 MoE (documented gap)
- Gap sources: activation quant overhead, kernel launch overhead, no fused
  shuffle+reduce in MoE, generic CUTLASS configs

Key new PRs to integrate:
- vllm-project#35041 (OPEN): MTP+NVFP4 weight shape mismatch — required for MTP+NVFP4
- vllm-project#35442 (OPEN): Non-blocking MTP token copy — 6ms→200µs CPU-GPU sync
- vllm-project#33303 (OPEN): MiniMax PP+DP for multi-Spark scaling

Already-merged PRs confirmed in HEAD:
- vllm-project#34718 (act_quant_fusion.py): SiLU+FP4 fusion
- vllm-project#34899 (allreduce_rms_fusion.py): NVFP4 AR+Norm fusion
- vllm-project#30885: 8x4 SF tiling (not yet effective on GB10 — TRTLLM backend blocked)
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
scottgl9 added a commit to scottgl9/vllm that referenced this pull request Mar 4, 2026
(same commit message as above)
askliar pushed a commit to askliar/vllm that referenced this pull request Mar 9, 2026
…m-project#34718)

Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Andrii Skliar <askliar@nvidia.com>
