[NVIDIA] Support SiluMul + NVFP4 quant fusion #23671
ProExpertProg merged 7 commits into vllm-project:main
Conversation
Code Review
This pull request introduces support for fusing SiLU+Mul with NVFP4 quantization, which is a valuable performance optimization for models running on NVIDIA GPUs with FP4 support. The changes are well-structured, including a new CUDA kernel, updates to the compilation passes for fusion, and comprehensive tests. The refactoring of the fusion pass and tests to accommodate the new pattern is clean. My review found one potential issue with pointer casting in the CUDA kernel wrapper that could lead to undefined behavior, and I've provided a suggestion to fix it. Overall, this is a solid contribution.
The pointer casts for `output_ptr` and `sf_out` are unsafe and overly complex, and the subsequent `reinterpret_cast` in the kernel launch can be avoided.

- `static_cast<int64_t*>(output.data_ptr())` is unsafe. The `output` tensor is of type `torch.uint8`, so its data buffer is not guaranteed to have the 8-byte alignment required for `int64_t*`. This can lead to undefined behavior.
- The kernel expects `uint32_t*` for both `out` and `SFout`. It's cleaner to cast directly to this type using `reinterpret_cast`.

By casting directly to `uint32_t*` when defining `output_ptr` and `sf_out`, you can simplify the kernel launch call by removing the `reinterpret_cast` there.
```cpp
void silu_and_mul_nvfp4_quant(torch::Tensor& output,  // [..., d]
                              torch::Tensor& output_sf,
                              torch::Tensor& input,  // [..., 2 * d]
                              torch::Tensor& input_sf) {
  TORCH_CHECK(input.dtype() == torch::kFloat16 ||
              input.dtype() == torch::kBFloat16);
  int32_t m = input.size(0);
  int32_t n = input.size(1) / 2;
  TORCH_CHECK(n % 16 == 0, "The N dimension must be multiple of 16.");
  int multiProcessorCount =
      get_device_attribute(cudaDevAttrMultiProcessorCount, -1);

  auto input_sf_ptr = static_cast<float const*>(input_sf.data_ptr());
  auto sf_out = reinterpret_cast<uint32_t*>(output_sf.data_ptr());
  auto output_ptr = reinterpret_cast<uint32_t*>(output.data_ptr());

  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
  auto stream = at::cuda::getCurrentCUDAStream(input.get_device());
  dim3 block(std::min(int(n / ELTS_PER_THREAD), 1024));
  int const numBlocksPerSM = 2048 / block.x;
  dim3 grid(std::min(int(m), multiProcessorCount * numBlocksPerSM));

  VLLM_DISPATCH_HALF_TYPES(
      input.scalar_type(), "act_and_mul_quant_kernel", [&] {
        auto input_ptr = reinterpret_cast<scalar_t const*>(input.data_ptr());
        VLLM_DISPATCH_BYTE_TYPES(
            output.scalar_type(), "fused_act_and_mul_quant_kernel_nvfp4_type",
            [&] {
              vllm::silu_and_cvt_fp16_to_fp4<scalar_t>
                  <<<grid, block, 0, stream>>>(m, n, input_ptr, input_sf_ptr,
                                               output_ptr, sf_out);
            });
      });
}
```
ProExpertProg left a comment:

Looks nice and clean, thanks for the refactoring! A few final comments; please create an issue for the kernel comments as a follow-up.
Could you add a `FUSED_OPS` array here as well?
Could this reference `FUSED_OPS` and `QUANT_OPS` instead?
Also, this could use `ops_in_model_before` (see other tests for how that's checked).
Force-pushed from ed4f126 to d7831c6
Update: fixed by conflict between
Force-pushed from d7831c6 to 3968b5b
Looks like it failed to create a tensor on L4: I got an L4 locally and tried creating tensors, and it worked. Is the failure related to the driver, or to something else about the L4 in CI? cc @ProExpertProg @mgoin
Signed-off-by: jindih <jindih@nvidia.com>

fix review comment

Signed-off-by: jindih <jindih@nvidia.com>

revise silu+nvfp4q pattern matching part

Signed-off-by: jindih <jindih@nvidia.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Head branch was pushed to by a user without write access
Force-pushed from 3968b5b to 8b479b2
This pull request has merge conflicts that must be resolved before it can be merged.
Revert "[NVIDIA] Support SiluMul + NVFP4 quant fusion (#…3671)"

Fixes vllm-project#23925

This reverts commit 16a45b3.
Signed-off-by: jindih <jindih@nvidia.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: jindih <jindih@nvidia.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Luka Govedic <lgovedic@redhat.com>
Purpose
Support SiluMul + NVFP4 quant fusion (following up #22448).

Add these compilation flags to enable the fusion:
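The exact flags are not shown above; as a hedged sketch, vLLM's fusion passes are typically toggled through the `-O`/`compilation_config` JSON. The option names below are assumptions drawn from vLLM's compilation config (not from this PR), and the model name is a placeholder — verify both against your vLLM version:

```shell
# Illustrative only: enable the fusion and no-op elimination passes.
# Flag names and the model are assumptions, not taken from this PR.
vllm serve <your-nvfp4-quantized-model> \
  -O '{"pass_config": {"enable_fusion": true, "enable_noop": true}}'
```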
Test Plan && Test Result
Kernel functional: `tests/kernels/quantization/test_silu_nvfp4_quant_fusion.py`
Fusion unit test: `tests/compile/test_silu_mul_quant_fusion.py`
lm_eval && benchmarking:
main:
PR:
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.