[Bugfix] Fix accuracy issue for silu_mul + nvfp4 quant fusion kernel by elvischenv · Pull Request #24833 · vllm-project/vllm

elvischenv · 2025-09-14T15:32:50Z

Purpose

Fix accuracy issue for silu_mul + nvfp4 quant fusion kernel
- In the previous PR, [NVIDIA] Support SiluMul + NVFP4 quant fusion #23671, the original author computed silu with an approximation: the sigmoid was evaluated as tanh(x/2) * 0.5 + 0.5, whereas the standard definition is 1 / (1 + e^(-x)).
- We did not observe accuracy issues in the earlier offline lm_eval evaluation (see [NVIDIA] Support SiluMul + NVFP4 quant fusion #23671). However, with online lm_eval, accuracy drops slightly.
- We fixed the issue by using the standard implementation to compute silu_mul.

vllm/csrc/quantization/fp4/activation_nvfp4_quant_fusion_kernels.cu

Lines 34 to 54 in 5679399

    
    __inline__ __device__ PackedVec<Type> compute_silu(PackedVec<Type>& vec,
                                                       PackedVec<Type>& vec2) {
      PackedVec<Type> result;
    #pragma unroll
      for (int i = 0; i < CVT_FP4_ELTS_PER_THREAD / 2; ++i) {
        if constexpr (std::is_same_v<Type, half>) {
          half2 val(0.5f, 0.5f);
          half2 t0 = __hmul2(vec.elts[i], val);
          half2 t1 = __hfma2(h2tanh(t0), val, val);
          half2 t2 = __hmul2(vec.elts[i], t1);
          result.elts[i] = __hmul2(t2, vec2.elts[i]);
        } else {
          __nv_bfloat162 val(0.5f, 0.5f);
          __nv_bfloat162 t0 = __hmul2(vec.elts[i], val);
          __nv_bfloat162 t1 = __hfma2(h2tanh(t0), val, val);
          __nv_bfloat162 t2 = __hmul2(vec.elts[i], t1);
          result.elts[i] = __hmul2(t2, vec2.elts[i]);
        }
      }
      return result;
    }

- Cleaned up activation_nvfp4_quant_fusion_kernels.cu. A separate silu_and_cvt_warp_fp16_to_fp4 is not needed; compute_silu_mul followed by cvt_warp_fp16_to_fp4 (reused from nvfp4_utils) is enough.
- Only 3 tests covered the silu_mul + quant fusion in test_silu_mul_quant_fusion.py. Improved the test coverage.
- Cleaned up the kernel test test_silu_mul_nvfp4_quant.py and removed many unnecessary components.
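For reference, the two sigmoid forms named in the Purpose section are the same function in exact arithmetic, so a gap between them can only come from how each is evaluated at reduced precision. The following is a minimal host-side float32 sketch, not the kernel code from this PR: the helper names (`sigmoid_standard`, `sigmoid_tanh`, `silu_mul_ref`, `max_sigmoid_gap`) are hypothetical, and it assumes plain `float` math from `<cmath>` rather than the `half2` / `__nv_bfloat162` intrinsics used on the device.

```cpp
#include <cassert>
#include <cmath>

// Standard definition of the sigmoid: 1 / (1 + e^(-x)).
inline float sigmoid_standard(float x) {
  return 1.0f / (1.0f + std::exp(-x));
}

// tanh form used by the earlier kernel: tanh(x/2) * 0.5 + 0.5.
inline float sigmoid_tanh(float x) {
  return std::tanh(0.5f * x) * 0.5f + 0.5f;
}

// Reference silu_mul in float32: silu(x) * y = x * sigmoid(x) * y.
// Hypothetical helper, not taken from the vLLM sources.
inline float silu_mul_ref(float x, float y) {
  return x * sigmoid_standard(x) * y;
}

// Largest absolute gap between the two sigmoid forms on a uniform grid
// of `steps` points in [lo, hi]. Requires steps >= 2.
inline float max_sigmoid_gap(float lo, float hi, int steps) {
  float worst = 0.0f;
  for (int i = 0; i < steps; ++i) {
    float t = static_cast<float>(i) / static_cast<float>(steps - 1);
    float x = lo + (hi - lo) * t;
    worst = std::fmax(worst, std::fabs(sigmoid_standard(x) - sigmoid_tanh(x)));
  }
  return worst;
}
```

In float32 the two forms agree closely over a range such as [-8, 8], which suggests (as an assumption, in line with the float32 experiments named in the commit titles) that the accuracy drop is tied to evaluating the tanh form in half or bfloat16 precision rather than to the formula itself.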

Test Plan && Test Result

Unit test:

tests/compile/test_silu_mul_quant_fusion.py

====== 24 passed, 8 skipped, 5 warnings in 11.79s =====

tests/kernels/quantization/test_silu_mul_nvfp4_quant.py

======== 8 passed in 1.44s ======

E2E online lm_eval:

main not fused:

local-completions (base_url=http://0.0.0.0:8000/v1/completions,model=nvidia/Llama-3.3-70B-Instruct-FP4,tokenized_requests=False,tokenizer_backend=None,num_concurrent=128,timeout=120,max_retries=5), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9257|±  |0.0072|
|     |       |strict-match    |     5|exact_match|↑  |0.6209|±  |0.0134|

main fused (strict-match accuracy drops slightly):

local-completions (base_url=http://0.0.0.0:8000/v1/completions,model=nvidia/Llama-3.3-70B-Instruct-FP4,tokenized_requests=False,tokenizer_backend=None,num_concurrent=128,timeout=120,max_retries=5), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9287|±  |0.0071|
|     |       |strict-match    |     5|exact_match|↑  |0.6035|±  |0.0135|

PR fused:

local-completions (base_url=http://0.0.0.0:8000/v1/completions,model=nvidia/Llama-3.3-70B-Instruct-FP4,tokenized_requests=False,tokenizer_backend=None,num_concurrent=128,timeout=120,max_retries=5), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9325|±  |0.0069|
|     |       |strict-match    |     5|exact_match|↑  |0.6255|±  |0.0133|

Perf:

main not fused:

triton_poi_fused_mul_silu_4: 4.288 μs
cvt_fp16_to_fp4: 4.544 μs

main fused:

silu_mul_cvt_fp16_to_fp4: 6.527 μs

PR fused:

silu_mul_cvt_fp16_to_fp4: 6.624 μs
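The perf numbers above can be compared with a little arithmetic. This is a small sketch under one assumption: the unfused cost is taken as the plain sum of the two kernel times, ignoring launch gaps or overlap. The constant and function names are hypothetical; the values are copied from the measurements listed above.

```cpp
#include <cassert>
#include <cmath>

// Kernel times in microseconds, copied from the Perf section.
constexpr double kUnfusedSiluMulUs = 4.288;  // triton_poi_fused_mul_silu_4
constexpr double kUnfusedQuantUs = 4.544;    // cvt_fp16_to_fp4
constexpr double kMainFusedUs = 6.527;       // silu_mul_cvt_fp16_to_fp4 on main
constexpr double kPrFusedUs = 6.624;         // silu_mul_cvt_fp16_to_fp4 in this PR

// Assumed unfused cost: the two kernels run back to back.
inline double unfused_total_us() {
  return kUnfusedSiluMulUs + kUnfusedQuantUs;
}

// Speedup of a fused kernel relative to the unfused pair.
inline double speedup_vs_unfused(double fused_us) {
  return unfused_total_us() / fused_us;
}

// Relative slowdown of new_us compared to old_us (0.01 means 1% slower).
inline double relative_overhead(double new_us, double old_us) {
  return (new_us - old_us) / old_us;
}
```

With these inputs the unfused pair totals 8.832 μs, so the fused kernel is about 1.33x faster with this PR and about 1.35x faster on main; the fix makes the fused kernel roughly 1.5% slower than the main fused version.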

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


ProExpertProg · 2025-09-15T16:03:37Z

Can you compare to the Inductor-generated fused kernel?


mgoin

Looks good to me as a bugfix, we should merge. @ProExpertProg I don't think we have nvfp4 quant in torch implemented in a serious way, so we should leave that to future work

yewentao256

LGTM, thanks for the work!

…llm-project#24833) Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>

…llm-project#24833) Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Signed-off-by: charlifu <charlifu@amd.com>


mergify bot added the ci/build label Sep 14, 2025

elvischenv changed the title ~~[Perf] Optimize silu_mul + FP8 quant fusion kernel for cuda~~ [Bugfix] Fix accuracy issue for silu_mul + nvfp4 quant fusion kernel Sep 16, 2025

elvischenv added 6 commits September 16, 2025 10:56

cleanup dup code

c29cc44

Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>

test1: silu_mul in float16

ab3b756

Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>

test2: only sigmoid in float32

370f080

Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>

test3: only silu in float32

0575a7f

Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>

test4: silu_mul in float32

5bcde4c

Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>

unit test cleanup

a2bada3

Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>

elvischenv force-pushed the elvischenv/optimize-silu-mul-quant-kernel branch from eb3cce4 to a2bada3 Compare September 16, 2025 17:56

elvischenv marked this pull request as ready for review September 16, 2025 17:56

elvischenv requested review from WoosukKwon, mgoin, tlrmchlsmth and yewentao256 as code owners September 16, 2025 17:56

mgoin approved these changes Sep 17, 2025


mgoin enabled auto-merge (squash) September 17, 2025 01:42

github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 17, 2025

mgoin added the bug (Something isn't working) and quantization labels Sep 17, 2025

elvischenv added 5 commits September 17, 2025 10:14

Merge branch 'main' into elvischenv/optimize-silu-mul-quant-kernel

32d6b25

Merge branch 'main' into elvischenv/optimize-silu-mul-quant-kernel

2426c7c

Merge branch 'main' into elvischenv/optimize-silu-mul-quant-kernel

1d8e6a3

Merge branch 'main' into elvischenv/optimize-silu-mul-quant-kernel

360962f

Merge branch 'main' into elvischenv/optimize-silu-mul-quant-kernel

f8ebea3

yewentao256 approved these changes Sep 17, 2025


Merge branch 'main' into elvischenv/optimize-silu-mul-quant-kernel

afcd155

vllm-bot merged commit e6585dd into vllm-project:main Sep 17, 2025
76 of 79 checks passed

elvischenv deleted the elvischenv/optimize-silu-mul-quant-kernel branch September 18, 2025 01:12


vllm-bot merged 12 commits into vllm-project:main from elvischenv:elvischenv/optimize-silu-mul-quant-kernel

5 participants
