Skip to content

[Bugfix] Fix accuracy issue for silu_mul + nvfp4 quant fusion kernel#24833

Merged
vllm-bot merged 12 commits intovllm-project:mainfrom
elvischenv:elvischenv/optimize-silu-mul-quant-kernel
Sep 17, 2025
Merged

[Bugfix] Fix accuracy issue for silu_mul + nvfp4 quant fusion kernel#24833
vllm-bot merged 12 commits intovllm-project:mainfrom
elvischenv:elvischenv/optimize-silu-mul-quant-kernel

Conversation

@elvischenv
Copy link
Copy Markdown
Contributor

@elvischenv elvischenv commented Sep 14, 2025

Purpose

  • Fix accuracy issue for silu_mul + nvfp4 quant fusion kernel

__inline__ __device__ PackedVec<Type> compute_silu(PackedVec<Type>& vec,
PackedVec<Type>& vec2) {
PackedVec<Type> result;
#pragma unroll
for (int i = 0; i < CVT_FP4_ELTS_PER_THREAD / 2; ++i) {
if constexpr (std::is_same_v<Type, half>) {
half2 val(0.5f, 0.5f);
half2 t0 = __hmul2(vec.elts[i], val);
half2 t1 = __hfma2(h2tanh(t0), val, val);
half2 t2 = __hmul2(vec.elts[i], t1);
result.elts[i] = __hmul2(t2, vec2.elts[i]);
} else {
__nv_bfloat162 val(0.5f, 0.5f);
__nv_bfloat162 t0 = __hmul2(vec.elts[i], val);
__nv_bfloat162 t1 = __hfma2(h2tanh(t0), val, val);
__nv_bfloat162 t2 = __hmul2(vec.elts[i], t1);
result.elts[i] = __hmul2(t2, vec2.elts[i]);
}
}
return result;
}

  • Do some cleanups for activation_nvfp4_quant_fusion_kernels.cu. We don't need an extra silu_and_cvt_warp_fp16_to_fp4 but just compute_silu_mul + cvt_warp_fp16_to_fp4(reuse from the nvfp4_utils).
  • There were only 3 tests covered the silu_mul + quant fusion in test_silu_mul_quant_fusion.py. Improved the test coverage.
  • Clean up the kernel test test_silu_mul_nvfp4_quant.py. Removed lots of unnecessary components.

Test Plan && Test Result

Unit test:

tests/compile/test_silu_mul_quant_fusion.py

====== 24 passed, 8 skipped, 5 warnings in 11.79s =====

tests/kernels/quantization/test_silu_mul_nvfp4_quant.py

======== 8 passed in 1.44s ======

E2E online lm_eval:

main not fused:

local-completions (base_url=http://0.0.0.0:8000/v1/completions,model=nvidia/Llama-3.3-70B-Instruct-FP4,tokenized_requests=False,tokenizer_backend=None,num_concurrent=128,timeout=120,max_retries=5), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9257|±  |0.0072|
|     |       |strict-match    |     5|exact_match|↑  |0.6209|±  |0.0134|

main fused(the accuracy dropped slightly):

local-completions (base_url=http://0.0.0.0:8000/v1/completions,model=nvidia/Llama-3.3-70B-Instruct-FP4,tokenized_requests=False,tokenizer_backend=None,num_concurrent=128,timeout=120,max_retries=5), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9287|±  |0.0071|
|     |       |strict-match    |     5|exact_match|↑  |0.6035|±  |0.0135|

PR fused:

local-completions (base_url=http://0.0.0.0:8000/v1/completions,model=nvidia/Llama-3.3-70B-Instruct-FP4,tokenized_requests=False,tokenizer_backend=None,num_concurrent=128,timeout=120,max_retries=5), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9325|±  |0.0069|
|     |       |strict-match    |     5|exact_match|↑  |0.6255|±  |0.0133|

Perf:

main not fused:

triton_poi_fused_mul_silu_4: 4.288 μs
cvt_fp16_to_fp4: 4.544 μs

main fused:

silu_mul_cvt_fp16_to_fp4: 6.527 μs

PR fused:

silu_mul_cvt_fp16_to_fp4: 6.624 μs

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the ci/build label Sep 14, 2025
@pytorch-bot
Copy link
Copy Markdown

pytorch-bot bot commented Sep 14, 2025

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

@ProExpertProg
Copy link
Copy Markdown
Collaborator

Can you compare to the Inductor-generated fused kernel?

@elvischenv elvischenv changed the title [Perf] Optimize silu_mul + FP8 quant fusion kernel for cuda [Bugfix] Fix accuracy issue for silu_mul + nvfp4 quant fusion kernel Sep 16, 2025
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
@elvischenv elvischenv force-pushed the elvischenv/optimize-silu-mul-quant-kernel branch from eb3cce4 to a2bada3 Compare September 16, 2025 17:56
@elvischenv elvischenv marked this pull request as ready for review September 16, 2025 17:56
Copy link
Copy Markdown
Member

@mgoin mgoin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me as a bugfix, we should merge. @ProExpertProg I don't think we have nvfp4 quant in torch implemented in a serious way, so we should leave that to future work

@mgoin mgoin enabled auto-merge (squash) September 17, 2025 01:42
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 17, 2025
@mgoin mgoin added bug Something isn't working quantization labels Sep 17, 2025
Copy link
Copy Markdown
Member

@yewentao256 yewentao256 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the work!

@vllm-bot vllm-bot merged commit e6585dd into vllm-project:main Sep 17, 2025
76 of 79 checks passed
@elvischenv elvischenv deleted the elvischenv/optimize-silu-mul-quant-kernel branch September 18, 2025 01:12
debroy-rh pushed a commit to debroy-rh/vllm that referenced this pull request Sep 19, 2025
…llm-project#24833)

Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
…llm-project#24833)

Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
…llm-project#24833)

Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: charlifu <charlifu@amd.com>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
…llm-project#24833)

Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working ci/build ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants