
Conversation

@Liu-congo (Contributor) commented Oct 9, 2025

Motivation

fix #11272

Modifications

Replace the per_tensor_quant_mla_fp8 used in DeepSeek-V2 (dsv2) with the to_float8 helper from sgl-kernel/tests/test_bmm_fp8.py.
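For context, a per-tensor FP8 cast of the kind that test uses generally amounts to scaling by `finfo.max / amax` and casting to `float8_e4m3fn`. The snippet below is a hedged sketch of that pattern, not a verbatim copy of the `to_float8` helper:

```python
import torch


def to_float8_sketch(x: torch.Tensor, dtype: torch.dtype = torch.float8_e4m3fn):
    """Per-tensor FP8 quantization sketch: returns the FP8 tensor and its inverse scale.

    Illustrative only; the actual to_float8 in sgl-kernel/tests/test_bmm_fp8.py
    may differ in details.
    """
    finfo = torch.finfo(dtype)
    # One scale for the whole tensor, chosen so max |x| maps to the FP8 max.
    amax = x.abs().max().clamp(min=1e-12)
    scale = finfo.max / amax
    x_fp8 = (x.float() * scale).clamp(min=finfo.min, max=finfo.max).to(dtype)
    return x_fp8, scale.float().reciprocal()
```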

Benchmarking

cmd: python3 test/srt/test_mla_fp8.py
Raw implementation on cuBLAS 12.8.4.1:
{'en': np.float64(0.86), 'en:std': np.float64(0.34698703145794946), 'group_latin': np.float64(0.86), 'group_latin:std': np.float64(0.34698703145794946), 'score:std': np.float64(0.34698703145794946), 'score': np.float64(0.86)}
Total latency: 26.120 s
Score: 0.860

Raw implementation on cuBLAS 12.9.1.4:
Failed.

Current implementation on cuBLAS 12.8.4.1:
{'en': np.float64(0.888), 'en:std': np.float64(0.31536645351083237), 'group_latin': np.float64(0.888), 'group_latin:std': np.float64(0.31536645351083237), 'score:std': np.float64(0.31536645351083237), 'score': np.float64(0.888)}
Total latency: 14.972 s (surprisingly fast; three additional runs gave 30 s, 26 s, and 29 s, so the true latency is probably around 28 s)
Score: 0.888

Current implementation on cuBLAS 12.9.1.4:
{'en': np.float64(0.864), 'en:std': np.float64(0.3427885645700568), 'group_latin': np.float64(0.864), 'group_latin:std': np.float64(0.3427885645700568), 'score:std': np.float64(0.3427885645700568), 'score': np.float64(0.864)}
Total latency: 25.954 s
Score: 0.864

Plan

I think the problem might be caused by the Triton-implemented per_tensor_quant_mla_fp8; further investigation is planned.

Checklist

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @Liu-congo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This PR addresses a critical bug preventing test_mla_fp8.py from running on Cublas 12.9. The fix involves replacing the per_tensor_quant_mla_fp8 function with input_to_float8 in the DeepSeekV2 model's quantization logic, ensuring compatibility and functionality across different Cublas versions. Benchmarking results confirm the successful resolution of the issue.

Highlights

  • Intent: This pull request aims to fix a bug where test_mla_fp8.py fails when running on Cublas 12.9. The issue is resolved by replacing the per_tensor_quant_mla_fp8 function with input_to_float8 in the DeepSeekV2 model implementation.
  • Changes: The primary change involves modifying python/sglang/srt/models/deepseek_v2.py. The per_tensor_quant_mla_fp8 function, previously used for FP8 quantization, has been replaced with input_to_float8 in both the forward_absorb_prepare and forward_absorb_core methods. Additionally, input_to_float8 is now imported, and a .contiguous() call has been added to the quantized tensor (q_nope_val and attn_output_val) after the input_to_float8 conversion (see the sketch after this list).
  • Reviewer Activity: No specific reviewer activity has been recorded or provided in the context for this pull request.
  • Benchmarking: Benchmarking results provided by the author confirm the fix. The original implementation failed on Cublas 12.9.1.4, while the new implementation successfully runs on both Cublas 12.8.4.1 and 12.9.1.4, showing comparable scores and latencies across versions.
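To make the Changes bullet concrete, the quantize-then-contiguous pattern it describes might look roughly like the following. Variable names and shapes here are hypothetical, and the quantization is inlined rather than copied from the PR:

```python
import torch

# Hypothetical activations standing in for q_nope in the MLA path; shapes are made up.
q_nope = torch.randn(16, 128, 512, dtype=torch.bfloat16)

# Per-tensor FP8 quantization of the transposed activations (inline stand-in for
# whatever input_to_float8 does in the PR).
finfo = torch.finfo(torch.float8_e4m3fn)
scale = finfo.max / q_nope.abs().max().clamp(min=1e-12)
q_nope_val = (q_nope.transpose(0, 1).float() * scale).clamp(finfo.min, finfo.max)

# transpose() yields a non-contiguous view, so the PR adds .contiguous() after the
# conversion before the tensor reaches the FP8 batched matmul.
q_nope_val = q_nope_val.to(torch.float8_e4m3fn).contiguous()
q_nope_scale = scale.reciprocal()
```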

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request aims to fix a bug in test_mla_fp8.py that occurs with Cublas 12.9 by replacing the Triton-based per_tensor_quant_mla_fp8 function with a PyTorch-based input_to_float8 function. The change appears to be correct and also shows performance improvements in the benchmarks provided.

I've identified a couple of areas for minor improvement related to redundant code, which I've detailed in the comments below. These changes will help improve code clarity.



# temporary fix for issue #11272
def is_nvidia_cublas_cu12_version_ge_12_9():
Collaborator

Can you please move this function to python/sglang/srt/utils/common.py?

Contributor Author

Got it. It's done.
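For reference, a cuBLAS-version gate with this name could be implemented by reading the version of the installed nvidia-cublas-cu12 wheel. The following is only a sketch based on the function name quoted above, not the code that was merged:

```python
from importlib.metadata import PackageNotFoundError, version


# temporary fix for issue #11272 (sketch; the merged helper may differ)
def is_nvidia_cublas_cu12_version_ge_12_9() -> bool:
    """Return True if the installed nvidia-cublas-cu12 wheel reports version >= 12.9."""
    try:
        major, minor, *_ = version("nvidia-cublas-cu12").split(".")
        return (int(major), int(minor)) >= (12, 9)
    except (PackageNotFoundError, ValueError):
        # Wheel not installed (e.g. CUDA comes from a system install) or the
        # version string is not in the expected form.
        return False
```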

@Fridge003 (Collaborator) left a comment

LGTM

@Fridge003 merged commit c80a96d into sgl-project:main on Oct 11, 2025
90 of 104 checks passed
@Liu-congo deleted the bmm_fp8_fix branch on October 11, 2025 at 04:26
lpc0220 pushed a commit to lpc0220/sglang that referenced this pull request Oct 29, 2025


Development

Successfully merging this pull request may close these issues.

[Bug] test_mla_fp8.py fails on Cublas 12.9
