
Conversation

@Liu-congo (Contributor) commented Oct 9, 2025

Motivation

fix #11272

Modifications

Replace the per_tensor_quant_mla_fp8 used in DeepSeek-V2 (dsv2) with the to_float8 helper from sgl-kernel/tests/test_bmm_fp8.py.
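For context, a per-tensor FP8 cast of the kind that test uses generally amounts to scaling by `finfo.max / amax` and casting to `float8_e4m3fn`. The snippet below is a hedged sketch of that pattern, not a verbatim copy of the `to_float8` helper:

```python
import torch


def to_float8_sketch(x: torch.Tensor, dtype: torch.dtype = torch.float8_e4m3fn):
    """Per-tensor FP8 quantization sketch: returns the FP8 tensor and its inverse scale.

    Illustrative only; the actual to_float8 in sgl-kernel/tests/test_bmm_fp8.py
    may differ in details.
    """
    finfo = torch.finfo(dtype)
    # One scale for the whole tensor, chosen so max |x| maps to the FP8 max.
    amax = x.abs().max().clamp(min=1e-12)
    scale = finfo.max / amax
    x_fp8 = (x.float() * scale).clamp(min=finfo.min, max=finfo.max).to(dtype)
    return x_fp8, scale.float().reciprocal()
```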

Benchmarking

cmd: python3 test/srt/test_mla_fp8.py
Raw implementation on cuBLAS 12.8.4.1:
{'en': np.float64(0.86), 'en:std': np.float64(0.34698703145794946), 'group_latin': np.float64(0.86), 'group_latin:std': np.float64(0.34698703145794946), 'score:std': np.float64(0.34698703145794946), 'score': np.float64(0.86)}
Total latency: 26.120 s
Score: 0.860

Raw implementation on cuBLAS 12.9.1.4:
Failed.

Current implementation on cuBLAS 12.8.4.1:
{'en': np.float64(0.888), 'en:std': np.float64(0.31536645351083237), 'group_latin': np.float64(0.888), 'group_latin:std': np.float64(0.31536645351083237), 'score:std': np.float64(0.31536645351083237), 'score': np.float64(0.888)}
Total latency: 14.972 s (surprisingly fast; three additional runs gave 30 s, 26 s, and 29 s, so the true latency is probably around 28 s)
Score: 0.888

Current implementation on cuBLAS 12.9.1.4:
{'en': np.float64(0.864), 'en:std': np.float64(0.3427885645700568), 'group_latin': np.float64(0.864), 'group_latin:std': np.float64(0.3427885645700568), 'score:std': np.float64(0.3427885645700568), 'score': np.float64(0.864)}
Total latency: 25.954 s
Score: 0.864

Plan

I think the problem might be caused by the Triton-implemented per_tensor_quant_mla_fp8; further investigation is planned.

Checklist

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @Liu-congo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This PR addresses a critical bug preventing test_mla_fp8.py from running on Cublas 12.9. The fix involves replacing the per_tensor_quant_mla_fp8 function with input_to_float8 in the DeepSeekV2 model's quantization logic, ensuring compatibility and functionality across different Cublas versions. Benchmarking results confirm the successful resolution of the issue.

Highlights

  • Intent: This pull request aims to fix a bug where test_mla_fp8.py fails when running on Cublas 12.9. The issue is resolved by replacing the per_tensor_quant_mla_fp8 function with input_to_float8 in the DeepSeekV2 model implementation.
  • Changes: The primary change involves modifying python/sglang/srt/models/deepseek_v2.py. The per_tensor_quant_mla_fp8 function, previously used for FP8 quantization, has been replaced with input_to_float8 in both the forward_absorb_prepare and forward_absorb_core methods. Additionally, input_to_float8 is now imported, and a .contiguous() call has been added to the quantized tensor (q_nope_val and attn_output_val) after the input_to_float8 conversion (see the sketch after this list).
  • Reviewer Activity: No specific reviewer activity has been recorded or provided in the context for this pull request.
  • Benchmarking: Benchmarking results provided by the author confirm the fix. The original implementation failed on Cublas 12.9.1.4, while the new implementation successfully runs on both Cublas 12.8.4.1 and 12.9.1.4, showing comparable scores and latencies across versions.
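To make the Changes bullet concrete, the quantize-then-contiguous pattern it describes might look roughly like the following. Variable names and shapes here are hypothetical, and the quantization is inlined rather than copied from the PR:

```python
import torch

# Hypothetical activations standing in for q_nope in the MLA path; shapes are made up.
q_nope = torch.randn(16, 128, 512, dtype=torch.bfloat16)

# Per-tensor FP8 quantization of the transposed activations (inline stand-in for
# whatever input_to_float8 does in the PR).
finfo = torch.finfo(torch.float8_e4m3fn)
scale = finfo.max / q_nope.abs().max().clamp(min=1e-12)
q_nope_val = (q_nope.transpose(0, 1).float() * scale).clamp(finfo.min, finfo.max)

# transpose() yields a non-contiguous view, so the PR adds .contiguous() after the
# conversion before the tensor reaches the FP8 batched matmul.
q_nope_val = q_nope_val.to(torch.float8_e4m3fn).contiguous()
q_nope_scale = scale.reciprocal()
```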

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request aims to fix a bug in test_mla_fp8.py that occurs with Cublas 12.9 by replacing the Triton-based per_tensor_quant_mla_fp8 function with a PyTorch-based input_to_float8 function. The change appears to be correct and also shows performance improvements in the benchmarks provided.

I've identified a couple of areas for minor improvement related to redundant code, which I've detailed in the comments below. These changes will help improve code clarity.



# temporary fix for issue #11272
def is_nvidia_cublas_cu12_version_ge_12_9():
Collaborator

Can you please move this function to python/sglang/srt/utils/common.py?

Contributor Author

Got it. It's done.
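For reference, a cuBLAS-version gate with this name could be implemented by reading the version of the installed nvidia-cublas-cu12 wheel. The following is only a sketch based on the function name quoted above, not the code that was merged:

```python
from importlib.metadata import PackageNotFoundError, version


# temporary fix for issue #11272 (sketch; the merged helper may differ)
def is_nvidia_cublas_cu12_version_ge_12_9() -> bool:
    """Return True if the installed nvidia-cublas-cu12 wheel reports version >= 12.9."""
    try:
        major, minor, *_ = version("nvidia-cublas-cu12").split(".")
        return (int(major), int(minor)) >= (12, 9)
    except (PackageNotFoundError, ValueError):
        # Wheel not installed (e.g. CUDA comes from a system install) or the
        # version string is not in the expected form.
        return False
```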

@Fridge003 (Collaborator) left a comment

LGTM

@Fridge003 merged commit c80a96d into sgl-project:main on Oct 11, 2025
90 of 104 checks passed
@Liu-congo deleted the bmm_fp8_fix branch on October 11, 2025 at 04:26
lpc0220 pushed a commit to lpc0220/sglang that referenced this pull request Oct 29, 2025


Development

Successfully merging this pull request may close these issues.

[Bug] test_mla_fp8.py fails on Cublas 12.9
