Expand deep_gemm entrypoint to support more FP8 recipes. #17294
zianglih wants to merge 1 commit into sgl-project:main
Conversation
Summary of Changes: This pull request enhances the flexibility of DeepGEMM's FP8 operations by parameterizing the block sizes and scaling-factor data types. This generalization moves away from a rigid, hardcoded configuration, enabling broader compatibility with various FP8 quantization schemes, most notably paving the way for integration with MXFP8. The changes primarily involve modifying function signatures and internal tensor-dimension calculations to adapt to these new configurable parameters.
Code Review
This pull request expands the deep_gemm entrypoint to support more FP8 recipes by exposing recipe and sf_dtype parameters. This is a necessary step for MXFP8 integration. The changes are well-contained and correctly plumb the new parameters through the compilation and execution paths of deep_gemm. I've pointed out a minor inconsistency in the warmup executor classes where sf_dtype is not consistently stored. Addressing this would improve code maintainability.
/gemini review
Code Review
This pull request expands the deep_gemm entrypoint to support more FP8 recipes by introducing recipe and sf_dtype parameters. The changes are well-contained and propagate these new parameters through the call stack, from the entrypoint functions down to the warmup executors and tensor creation helpers. This makes the implementation more flexible and prepares it for future integrations like MXFP8.
My review focuses on improving code clarity and maintainability. I've pointed out a few instances of unused variables and duplicated code blocks. Addressing these points will make the code cleaner and easier to maintain in the long run. Overall, this is a good step towards more general FP8 support.
```python
if sf_dtype is None or sf_dtype == torch.float32:
    sf_storage_elements_per_scale = 1
elif sf_dtype == torch.int:
    sf_storage_elements_per_scale = 4
else:
    raise ValueError(f"Unimplemented sf_dtype: {sf_dtype}")
```
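A standalone sketch of this branch (dtypes are represented as strings here so the snippet runs without torch; the factor of 4 for `torch.int` presumably reflects four 8-bit scales packed per 32-bit integer, which is an assumption, not something stated in the PR):

```python
def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b

def sf_storage_elements(sf_dtype=None) -> int:
    # "float32" -> one storage element per scale; "int" -> four scales
    # packed per int32 element (the packing rationale is an assumption).
    if sf_dtype is None or sf_dtype == "float32":
        return 1
    if sf_dtype == "int":
        return 4
    raise ValueError(f"Unimplemented sf_dtype: {sf_dtype}")

# Last dim of the scale tensor: ceil_div(k, block_k * storage_elements)
print(ceil_div(4096, 128 * sf_storage_elements("float32")))  # 32
print(ceil_div(4096, 32 * sf_storage_elements("int")))       # 32
```

Both recipes happen to yield the same scale-tensor width for this `k` because packing four scales per int32 exactly offsets the 4x-smaller block size.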
This logic for handling `sf_dtype` is duplicated in `_empty_block_fp8` (lines 263-268). To improve maintainability and reduce code duplication, consider extracting this logic into a shared helper function. This would also be a good place to make the handling of `sf_dtype=None` more explicit by defaulting it to `torch.float32` at the beginning of the helper.
```python
if recipe is None:
    block_n = block_k = 128
else:
    block_m, block_n, block_k = recipe
```
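Restated as a standalone sketch (note that in the original hunk `block_m` is never assigned on the `None` branch, which is the unused-variable point raised in this review; the sketch fills in the full default tuple for clarity):

```python
def resolve_recipe(recipe=None):
    # None keeps the previously hardcoded default: per-token scaling on M,
    # 128-wide quantization blocks on N and K.
    if recipe is None:
        return 1, 128, 128
    block_m, block_n, block_k = recipe
    return block_m, block_n, block_k

print(resolve_recipe())            # (1, 128, 128)
print(resolve_recipe((1, 1, 32)))  # (1, 1, 32), the MXFP8-style recipe
```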
/gemini review
Code Review
This pull request expands the deep_gemm entrypoint to support more FP8 recipes by passing recipe and sf_dtype through the call stack. This is a good step towards more flexible FP8 support. The changes are mostly correct, but I have identified a few areas for improvement. There is a high-severity issue with memory estimation for warmup, which could lead to OOM errors. The memory calculation needs to be updated to account for the new configurable parameters. Additionally, there is some code duplication that can be refactored for better maintainability, and a small improvement for code clarity by making an implicit default dtype explicit.
```python
if sf_dtype is None or sf_dtype == torch.float32:
    sf_storage_elements_per_scale = 1
elif sf_dtype == torch.int:
    sf_storage_elements_per_scale = 4
else:
    raise ValueError(f"Unimplemented sf_dtype: {sf_dtype}")
```
This logic for determining `sf_storage_elements_per_scale` is duplicated in `_empty_block_fp8` (lines 263-268). To improve maintainability and reduce code duplication, consider extracting this logic into a separate helper function.
For example:

```python
def _get_sf_storage_elements_per_scale(sf_dtype: Optional[torch.dtype]) -> int:
    if sf_dtype is None or sf_dtype == torch.float32:
        return 1
    if sf_dtype == torch.int:
        return 4
    raise ValueError(f"Unimplemented sf_dtype: {sf_dtype}")
```

You can then call this helper function in both `_empty_token_fp8` and `_empty_block_fp8`.
```diff
-    (*dims, ceil_div(k, _BLOCK_SIZE)), device="cuda", dtype=torch.float32
+    (*dims, ceil_div(k, block_k * sf_storage_elements_per_scale)),
+    device="cuda",
+    dtype=sf_dtype,
```
While passing `dtype=None` to `torch.empty` defaults to `torch.float32`, making this explicit improves code clarity and robustness. It's better to explicitly handle the `None` case for `sf_dtype`. This also applies to `_empty_block_fp8` on line 278.
```diff
-    dtype=sf_dtype,
+    dtype=sf_dtype or torch.float32,
```
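The `or` fallback suggested here can be exercised standalone (strings stand in for torch dtypes so the snippet has no dependencies); one caveat of the idiom is that `or` replaces any falsy value, which is harmless for dtypes since they are always truthy objects:

```python
def effective_dtype(sf_dtype=None, default="float32"):
    # Falls back to the default whenever sf_dtype is None (or falsy).
    return sf_dtype or default

print(effective_dtype(None))   # float32
print(effective_dtype("int"))  # int
```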
Motivation
This is a preliminary step for DeepGEMM + MXFP8 integration. #17093
Currently, the SGLang DeepGEMM entrypoint hardcodes a `(1, 128, 128)` recipe with FP32 scaling factors. This PR exposes a `recipe` argument in the entrypoint, allowing more general FP8 recipes. For example, after deepseek-ai/DeepGEMM#280, MXFP8 can be handled by passing a `(1, 1, 32)` recipe.
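To see how a `(1, 1, 32)` recipe lines up with MXFP8, compare per-row scale-factor storage under the two recipes. The sketch below assumes MXFP8's 1-byte E8M0 scales versus 4-byte FP32 scales, and K=7168 is just an illustrative dimension, not taken from the PR:

```python
from math import ceil

def scale_bytes_per_row(k: int, block_k: int, bytes_per_scale: int) -> int:
    # One scale per block_k elements along the K dimension.
    return ceil(k / block_k) * bytes_per_scale

k = 7168
print(scale_bytes_per_row(k, 128, 4))  # FP32 scale per 128 elements -> 224 bytes
print(scale_bytes_per_row(k, 32, 1))   # E8M0 scale per 32 elements  -> 224 bytes
```

The finer-grained recipe gives 4x more scales per row, but 1-byte scales keep the storage footprint the same; the entrypoint only needs the recipe and scale dtype to size its tensors correctly.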
Modifications
Optionally pass `recipe` and `sf_dtype` to the entrypoint.
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci