[AMD] Fix GLM-5 fp8 KV quant path dispatch on MI300 #22314
Merged
HaiShaw merged 4 commits into sgl-project:main on Apr 8, 2026
Conversation
kkHuang-amd reviewed on Apr 8, 2026
    if (
        _is_hip
        and self.use_nsa
        and self.dtype in (torch.float8_e4m3fn, torch.float8_e4m3fnuz)
Collaborator
You can import fp8_dtype ("from sglang.srt.layers.quantization.fp8_kernel import fp8_dtype") and use "self.dtype == fp8_dtype" for the condition check.
fp8_dtype is torch.float8_e4m3fnuz on MI300X and torch.float8_e4m3fn on MI35X.
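The suggested simplification can be sketched as a small standalone example. This is a hypothetical sketch, not sglang's actual code: the string constants stand in for torch's fp8 dtypes so the example runs without torch or a GPU, and _is_fp8_fnuz is a stand-in for sglang's real platform detection.

```python
# Stand-ins for torch.float8_e4m3fn / torch.float8_e4m3fnuz so this sketch
# runs without a torch install (hypothetical, for illustration only).
FLOAT8_E4M3FN = "float8_e4m3fn"
FLOAT8_E4M3FNUZ = "float8_e4m3fnuz"

# In sglang, fp8_kernel.fp8_dtype is resolved once per platform:
# float8_e4m3fnuz on MI300X, float8_e4m3fn on MI35X.
_is_fp8_fnuz = True  # pretend we are on MI300X
fp8_dtype = FLOAT8_E4M3FNUZ if _is_fp8_fnuz else FLOAT8_E4M3FN

def should_use_fp8_quant_path(is_hip, use_nsa, dtype):
    # A single equality check against the platform-resolved fp8_dtype
    # replaces the two-member tuple membership test from the diff above.
    return is_hip and use_nsa and dtype == fp8_dtype
```

The upside of the reviewer's suggestion is that the platform decision is made in one place (fp8_kernel) instead of being re-derived at each call site.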
Collaborator
Author
Fixed, and it reran well. Really appreciate the reminder.
kkHuang-amd reviewed on Apr 8, 2026
    ):
        # HIP FP8 path uses raw MLA KV layout (nope + rope) without per-block scales.
        # Fuse BF16/FP16 -> FP8 cast with paged KV write.
        fp8_dtype = torch.float8_e4m3fnuz if _is_fp8_fnuz else torch.float8_e4m3fn
Collaborator
Remove line 1585 once you use "from sglang.srt.layers.quantization.fp8_kernel import fp8_dtype".
HaiShaw approved these changes on Apr 8, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request on Apr 8, 2026
Motivation
On MI300, running GLM-5-fp8 with an FP8 KV cache can fail (see CI log). The root cause is that the quant path does not dispatch the correct kernel (set_mla_kv_buffer_triton_fp8_quant).
Modifications
The flag self.nsa_kv_cache_store_fp8 is true only when the KV cache is stored in fp8 with scaling. Our attention path uses an fp8 KV cache without scaling, so it should not be gated by this flag. This change moves the HIP + fp8 quant path out of the scaling-specific branch, ensuring MI300 dispatches the correct fused kernel (set_mla_kv_buffer_triton_fp8_quant). This change only affects the MI300 code path.
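The branch restructuring described above can be sketched roughly as follows. All function, flag, and enum names here are hypothetical stand-ins for illustration; the real logic lives in sglang's MLA KV-write path.

```python
from enum import Enum, auto

class KvWritePath(Enum):
    FP8_SCALED_STORE = auto()  # fp8 KV cache stored with per-block scales
    HIP_FP8_QUANT = auto()     # fused BF16/FP16 -> FP8 cast, no scales (MI300)
    DEFAULT = auto()

def select_kv_write_path(is_hip, use_nsa, kv_dtype_is_fp8, store_fp8_with_scales):
    """Hypothetical dispatch sketch, not sglang's actual code.

    Before the fix, the HIP fp8 quant check was nested under the
    store_fp8_with_scales condition (self.nsa_kv_cache_store_fp8), so
    MI300's unscaled fp8 KV cache never reached the fused kernel.
    """
    if store_fp8_with_scales:
        return KvWritePath.FP8_SCALED_STORE
    # After the fix this check sits outside the scaling-specific branch,
    # so MI300 dispatches set_mla_kv_buffer_triton_fp8_quant here.
    if is_hip and use_nsa and kv_dtype_is_fp8:
        return KvWritePath.HIP_FP8_QUANT
    return KvWritePath.DEFAULT
```

In this sketch, an MI300-style configuration (HIP, NSA, fp8 KV dtype, no scaling) now selects the fused quant path rather than falling through to the default.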
Accuracy Tests
GLM-5-fp8 with fp8 KV cache accuracy: 0.945. Also validated with the new CI script test_glm5_perf_amd.py prepared in PR #21710.
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci