fused rmsnorm quant fp8 work #2

Closed
yctseng0211 wants to merge 4 commits into main from dev-fuse_rms_f8_quant
Conversation


@yctseng0211 yctseng0211 commented Oct 30, 2025

Motivation

Fusion of RMSNorm + group-scale FP8 quantization (group_size=128).
This PR uses the Triton kernel from the main branch of ROCm/aiter: ROCm/aiter#1148

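To make the fused operation concrete, here is a minimal NumPy sketch of the math the fused Triton kernel performs (RMSNorm followed by per-group FP8 scaling, group_size=128). This is a hypothetical reference, not the actual aiter kernel: the function name is made up, and FP8 storage is simulated by clipping to the e4m3 range while keeping fp32 values.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # max representable magnitude of float8_e4m3


def rmsnorm_group_quant_ref(x, weight, group_size=128, eps=1e-6):
    """Hypothetical unfused reference for rmsnorm + group-scale FP8 quant.

    Normalizes each row with RMSNorm, then quantizes groups of
    `group_size` consecutive columns, each group getting its own
    scale = amax(group) / FP8_E4M3_MAX.
    """
    x = x.astype(np.float32)
    rows, cols = x.shape
    assert cols % group_size == 0, "hidden size must be a multiple of group_size"

    # RMSNorm: y = x / sqrt(mean(x^2) + eps) * weight
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    y = x / rms * weight

    # Group-wise FP8 quantization (simulated: clip to FP8 range, fp32 storage)
    g = y.reshape(rows, cols // group_size, group_size)
    scale = np.abs(g).max(axis=-1, keepdims=True) / FP8_E4M3_MAX
    scale = np.maximum(scale, 1e-12)  # guard against all-zero groups
    q = np.clip(g / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q.reshape(rows, cols), scale.squeeze(-1)
```

The fused kernel produces the quantized tensor and its per-group scales in a single pass, which is what removes the extra read/write of the normalized activations seen in the unfused profile below.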

w/o fusion: [profiling trace image]

w/ fusion: [profiling trace image]

Modifications

deepseek.py

  • LayerCommunicator (fused rmsnorm + FP8 quant for hidden states)
  • DeepseekV2AttentionMLA (fused rmsnorm + FP8 quant for q, k_)

f8.py
fp8_utils.py

  • aiter_w8a8_block_fp8_linear: skip activation quantization here when x is already a tuple (pre-quantized by the fused kernel)
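The tuple-skip change in aiter_w8a8_block_fp8_linear can be sketched as follows. This is a hypothetical simplification with made-up parameter names (quant_fn, gemm_fn); the real function in fp8_utils.py has a different signature, but the dispatch logic is the same: a tuple input means the fused rmsnorm+quant kernel already produced (x_q, x_scale), so the standalone quantization step is skipped.

```python
def w8a8_block_fp8_linear_sketch(x, weight_q, weight_scale, quant_fn, gemm_fn):
    """Hypothetical sketch of the tuple-skip logic.

    When `x` is a tuple, the fused upstream kernel has already
    quantized the activations; otherwise fall back to the
    standalone per-group quantization before the FP8 GEMM.
    """
    if isinstance(x, tuple):
        x_q, x_scale = x            # already quantized by the fused kernel
    else:
        x_q, x_scale = quant_fn(x)  # standalone quantization path
    return gemm_fn(x_q, x_scale, weight_q, weight_scale)
```

This keeps the linear layer's call sites unchanged: callers that still pass a plain tensor get the old behavior, while layers rewired to the fused kernel simply pass the (tensor, scale) pair through.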

Accuracy Tests

python3 -m sglang.launch_server \
    --model-path /DeepSeek-R1-0528/ \
    --quantization fp8 \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --chunked-prefill-size 131072 \
    --host 0.0.0.0 \
    --port 8008 \
    --log-requests \
    --disable-radix-cache \
    --mem-fraction-static 0.95

python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --port 8008

w/ fused_rms_quant at LayerCommunicator and DeepseekV2AttentionMLA
Accuracy: 0.961
Invalid: 0.000
Latency: 55.205 s
Output throughput: 2412.560 token/s

w/ fused_rms_quant at LayerCommunicator
Accuracy: 0.960
Invalid: 0.000
Latency: 55.688 s
Output throughput: 2366.073 token/s

w/o fused_rms_quant
Accuracy: 0.957
Invalid: 0.000
Latency: 57.499 s
Output throughput: 2302.590 token/s

Benchmarking and Profiling

Docker image: docker.io/rocm/sgl-dev:v0.5.3.post2-rocm700-mi35x-20251016
Model: /DeepSeek-R1-0528/

Server Mode

MI355, /DeepSeek-R1-0528/; w/o fused = SGLang + spec, w/ fused = SGLang + spec + rms quant fused.

| Concurrency | w/o fused Latency (ms) | w/o fused Throughput (tok/s) | w/ fused Latency (ms) | w/ fused Throughput (tok/s) | Latency (%) | Throughput (%) |
|---|---|---|---|---|---|---|
| 1 | 6214.4 | 614.0 | 6176.89 | 647.3 | 101% | 105% |
| 2 | 6577.2 | 1179.1 | 6481.13 | 1153.1 | 101% | 98% |
| 4 | 7595.5 | 2034.4 | 7271.11 | 2029.8 | 104% | 100% |
| 8 | 8407.2 | 3522.0 | 8338.99 | 3645.4 | 101% | 104% |
| 16 | 10511.4 | 5822.1 | 10519.08 | 5755.3 | 100% | 99% |
| 32 | 13928.9 | 8739.0 | 13749.6 | 8856.5 | 101% | 101% |
| 64 | 23414.9 | 10482.2 | 22691.04 | 10724.9 | 103% | 102% |

Offline Mode

| Batch Size | Input Size | Output Size | Prefill Latency (%) | Decode Latency (%) |
|---|---|---|---|---|
| 1 | 256 | 2048 | 82% | 103% |
| 2 | 256 | 2048 | 81% | 103% |
| 4 | 256 | 2048 | 69% | 103% |
| 8 | 256 | 2048 | 101% | 102% |
| 16 | 256 | 2048 | 101% | 101% |
| 32 | 256 | 2048 | 101% | 100% |
| 64 | 256 | 2048 | 101% | 101% |
| 128 | 256 | 2048 | 101% | 100% |
| 256 | 256 | 2048 | 101% | 102% |
| 512 | 256 | 2048 | 101% | 101% |

@yctseng0211 changed the title from "Initial commit of fused RMS FP8 development work" to "fused rmsnorm quant fp8 work" on Oct 30, 2025