fused rmsnorm quant fp8 work #2

Closed
yctseng0211 wants to merge 4 commits into main from dev-fuse_rms_f8_quant
Conversation


@yctseng0211 yctseng0211 commented Oct 30, 2025

Motivation

Fusion of RMSNorm + group-scale FP8 quantization (group_size=128).
This PR uses the Triton kernel from the main branch of ROCm/aiter: ROCm/aiter#1148

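To make the fused operation concrete, here is a minimal NumPy sketch of the math the fused Triton kernel performs (RMSNorm followed by per-group FP8 scaling, group_size=128). This is a hypothetical reference, not the actual aiter kernel: the function name is made up, and FP8 storage is simulated by clipping to the e4m3 range while keeping fp32 values.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # max representable magnitude of float8_e4m3


def rmsnorm_group_quant_ref(x, weight, group_size=128, eps=1e-6):
    """Hypothetical unfused reference for rmsnorm + group-scale FP8 quant.

    Normalizes each row with RMSNorm, then quantizes groups of
    `group_size` consecutive columns, each group getting its own
    scale = amax(group) / FP8_E4M3_MAX.
    """
    x = x.astype(np.float32)
    rows, cols = x.shape
    assert cols % group_size == 0, "hidden size must be a multiple of group_size"

    # RMSNorm: y = x / sqrt(mean(x^2) + eps) * weight
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    y = x / rms * weight

    # Group-wise FP8 quantization (simulated: clip to FP8 range, fp32 storage)
    g = y.reshape(rows, cols // group_size, group_size)
    scale = np.abs(g).max(axis=-1, keepdims=True) / FP8_E4M3_MAX
    scale = np.maximum(scale, 1e-12)  # guard against all-zero groups
    q = np.clip(g / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q.reshape(rows, cols), scale.squeeze(-1)
```

The fused kernel produces the quantized tensor and its per-group scales in a single pass, which is what removes the extra read/write of the normalized activations seen in the unfused profile below.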

w/o fusion: [profiling trace image]

w/ fusion: [profiling trace image]

Modifications

deepseek.py

  • LayerCommunicator (fused rmsnorm + FP8 quant for hidden states)
  • DeepseekV2AttentionMLA (fused rmsnorm + FP8 quant for q, k_)

f8.py
fp8_utils.py

  • aiter_w8a8_block_fp8_linear: skip activation quantization here when x is already a tuple (pre-quantized by the fused kernel)
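The tuple-skip change in aiter_w8a8_block_fp8_linear can be sketched as follows. This is a hypothetical simplification with made-up parameter names (quant_fn, gemm_fn); the real function in fp8_utils.py has a different signature, but the dispatch logic is the same: a tuple input means the fused rmsnorm+quant kernel already produced (x_q, x_scale), so the standalone quantization step is skipped.

```python
def w8a8_block_fp8_linear_sketch(x, weight_q, weight_scale, quant_fn, gemm_fn):
    """Hypothetical sketch of the tuple-skip logic.

    When `x` is a tuple, the fused upstream kernel has already
    quantized the activations; otherwise fall back to the
    standalone per-group quantization before the FP8 GEMM.
    """
    if isinstance(x, tuple):
        x_q, x_scale = x            # already quantized by the fused kernel
    else:
        x_q, x_scale = quant_fn(x)  # standalone quantization path
    return gemm_fn(x_q, x_scale, weight_q, weight_scale)
```

This keeps the linear layer's call sites unchanged: callers that still pass a plain tensor get the old behavior, while layers rewired to the fused kernel simply pass the (tensor, scale) pair through.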

Accuracy Tests

python3 -m sglang.launch_server \
    --model-path /DeepSeek-R1-0528/ \
    --quantization fp8 \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --chunked-prefill-size 131072 \
    --host 0.0.0.0 \
    --port 8008 \
    --log-requests \
    --disable-radix-cache \
    --mem-fraction-static 0.95

python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --port 8008

w/ fused_rms_quant at LayerCommunicator and DeepseekV2AttentionMLA
Accuracy: 0.961
Invalid: 0.000
Latency: 55.205 s
Output throughput: 2412.560 token/s

w/ fused_rms_quant at LayerCommunicator
Accuracy: 0.960
Invalid: 0.000
Latency: 55.688 s
Output throughput: 2366.073 token/s

w/o fused_rms_quant
Accuracy: 0.957
Invalid: 0.000
Latency: 57.499 s
Output throughput: 2302.590 token/s

Benchmarking and Profiling

Docker image: docker.io/rocm/sgl-dev:v0.5.3.post2-rocm700-mi35x-20251016
Model: /DeepSeek-R1-0528/

Server Mode

MI355, /DeepSeek-R1-0528/; w/o fused = SGLang + spec, w/ fused = SGLang + spec + rms quant fused.

| Concurrency | w/o fused Latency (ms) | w/o fused Throughput (tok/s) | w/ fused Latency (ms) | w/ fused Throughput (tok/s) | Latency (%) | Throughput (%) |
|---|---|---|---|---|---|---|
| 1 | 6214.4 | 614.0 | 6176.89 | 647.3 | 101% | 105% |
| 2 | 6577.2 | 1179.1 | 6481.13 | 1153.1 | 101% | 98% |
| 4 | 7595.5 | 2034.4 | 7271.11 | 2029.8 | 104% | 100% |
| 8 | 8407.2 | 3522.0 | 8338.99 | 3645.4 | 101% | 104% |
| 16 | 10511.4 | 5822.1 | 10519.08 | 5755.3 | 100% | 99% |
| 32 | 13928.9 | 8739.0 | 13749.6 | 8856.5 | 101% | 101% |
| 64 | 23414.9 | 10482.2 | 22691.04 | 10724.9 | 103% | 102% |

Offline Mode

| Batch Size | Input Size | Output Size | Prefill Latency (%) | Decode Latency (%) |
|---|---|---|---|---|
| 1 | 256 | 2048 | 82% | 103% |
| 2 | 256 | 2048 | 81% | 103% |
| 4 | 256 | 2048 | 69% | 103% |
| 8 | 256 | 2048 | 101% | 102% |
| 16 | 256 | 2048 | 101% | 101% |
| 32 | 256 | 2048 | 101% | 100% |
| 64 | 256 | 2048 | 101% | 101% |
| 128 | 256 | 2048 | 101% | 100% |
| 256 | 256 | 2048 | 101% | 102% |
| 512 | 256 | 2048 | 101% | 101% |

@yctseng0211 changed the title from "Initial commit of fused RMS FP8 development work" to "fused rmsnorm quant fp8 work" on Oct 30, 2025