
Add Fused RMSNorm + FP8 Per-tensor Static Quantization to Llama 3 Models #789

Merged
dllehr-amd merged 2 commits into 355_wip from farlukas/355_wip_ll_fp8_fuse_rms_quant on Nov 21, 2025

Conversation


@farlukas farlukas commented Nov 4, 2025

Purpose

Enable VLLM_ROCM_USE_AITER_TRITON_FUSED_RMSNORM_FP8_QUANT support for Llama 3 FP8 models. With this flag set, the decoder layer uses a single fused RMSNorm + FP8 per-tensor static quantization kernel in place of two separate RMSNorm and quantization kernels, specifically before and after self-attention.

This PR depends on ROCm/aiter#1330.
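Conceptually, the fusion folds two kernel launches into one, avoiding a round trip through memory for the normalized intermediate. A minimal NumPy sketch of the math, for illustration only: the real implementation is an AITER Triton kernel, and the fp8 e4m3 clamp range of ±448 and the per-tensor scale handling here are assumptions, not taken from this PR.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # assumed representable max of the OCP fp8 e4m3 format

def rmsnorm(x, weight, eps=1e-6):
    # Root-mean-square normalization over the hidden dimension.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def fp8_static_quant(x, scale):
    # Per-tensor static quantization: one precomputed scale for the whole
    # tensor, values clamped to the fp8 e4m3 range. A real kernel would
    # also round/cast to an fp8 dtype here.
    return np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)

def rmsnorm_fp8_quant_unfused(x, weight, scale):
    # Two launches: normalize, then quantize, with the intermediate
    # tensor written to and read back from memory in between.
    return fp8_static_quant(rmsnorm(x, weight), scale)

def rmsnorm_fp8_quant_fused(x, weight, scale, eps=1e-6):
    # One logical kernel: the normalized value is quantized immediately,
    # without materializing the intermediate tensor.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return np.clip((x / rms) * weight / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)

# The two paths are numerically equivalent; only the launch count differs.
x = np.random.randn(4, 128).astype(np.float32)
w = np.ones(128, dtype=np.float32)
assert np.allclose(rmsnorm_fp8_quant_unfused(x, w, 0.05),
                   rmsnorm_fp8_quant_fused(x, w, 0.05))
```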

Test Plan

LM Evaluation Harness

MODEL=amd/Llama-3.3-70B-Instruct-FP8-KV
batch_size=8
for p in 250; do
    lm_eval \
        --model local-completions \
        --model_args model=$MODEL,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=${batch_size},max_retries=10,max_gen_toks=2048 \
        --tasks gsm8k \
        --num_fewshot 5 \
        --batch_size ${batch_size} \
        --limit $p \
        --log_samples \
        --output_path samples \
        2>&1 | tee eval.log
done
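The harness above assumes a vLLM server is already listening on port 8000. A sketch of how the server might be launched for the two configurations; the `vllm serve` invocation and any flags beyond the model name are assumptions, not taken from this PR:

```shell
# Baseline: fusion disabled
VLLM_ROCM_USE_AITER_TRITON_FUSED_RMSNORM_FP8_QUANT=0 \
    vllm serve amd/Llama-3.3-70B-Instruct-FP8-KV --port 8000

# Fused RMSNorm + FP8 per-tensor static quant path
VLLM_ROCM_USE_AITER_TRITON_FUSED_RMSNORM_FP8_QUANT=1 \
    vllm serve amd/Llama-3.3-70B-Instruct-FP8-KV --port 8000
```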

E2E Testing

perf_pth=perf
mkdir -p ${perf_pth}

for isl_osl in "1024 1024"; do
    set -- $isl_osl
    isl=$1
    osl=$2
    for concurrency in 64 32 16 8 4; do
        for itr in 1; do
            num_prompts=$(($concurrency * 16))
            python3 /app/vllm/benchmarks/benchmark_serving.py \
                --backend vllm \
                --model amd/Llama-3.3-70B-Instruct-FP8-KV \
                --dataset-name random \
                --num-prompts $num_prompts \
                --random-input $isl \
                --random-output $osl \
                --random-range-ratio 0 \
                --seed 0 \
                --ignore-eos \
                --request-rate $concurrency \
                --max-concurrency $concurrency \
                --percentile_metrics ttft,tpot,itl,e2el \
                --port 8000 | tee -a ${perf_pth}/perf_${concurrency}_${isl}_${osl}.log
        done
    done
done
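The per-run logs written to ${perf_pth} can then be scraped into a summary. A small sketch, assuming benchmark_serving.py emits lines of the form `Median TTFT (ms):          69.24` (the exact log format is an assumption here):

```python
import re

def parse_median_metric(log_text, name):
    """Pull the value out of a 'Median <name> (ms):  <value>' log line."""
    m = re.search(rf"Median {re.escape(name)} \(ms\):\s+([\d.]+)", log_text)
    return float(m.group(1)) if m else None

# Example against a synthetic log fragment:
sample = """\
Median TTFT (ms):          69.24
Median TPOT (ms):          9.22
"""
print(parse_median_metric(sample, "TTFT"))  # -> 69.24
print(parse_median_metric(sample, "TPOT"))  # -> 9.22
```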

Test Result

Baseline (without fusion)

VLLM_ROCM_USE_AITER_TRITON_FUSED_RMSNORM_FP8_QUANT=0

Tasks  Version  Filter            n-shot  Metric       Value  Stderr
gsm8k  3        flexible-extract  5       exact_match  0.936  ± 0.0155
                strict-match      5       exact_match  0.900  ± 0.0190

ISL=1024, OSL=1024 (concurrency)      4        8        16        32        64
Request throughput (req/s)           0.42     0.8      1.46      2.87      4.95
Output token throughput (tok/s)    429.55   814.87   1498.34   2940.01   5073.09
Total token throughput (tok/s)     858.68  1628.06   2994.42   5875.19  10135.55
Median TTFT (ms)                    69.24    82.59    162.55    155.82    643.67
Median TPOT (ms)                     9.22     9.66     10.5      10.65     11.98
Median ITL (ms)                      9.1      9.46     10.22     10.04     11.36
Median E2EL (ms)                  9498.33  9963.65  10891.54  11057.29  12894.95

Fusion

VLLM_ROCM_USE_AITER_TRITON_FUSED_RMSNORM_FP8_QUANT=1

Tasks  Version  Filter            n-shot  Metric       Value  Stderr
gsm8k  3        flexible-extract  5       exact_match  0.940  ± 0.0151
                strict-match      5       exact_match  0.916  ± 0.0176

ISL=1024, OSL=1024 (concurrency)      4        8        16        32        64
Request throughput (req/s)           0.45     0.86     1.56      3.05      5.25
Output token throughput (tok/s)    460.47   875.57   1596.8    3119.94   5371.61
Total token throughput (tok/s)     920.49  1749.34   3191.18   6234.74  10731.97
Median TTFT (ms)                    73.64    74.95    154.41    146.27    775.05
Median TPOT (ms)                     8.59     9        9.84     10.11     11.16
Median ITL (ms)                      8.49     8.78     9.57      9.38      10.7
Median E2EL (ms)                  8860.84  9281.21  10222.01  10467.64  12187.18
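The fused path's gain can be quantified directly from the two tables; a quick check at concurrency 64 (values copied from the results above):

```python
# Concurrency 64, ISL=OSL=1024, from the baseline and fusion tables above.
baseline = {"req_per_s": 4.95, "median_tpot_ms": 11.98}
fused    = {"req_per_s": 5.25, "median_tpot_ms": 11.16}

# Relative improvement: higher is better for throughput, lower for TPOT.
tput_gain = (fused["req_per_s"] / baseline["req_per_s"] - 1) * 100
tpot_gain = (1 - fused["median_tpot_ms"] / baseline["median_tpot_ms"]) * 100
print(f"throughput: +{tput_gain:.1f}%, median TPOT: -{tpot_gain:.1f}%")
# -> throughput: +6.1%, median TPOT: -6.8%
```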

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results

@farlukas changed the title to Add Fused RMSNorm + FP8 Per-tensor Static Quantization to Llama 3 Models on Nov 4, 2025
@farlukas farlukas marked this pull request as ready for review November 5, 2025 15:28
@vgokhale vgokhale requested a review from dllehr-amd November 5, 2025 16:54
tpopp added a commit to amdsiloai/vllm that referenced this pull request Nov 20, 2025
Cherry-pick ROCm#789

Note this needs other changes back ported.
@dllehr-amd (Collaborator) left a comment:
Merging!

@dllehr-amd dllehr-amd merged commit 0ba4600 into 355_wip Nov 21, 2025
7 of 9 checks passed
tpopp added a commit to amdsiloai/vllm that referenced this pull request Nov 24, 2025
Cherry-pick ROCm#789

Note this needs other changes back ported.
@gshtras gshtras deleted the farlukas/355_wip_ll_fp8_fuse_rms_quant branch January 16, 2026 15:34