
Add Fused RMSNorm + FP8 Per-tensor Static Quantization to Llama 3 Models #789

Merged
dllehr-amd merged 2 commits into 355_wip from farlukas/355_wip_ll_fp8_fuse_rms_quant on Nov 21, 2025

Conversation


@farlukas farlukas commented Nov 4, 2025

Purpose

Enable VLLM_ROCM_USE_AITER_TRITON_FUSED_RMSNORM_FP8_QUANT support for Llama 3 FP8 models. With this flag set, the decoder layer uses a single fused RMSNorm + FP8 per-tensor static quantization kernel in place of two separate RMSNorm and quantization kernels, specifically before and after self-attention.

This PR depends on ROCm/aiter#1330.
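Conceptually, the fusion folds two kernel launches into one, avoiding a round trip through memory for the normalized intermediate. A minimal NumPy sketch of the math, for illustration only: the real implementation is an AITER Triton kernel, and the fp8 e4m3 clamp range of ±448 and the per-tensor scale handling here are assumptions, not taken from this PR.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # assumed representable max of the OCP fp8 e4m3 format

def rmsnorm(x, weight, eps=1e-6):
    # Root-mean-square normalization over the hidden dimension.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def fp8_static_quant(x, scale):
    # Per-tensor static quantization: one precomputed scale for the whole
    # tensor, values clamped to the fp8 e4m3 range. A real kernel would
    # also round/cast to an fp8 dtype here.
    return np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)

def rmsnorm_fp8_quant_unfused(x, weight, scale):
    # Two launches: normalize, then quantize, with the intermediate
    # tensor written to and read back from memory in between.
    return fp8_static_quant(rmsnorm(x, weight), scale)

def rmsnorm_fp8_quant_fused(x, weight, scale, eps=1e-6):
    # One logical kernel: the normalized value is quantized immediately,
    # without materializing the intermediate tensor.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return np.clip((x / rms) * weight / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)

# The two paths are numerically equivalent; only the launch count differs.
x = np.random.randn(4, 128).astype(np.float32)
w = np.ones(128, dtype=np.float32)
assert np.allclose(rmsnorm_fp8_quant_unfused(x, w, 0.05),
                   rmsnorm_fp8_quant_fused(x, w, 0.05))
```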

Test Plan

LM Evaluation Harness

MODEL=amd/Llama-3.3-70B-Instruct-FP8-KV
batch_size=8
for p in 250; do
    lm_eval \
        --model local-completions \
        --model_args model=$MODEL,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=${batch_size},max_retries=10,max_gen_toks=2048 \
        --tasks gsm8k \
        --num_fewshot 5 \
        --batch_size ${batch_size} \
        --limit $p \
        --log_samples \
        --output_path samples \
        2>&1 | tee eval.log
done
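The harness above assumes a vLLM server is already listening on port 8000. A sketch of how the server might be launched for the two configurations; the `vllm serve` invocation and any flags beyond the model name are assumptions, not taken from this PR:

```shell
# Baseline: fusion disabled
VLLM_ROCM_USE_AITER_TRITON_FUSED_RMSNORM_FP8_QUANT=0 \
    vllm serve amd/Llama-3.3-70B-Instruct-FP8-KV --port 8000

# Fused RMSNorm + FP8 per-tensor static quant path
VLLM_ROCM_USE_AITER_TRITON_FUSED_RMSNORM_FP8_QUANT=1 \
    vllm serve amd/Llama-3.3-70B-Instruct-FP8-KV --port 8000
```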

E2E Testing

perf_pth=perf
mkdir -p ${perf_pth}

for isl_osl in "1024 1024"; do
    set -- $isl_osl
    isl=$1
    osl=$2
    for concurrency in 64 32 16 8 4; do
        for itr in 1; do
            num_prompts=$(($concurrency * 16))
            python3 /app/vllm/benchmarks/benchmark_serving.py \
                --backend vllm \
                --model amd/Llama-3.3-70B-Instruct-FP8-KV \
                --dataset-name random \
                --num-prompts $num_prompts \
                --random-input $isl \
                --random-output $osl \
                --random-range-ratio 0 \
                --seed 0 \
                --ignore-eos \
                --request-rate $concurrency \
                --max-concurrency $concurrency \
                --percentile_metrics ttft,tpot,itl,e2el \
                --port 8000 | tee -a ${perf_pth}/perf_${concurrency}_${isl}_${osl}.log
        done
    done
done
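The per-run logs written to ${perf_pth} can then be scraped into a summary. A small sketch, assuming benchmark_serving.py emits lines of the form `Median TTFT (ms):          69.24` (the exact log format is an assumption here):

```python
import re

def parse_median_metric(log_text, name):
    """Pull the value out of a 'Median <name> (ms):  <value>' log line."""
    m = re.search(rf"Median {re.escape(name)} \(ms\):\s+([\d.]+)", log_text)
    return float(m.group(1)) if m else None

# Example against a synthetic log fragment:
sample = """\
Median TTFT (ms):          69.24
Median TPOT (ms):          9.22
"""
print(parse_median_metric(sample, "TTFT"))  # -> 69.24
print(parse_median_metric(sample, "TPOT"))  # -> 9.22
```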

Test Result

Baseline (without fusion)

VLLM_ROCM_USE_AITER_TRITON_FUSED_RMSNORM_FP8_QUANT=0

Tasks  Version  Filter            n-shot  Metric       Value  Stderr
gsm8k  3        flexible-extract  5       exact_match  0.936  ± 0.0155
                strict-match      5       exact_match  0.900  ± 0.0190

ISL=1024, OSL=1024 (concurrency)      4        8        16        32        64
Request throughput (req/s)           0.42     0.8      1.46      2.87      4.95
Output token throughput (tok/s)    429.55   814.87   1498.34   2940.01   5073.09
Total token throughput (tok/s)     858.68  1628.06   2994.42   5875.19  10135.55
Median TTFT (ms)                    69.24    82.59    162.55    155.82    643.67
Median TPOT (ms)                     9.22     9.66     10.5      10.65     11.98
Median ITL (ms)                      9.1      9.46     10.22     10.04     11.36
Median E2EL (ms)                  9498.33  9963.65  10891.54  11057.29  12894.95

Fusion

VLLM_ROCM_USE_AITER_TRITON_FUSED_RMSNORM_FP8_QUANT=1

Tasks  Version  Filter            n-shot  Metric       Value  Stderr
gsm8k  3        flexible-extract  5       exact_match  0.940  ± 0.0151
                strict-match      5       exact_match  0.916  ± 0.0176

ISL=1024, OSL=1024 (concurrency)      4        8        16        32        64
Request throughput (req/s)           0.45     0.86     1.56      3.05      5.25
Output token throughput (tok/s)    460.47   875.57   1596.8    3119.94   5371.61
Total token throughput (tok/s)     920.49  1749.34   3191.18   6234.74  10731.97
Median TTFT (ms)                    73.64    74.95    154.41    146.27    775.05
Median TPOT (ms)                     8.59     9        9.84     10.11     11.16
Median ITL (ms)                      8.49     8.78     9.57      9.38      10.7
Median E2EL (ms)                  8860.84  9281.21  10222.01  10467.64  12187.18
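The fused path's gain can be quantified directly from the two tables; a quick check at concurrency 64 (values copied from the results above):

```python
# Concurrency 64, ISL=OSL=1024, from the baseline and fusion tables above.
baseline = {"req_per_s": 4.95, "median_tpot_ms": 11.98}
fused    = {"req_per_s": 5.25, "median_tpot_ms": 11.16}

# Relative improvement: higher is better for throughput, lower for TPOT.
tput_gain = (fused["req_per_s"] / baseline["req_per_s"] - 1) * 100
tpot_gain = (1 - fused["median_tpot_ms"] / baseline["median_tpot_ms"]) * 100
print(f"throughput: +{tput_gain:.1f}%, median TPOT: -{tpot_gain:.1f}%")
# -> throughput: +6.1%, median TPOT: -6.8%
```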

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results

@farlukas changed the title to Add Fused RMSNorm + FP8 Per-tensor Static Quantization to Llama 3 Models on Nov 4, 2025
@farlukas farlukas marked this pull request as ready for review November 5, 2025 15:28
@vgokhale vgokhale requested a review from dllehr-amd November 5, 2025 16:54
tpopp added a commit to amdsiloai/vllm that referenced this pull request Nov 20, 2025
Cherry-pick ROCm#789

Note this needs other changes back ported.
@dllehr-amd (Collaborator) left a comment:
Merging!

@dllehr-amd dllehr-amd merged commit 0ba4600 into 355_wip Nov 21, 2025
7 of 9 checks passed
tpopp added a commit to amdsiloai/vllm that referenced this pull request Nov 24, 2025
Cherry-pick ROCm#789

Note this needs other changes back ported.
@gshtras gshtras deleted the farlukas/355_wip_ll_fp8_fuse_rms_quant branch January 16, 2026 15:34