[Triton] DS FP4/FP8 Triton fusion and GEMM optimization#119

Merged
k50112113 merged 61 commits into main from shaoclee/ds_fp4_gemm
Jan 14, 2026

Conversation

@k50112113 (Contributor) commented Jan 9, 2026

This PR is co-authored by @k50112113, @omuhamma (#61) and @farlukas (#116)

This PR provides Triton fusion/GEMM optimizations for DS FP4 and FP8. Some of the required PRs have not yet been merged into AITER main, so please use the following AITER branch for testing for now:
https://github.com/ROCm/aiter/tree/shaoclee/atom_triton_tmp_0106

The required AITER PRs include:

  1. [Triton] Triton A16WFP4 GEMM prequant aiter#1777
  2. [Triton] Triton a16w8 gemm preshuffle aiter#1778
  3. [Triton] Add Fused GEMM A8W8 + Split + Concat Triton Kernel aiter#1553 (review)

To activate the optimizations on ATOM, the following env variables are required:

# for concurrency > 4, use AR + RMS_Quant + GEMM optimizations:
export ATOM_USE_TRITON_GEMM=1
# note: ATOM_ENABLE_DS_INPUT_RMSNORM_QUANT_FUSION is turned on automatically when ATOM_USE_TRITON_GEMM is on

# for concurrency = 4, use AR_RMS + Quant_GEMM optimizations:
export ATOM_USE_TRITON_GEMM=1
export ATOM_ENABLE_DS_INPUT_RMSNORM_QUANT_FUSION=0
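
The interaction of the two env vars above can be sketched as follows. This is illustrative only: the helper name `select_fusion_mode` and the mode labels are hypothetical, not ATOM API; only the env-var semantics follow this PR's description.

```python
import os

def select_fusion_mode() -> str:
    """Hypothetical helper: pick the fusion path implied by the env vars."""
    use_triton_gemm = os.environ.get("ATOM_USE_TRITON_GEMM", "0") == "1"
    # ATOM_ENABLE_DS_INPUT_RMSNORM_QUANT_FUSION defaults to on when the
    # Triton GEMM path is enabled, unless explicitly set to 0.
    rms_quant_fusion = os.environ.get(
        "ATOM_ENABLE_DS_INPUT_RMSNORM_QUANT_FUSION",
        "1" if use_triton_gemm else "0",
    ) == "1"

    if use_triton_gemm and rms_quant_fusion:
        return "AR + RMS_Quant + GEMM"   # recommended for concurrency > 4
    if use_triton_gemm:
        return "AR_RMS + Quant_GEMM"     # recommended for concurrency = 4
    return "baseline"
```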

The following commands, together with the env vars above, were used to derive e2e performance results:

# for DS FP8
python -m atom.entrypoints.openai_server \
    --model /data/deepseek-ai/DeepSeek-R1-0528/ \
    -tp 8 \
    --block-size 1 \
    --server-port 8989 2>&1 | tee server.out

# for DS FP4
export ATOM_USE_TRITON_MXFP4_BMM=1
export AMDGCN_USE_BUFFER_OPS=1
python -m atom.entrypoints.openai_server \
    --model /data/DeepSeek-R1-0528-MXFP4-Preview \
    -tp 8 \
    --block-size 16 \
    --kv_cache_dtype fp8 \
    --server-port 8989 \
    2>&1 | tee server.out
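
Once the server is up, a quick smoke test can be sent before benchmarking. The snippet below only builds the request body; the field names follow the generic OpenAI `/v1/completions` schema that vLLM-compatible servers typically accept, so verify against ATOM's actual endpoint before relying on it.

```python
import json

def completion_payload(model: str, prompt: str, max_tokens: int = 32) -> str:
    # Minimal OpenAI-style completions request body (assumed schema).
    return json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    )

# Example usage (paths hypothetical):
# curl -s http://localhost:8989/v1/completions \
#      -H 'Content-Type: application/json' \
#      -d "$(python -c 'print(completion_payload(...))')"
```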

Client benchmark command:

MODEL=<DS FP4 or FP8 model paths>
ISL=3500
OSL=1500
PORT=8989
for CONC in 4 256 128 64 32 16 8; do
    RESULT_FILENAME=${ISL}_${OSL}_${CONC}
    python /root/ATOM/atom/benchmarks/benchmark_serving.py \
        --model=$MODEL --backend=vllm --base-url=http://localhost:$PORT \
        --dataset-name=random \
        --random-input-len=$ISL --random-output-len=$OSL \
        --random-range-ratio 1.0 \
        --num-prompts=$(( $CONC * 8 )) \
        --max-concurrency=$CONC \
        --request-rate=inf --ignore-eos \
        --save-result --percentile-metrics="ttft,tpot,itl,e2el" \
        --result-dir=./ --result-filename=$RESULT_FILENAME.json 2>&1 | tee -a ${RESULT_FILENAME}.log
done
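
For reference, the sweep that the shell loop above performs can be enumerated in Python. The dict keys here are descriptive labels for readability, not `benchmark_serving.py` arguments:

```python
# Reconstructs the benchmark matrix from the shell loop above.
ISL, OSL = 3500, 1500

sweep = [
    {
        "max_concurrency": conc,
        "num_prompts": conc * 8,  # mirrors $(( $CONC * 8 ))
        "result_filename": f"{ISL}_{OSL}_{conc}.json",
    }
    for conc in (4, 256, 128, 64, 32, 16, 8)
]

for cfg in sweep:
    print(cfg)
```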

DS FP8 performance comparisons and uplift
[image: DS FP8 benchmark results]

DS FP4 performance comparisons and uplift
[image: DS FP4 benchmark results]

k50112113 and others added 30 commits December 11, 2025 19:32
@k50112113 k50112113 requested a review from valarLip January 9, 2026 15:47
@ChuanLi1101 (Collaborator) previously approved these changes Jan 10, 2026, commenting:


Overall LGTM, approved for benchmark testing.

@k50112113 (Author) commented:

Addressed all the comments

@k50112113 k50112113 requested review from ChuanLi1101 and valarLip and removed request for valarLip January 12, 2026 15:04
@ChuanLi1101 (Collaborator) left a comment:


Approved; the comments have been addressed.

@k50112113 k50112113 merged commit c5fab3e into main Jan 14, 2026
4 checks passed
@k50112113 k50112113 deleted the shaoclee/ds_fp4_gemm branch January 14, 2026 18:26
PerryZhang01 pushed a commit that referenced this pull request Jan 30, 2026
* tmp

* fix

* clean

* Making the BMM use fp4 weights

* add ATOM_USE_TRITON_GEMM and a16wfp4 gemm for o_proj

* Cleaning up the code and ensuring other weights won't crash

* add import check for gemm_a16wfp4_preshuffle

* clean

* clean

* disable FP4 triton GEMM on o_proj on DS FP4

* Fused rms for fp4

* Adding the x_scale change in linear.py to choose when to quantize or not

* Enabling the second fused rms before attention

* Fixing issue where there was a shape mismatch when running the second fused rms

* Marking shuffle and shuffle padding as true temporarily always

* Working implementation of fused_rms for fp4

* Formatting fixes

* Fix syntax error

* Remove some commented code from the fp4 section

* disable only AR + input layernorm with ATOM_ENABLE_RMSNORM_QUANT_FUSION=1

* add _fuse_qkv_a_proj_reduce_rmsnorm_quant for DS FP4

* add gemm split + cat for DS FP4

* Integrated fused rmsnorm + quant in decoder layer

* No need to fuse post attention

* Refactored fusion condition

* Transpose scales for input layernorm

* Added torch compile guards on fusion to enable torch compiler

* Refactored fp8 fused rms quant function

* Added fp8 triton preshuffled gemm

* Fixed triton gemm condition

* Added fused rmsnorm quant fp8 back in

* Added transpose_scale back to fp8 fake function

* Remove duplicate env

* Implemented fp8 gemm preshuffled + split + cat

* add back triton fusk_rope_kv_cache

* consider both AR_RMS + Quant and AR + RMS_Quant condition via ATOM_ENABLE_RMSNORM_QUANT_FUSION

* Implemented fp8 fused reduce rms quant

* change boundary

* Removed unreachable branch

* Added transpose_scale to fused reduce rms quant

* fix

* clean

* add a16w8 preshuffle gemm

* clean

* change fp8 gemm boundary

* triton fp8 gemm rename

* remove loader change

* remove comments

* address comments

---------

Co-authored-by: Omar Muhammad <omar.muhammad@amd.com>
Co-authored-by: Farel Lukas <farlukas@amd.com>
