[CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on gfx950#4368

Merged
ex-rzr merged 71 commits into develop from users/ex-rzr/ck/fmha-mx-support-on-gfx950
Mar 11, 2026

Conversation

@ex-rzr
Contributor

@ex-rzr ex-rzr commented Feb 6, 2026

Motivation

Microscaling types (mxfp8 and mxfp4) for the fwd qr pipeline.

Technical Details

Microscaling is used when the quant scale mode is `BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are fp8/bf8/fp4.

Supported features:

  • only "qr" pipeline is implemented ("qr_async" will be added later as a separate PR)
  • fp8, bf8 and fp4 types
    • mixed combinations like fp8/bf8 are possible but have never been tested
    • mixed combinations like fp8/fp4 require some additional work to be supported
  • hdim 128 and 256 (smaller hdims are not possible due to restrictions of the "qr" pipeline, but they can be computed using instances with padding)
  • both 32x32x64 and 16x16x128 scale MFMAs are supported
  • Q and K scales are applied along hdim, V scales along the seqlen dimension
  • column-major V only
  • batch and group mode
  • bias, Alibi (tested but no instances by default, just like fp8)
  • masking etc.

Aiter PR with new API args: ROCm/aiter#2008

Test Plan

```
ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8
ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4
```

Test Result

The tests must pass.

Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

ex-rzr added 30 commits February 6, 2026 15:45
With "uf" results are too close to zero, so with a high tolerance
even completely incorrect values pass the check.
LSE is computed and stored in fp32, so there is no precision loss
due to conversion.
Also checking it before OUT helps to understand where the error happens
(first or second GEMM).
…t scales

hdim 128 and 256 and "qr" pipeline pass simple tests.
Almost unchanged, MX-related changes will be done later.
This change breaks the second GEMM with 32x32x64.
K and the K scales need shuffling so that P + P scales and V + V scales have consistent distributions.
It does not need `* (1 / scale)`.
This requires multiplications for P and then O.
It may not be very important for fp8 because almost half of the positive values
are in the [0..1] range.
But for fp4 it is necessary because only 3 of the 8 positive values are in this
range: 0, 0.5, 1.
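The fp4 range claim above can be verified by enumerating the format directly. A minimal Python sketch (not CK code; it assumes the standard OCP e2m1 layout: 2 exponent bits, 1 mantissa bit, subnormals at exponent 0):

```python
# Enumerate the 8 non-negative OCP fp4 (e2m1) encodings and count how
# many land in [0, 1] -- the range that holds softmax probabilities.

def fp4_e2m1_value(bits: int) -> float:
    """Decode a 3-bit e2m1 magnitude (exponent: 2 bits, mantissa: 1 bit)."""
    exp = (bits >> 1) & 0b11
    man = bits & 0b1
    if exp == 0:                      # subnormal encodings: 0 and 0.5
        return man * 0.5
    return (1.0 + man * 0.5) * 2.0 ** (exp - 1)

values = sorted(fp4_e2m1_value(b) for b in range(8))
print(values)                         # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
in_unit = [v for v in values if 0.0 <= v <= 1.0]
print(len(in_unit))                   # 3
```

Of the eight magnitudes only 0, 0.5 and 1.0 fall in [0, 1], which is why rescaling matters so much more for fp4 than for fp8.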
Warning! Policy is not yet updated for packed types, which likely means:
* LDS size is twice as large as actually needed;
* DRAM/LDS size granularity is not optimal (64 bit, not 128).
By default, there are no fp8/bf8 instances with bias, but if they are
enabled for testing purposes, the bias tests fail.
BiasDataType can be as large as fp32, and numeric<BiasDataType>::max() is
too large. Bias is applied to the S matrix, whose values are quite small,
so bias values must have a similar magnitude.
Instead of shuffling K and K scales using DRAM views, it's now done in
BlockGemmMx so the kernel/pipeline don't need to know about such details.
This also solves the issue with masking, bias and alibi because they
work with the layout of the first GEMM's C matrix which is now also
modified accordingly.
Even with padding, alignment must be at least 2.
The default invalid_element_value (numeric<e8m0_t>::zero()) is NaN;
this causes NaNs in MFMA results when inputs are padded.
The algorithm described in the "OCP Microscaling Formats (MX) Specification"
has flaws:
 * it requires clamping, which is not done automatically by
   v_scalef32_pk_fp8_f32;
 * for fp4 it has a high quantization error, e.g. max_abs = 0.99 is
   quantized to 6.0, which is 0.75 after dequantization (especially bad
   for softmax results in (0; 1]).
The previous implementation was better than the OCP one, but it didn't
use the whole fp4 range for some values, losing precision for small
values in the block.
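The fp4 quantization error described above can be reproduced with a small sketch of OCP-style power-of-two scale selection. This is a hedged illustration under stated assumptions (e2m1 max magnitude 6.0, emax 2), not the CK or hardware implementation; `ocp_mx_quantize` and its round-to-grid step are hypothetical helpers:

```python
import math

FP4_MAX = 6.0    # largest e2m1 magnitude
FP4_EMAX = 2     # exponent of the largest power of two <= FP4_MAX

def ocp_mx_quantize(max_abs: float, value: float) -> tuple[float, float]:
    """OCP-style block quantization: pick a shared power-of-two scale from
    the block max, divide, clamp, round to the fp4 grid.
    Returns (quantized element, dequantized value)."""
    # shared scale: 2^(floor(log2(max_abs)) - emax)
    scale = 2.0 ** (math.floor(math.log2(max_abs)) - FP4_EMAX)
    q = value / scale
    # clamping is NOT automatic in v_scalef32_pk_fp8_f32, so it is explicit here
    q = max(-FP4_MAX, min(FP4_MAX, q))
    # round to the nearest representable e2m1 magnitude (sketch)
    grid = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
    q = min(grid, key=lambda g: abs(g - abs(q))) * (1.0 if q >= 0 else -1.0)
    return q, q * scale

q, deq = ocp_mx_quantize(0.99, 0.99)
print(q, deq)   # 6.0 0.75
```

Here max_abs = 0.99 yields scale = 2^-3 = 0.125, so 0.99 / 0.125 = 7.92 clamps to 6.0 and dequantizes to 0.75, a large error for a softmax probability near 1.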
A vector of e8m0_t cannot be constructed from e8m0_t because it has two
conversion operators: to float and to e8m0_t::type.
@bartekxk
Contributor

The amd buffer load part looks good to me.

@ex-rzr
Contributor Author

ex-rzr commented Mar 9, 2026

@poyenc, @DDEle Hi, do you have other review comments?

Contributor

@DDEle DDEle left a comment


Comment thread on projects/composablekernel/test/ck_tile/fmha/CMakeLists.txt
@ex-rzr ex-rzr enabled auto-merge (squash) March 11, 2026 09:38
@ex-rzr ex-rzr merged commit 17f7dfc into develop Mar 11, 2026
45 of 47 checks passed
@ex-rzr ex-rzr deleted the users/ex-rzr/ck/fmha-mx-support-on-gfx950 branch March 11, 2026 09:59
assistant-librarian bot pushed a commit to ROCm/composable_kernel that referenced this pull request Mar 11, 2026
@ex-rzr
Contributor Author

ex-rzr commented Mar 11, 2026

My previous manual run of Aiter tests: pipeline

ROCm/aiter#2008 has been updated to the latest CK commit, waiting for Aiter's CI...

kokolchin pushed a commit to kokolchin/rocm-libraries that referenced this pull request Mar 11, 2026
arai713 pushed a commit that referenced this pull request Mar 11, 2026
DDEle added a commit to ROCm/flash-attention that referenced this pull request Mar 12, 2026
rocking5566 added a commit to ROCm/flash-attention that referenced this pull request Mar 12, 2026
tridao pushed a commit to Dao-AILab/flash-attention that referenced this pull request Mar 18, 2026
…el API changes (#2363)

* update ck

* update ck

* before gpt-oss sink

* gpt-oss sink

* Add missing parameter

* Fix typo

* Update to ROCm/composable_kernel@b09112b

* add -Wno-unknown-warning-option

* Update to ROCm/rocm-libraries#4368 (ROCm/rocm-libraries@17f7dfc)

* Update to ROCm/rocm-libraries@a358a21

---------

Co-authored-by: Ding, Yi <yi.ding@amd.com>
Co-authored-by: Yi DING <andy-ding@outlook.com>
jovanau pushed a commit to jovanau/rocm-libraries that referenced this pull request Mar 19, 2026
johannes-graner pushed a commit that referenced this pull request Mar 20, 2026
zhuochenKIDD pushed a commit to zhuochenKIDD/flash-attention that referenced this pull request Mar 25, 2026