[CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on gfx950#4368

Merged
ex-rzr merged 71 commits into develop from users/ex-rzr/ck/fmha-mx-support-on-gfx950
Mar 11, 2026

Conversation

@ex-rzr
Contributor

@ex-rzr ex-rzr commented Feb 6, 2026

Motivation

Microscaling types (mxfp8 and mxfp4) for the fwd qr pipeline.

Technical Details

Microscaling is used when the quant scale mode is `BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are fp8/bf8/fp4.

Supported features:

  • only "qr" pipeline is implemented ("qr_async" will be added later as a separate PR)
  • fp8, bf8 and fp4 types
    • mixed combinations like fp8/bf8 are possible but have never been tested
    • mixed combinations like fp8/fp4 require some additional work to be supported
  • hdim 128 and 256 (smaller hdims are not possible due to restrictions of the "qr" pipeline, but they can be computed using instances with padding)
  • both 32x32x64 and 16x16x128 scale MFMAs are supported
  • Q and K scales are applied along hdim, V scales along the seqlen dimension
  • column-major V only
  • batch and group mode
  • bias, Alibi (tested but no instances by default, just like fp8)
  • masking etc.

Aiter PR with new API args: ROCm/aiter#2008

Test Plan

```
ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8
ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4
```

Test Result

The tests must pass.

Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

ex-rzr added 30 commits February 6, 2026 15:45
With "uf" results are too close to zero, so with a high tolerance
even completely incorrect values pass the check.
LSE is computed and stored in fp32, so there is no precision loss
due to conversion.
Also checking it before OUT helps to understand where the error happens
(first or second GEMM).
…t scales

hdim 128 and 256 and "qr" pipeline pass simple tests.
Almost unchanged, MX-related changes will be done later.
This change breaks the second GEMM with 32x32x64.
K and the K scales need shuffling so that P + P scales and V + V scales have consistent distributions.
It does not need `* (1 / scale)`.
This requires multiplications for P and then O.
It may not be very important for fp8 because almost half of the positive values
are in the [0..1] range.
But for fp4 it is necessary because only 3 of the 8 positive values are in this
range: 0, 0.5, 1.
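The fp4 range claim above can be verified by enumerating the format directly. A minimal Python sketch (not CK code; it assumes the standard OCP e2m1 layout: 2 exponent bits, 1 mantissa bit, subnormals at exponent 0):

```python
# Enumerate the 8 non-negative OCP fp4 (e2m1) encodings and count how
# many land in [0, 1] -- the range that holds softmax probabilities.

def fp4_e2m1_value(bits: int) -> float:
    """Decode a 3-bit e2m1 magnitude (exponent: 2 bits, mantissa: 1 bit)."""
    exp = (bits >> 1) & 0b11
    man = bits & 0b1
    if exp == 0:                      # subnormal encodings: 0 and 0.5
        return man * 0.5
    return (1.0 + man * 0.5) * 2.0 ** (exp - 1)

values = sorted(fp4_e2m1_value(b) for b in range(8))
print(values)                         # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
in_unit = [v for v in values if 0.0 <= v <= 1.0]
print(len(in_unit))                   # 3
```

Of the eight magnitudes only 0, 0.5 and 1.0 fall in [0, 1], which is why rescaling matters so much more for fp4 than for fp8.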
Warning! Policy is not yet updated for packed types, which likely means:
* LDS size is twice as large as actually needed;
* DRAM/LDS size granularity is not optimal (64 bit, not 128).
By default, there are no fp8/bf8 instances with bias, but if they are
enabled for testing purposes, the bias tests fail.
BiasDataType can be as large as fp32, and numeric<BiasDataType>::max() is
too large. Bias is applied to the S matrix, whose values are quite small,
so bias values must have a similar magnitude.
Instead of shuffling K and K scales using DRAM views, it's now done in
BlockGemmMx so the kernel/pipeline don't need to know about such details.
This also solves the issue with masking, bias and alibi because they
work with the layout of the first GEMM's C matrix which is now also
modified accordingly.
Even with padding, alignment must be at least 2.
The default invalid_element_value (numeric<e8m0_t>::zero()) is NaN;
this causes NaNs in MFMA results when inputs are padded.
The algorithm described in the "OCP Microscaling Formats (MX) Specification"
has flaws:
 * it requires clamping, which is not done automatically by
   v_scalef32_pk_fp8_f32;
 * for fp4 it has a high quantization error, e.g. max_abs = 0.99 is
   quantized to 6.0, which is 0.75 after dequantization (especially bad
   for softmax results in (0; 1]).
The previous implementation was better than the OCP one, but it didn't
use the whole fp4 range for some values, losing precision for small
values in the block.
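The fp4 quantization error described above can be reproduced with a small sketch of OCP-style power-of-two scale selection. This is a hedged illustration under stated assumptions (e2m1 max magnitude 6.0, emax 2), not the CK or hardware implementation; `ocp_mx_quantize` and its round-to-grid step are hypothetical helpers:

```python
import math

FP4_MAX = 6.0    # largest e2m1 magnitude
FP4_EMAX = 2     # exponent of the largest power of two <= FP4_MAX

def ocp_mx_quantize(max_abs: float, value: float) -> tuple[float, float]:
    """OCP-style block quantization: pick a shared power-of-two scale from
    the block max, divide, clamp, round to the fp4 grid.
    Returns (quantized element, dequantized value)."""
    # shared scale: 2^(floor(log2(max_abs)) - emax)
    scale = 2.0 ** (math.floor(math.log2(max_abs)) - FP4_EMAX)
    q = value / scale
    # clamping is NOT automatic in v_scalef32_pk_fp8_f32, so it is explicit here
    q = max(-FP4_MAX, min(FP4_MAX, q))
    # round to the nearest representable e2m1 magnitude (sketch)
    grid = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
    q = min(grid, key=lambda g: abs(g - abs(q))) * (1.0 if q >= 0 else -1.0)
    return q, q * scale

q, deq = ocp_mx_quantize(0.99, 0.99)
print(q, deq)   # 6.0 0.75
```

Here max_abs = 0.99 yields scale = 2^-3 = 0.125, so 0.99 / 0.125 = 7.92 clamps to 6.0 and dequantizes to 0.75, a large error for a softmax probability near 1.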
A vector of e8m0_t cannot be constructed from e8m0_t because it has two
conversion operators: to float and to e8m0_t::type.
@bartekxk
Contributor

The amd buffer load part looks good to me.

@ex-rzr
Contributor Author

ex-rzr commented Mar 9, 2026

@poyenc, @DDEle Hi, do you have other review comments?

Contributor

@DDEle DDEle left a comment


Comment thread on projects/composablekernel/test/ck_tile/fmha/CMakeLists.txt
@ex-rzr ex-rzr enabled auto-merge (squash) March 11, 2026 09:38
@ex-rzr ex-rzr merged commit 17f7dfc into develop Mar 11, 2026
45 of 47 checks passed
@ex-rzr ex-rzr deleted the users/ex-rzr/ck/fmha-mx-support-on-gfx950 branch March 11, 2026 09:59
assistant-librarian bot pushed a commit to ROCm/composable_kernel that referenced this pull request Mar 11, 2026
@ex-rzr
Contributor Author

ex-rzr commented Mar 11, 2026

My previous manual run of Aiter tests: pipeline

ROCm/aiter#2008 has been updated to the latest CK commit, waiting for Aiter's CI...

kokolchin pushed a commit to kokolchin/rocm-libraries that referenced this pull request Mar 11, 2026
arai713 pushed a commit that referenced this pull request Mar 11, 2026
DDEle added a commit to ROCm/flash-attention that referenced this pull request Mar 12, 2026
rocking5566 added a commit to ROCm/flash-attention that referenced this pull request Mar 12, 2026
tridao pushed a commit to Dao-AILab/flash-attention that referenced this pull request Mar 18, 2026
…el API changes (#2363)

* update ck

* update ck

* before gpt-oss sink

* gpt-oss sink

* Add missing parameter

* Fix typo

* Update to ROCm/composable_kernel@b09112b

* add -Wno-unknown-warning-option

* Update to ROCm/rocm-libraries#4368 (ROCm/rocm-libraries@17f7dfc)

* Update to ROCm/rocm-libraries@a358a21

---------

Co-authored-by: Ding, Yi <yi.ding@amd.com>
Co-authored-by: Yi DING <andy-ding@outlook.com>
jovanau pushed a commit to jovanau/rocm-libraries that referenced this pull request Mar 19, 2026
johannes-graner pushed a commit that referenced this pull request Mar 20, 2026
zhuochenKIDD pushed a commit to zhuochenKIDD/flash-attention that referenced this pull request Mar 25, 2026