[CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on gfx950#4368
Merged
Conversation
With "uf" initialization, the results are too close to zero, so with a high tolerance even completely incorrect values pass the check.
LSE is computed and stored in fp32, so there is no precision loss from conversion. Checking it before OUT also helps localize where the error happens (first or second GEMM).
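A host-side check in this spirit can be sketched as follows (a hypothetical helper with illustrative tolerances, not the actual test code):

```python
import numpy as np

def locate_fmha_error(lse_ref, lse, out_ref, out):
    """Check LSE first (stored in fp32, so a tight tolerance applies),
    then OUT, to see which GEMM introduced the error.
    Function name and tolerances are illustrative."""
    if not np.allclose(lse, lse_ref, atol=1e-3):
        return "first GEMM (S = Q @ K^T / softmax) is wrong"
    if not np.allclose(out, out_ref, rtol=5e-2, atol=5e-2):
        return "second GEMM (O = P @ V) is wrong"
    return "ok"
```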
…t scales hdim 128 and 256 and "qr" pipeline pass simple tests.
Almost unchanged, MX-related changes will be done later.
This change breaks the second GEMM with 32x32x64.
K and K scales need shuffling so P+P scales and V+V scales have consistent distributions.
It does not need `* (1 / scale)`.
This requires multiplications for P and then O. It may not be very important for fp8, because almost half of its positive values lie in the [0..1] range, but for fp4 it is necessary because only 3 of its 8 magnitudes lie in this range: 0, 0.5, 1.
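The fp4 (e2m1) magnitude set from the OCP MX spec makes this concrete:

```python
# e2m1 (fp4) representable magnitudes per the OCP MX spec:
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

in_unit_interval = [v for v in E2M1_MAGNITUDES if v <= 1.0]
# Only 3 of the 8 magnitudes (0, 0.5, 1.0) land in [0, 1], where the
# softmax output P lives, so rescaling P into the full fp4 range matters.
```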
Warning! The policy is not yet updated for packed types, which likely means:
* LDS size is twice as large as actually needed;
* DRAM/LDS access granularity is not optimal (64-bit, not 128-bit).
By default, there are no fp8/bf8 instances with bias, but if they are enabled for testing purposes, the bias tests fail: BiasDataType can be as large as fp32, and numeric<BiasDataType>::max() is too large. Bias is applied to the S matrix, whose values are quite small, so bias values must have a similar magnitude.
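A sketch of test-input generation under this constraint (the helper name and the [-1, 1] range are illustrative assumptions, not the actual test code):

```python
import numpy as np

def make_bias(shape, s_scale=1.0, seed=0):
    """Draw bias values with magnitude comparable to the S matrix
    (here assumed to be roughly s_scale after scaling), instead of
    numeric<BiasDataType>::max() -- fp32 max would overwhelm S."""
    rng = np.random.default_rng(seed)
    return (s_scale * rng.uniform(-1.0, 1.0, size=shape)).astype(np.float32)
```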
Instead of shuffling K and K scales using DRAM views, it's now done in BlockGemmMx so the kernel/pipeline don't need to know about such details. This also solves the issue with masking, bias and alibi because they work with the layout of the first GEMM's C matrix which is now also modified accordingly.
Even with padding, alignment must be at least 2.
The default invalid_element_value (`numeric<e8m0_t>::zero()`) is NaN, which causes NaNs in MFMA results when inputs are padded.
The algorithm described in the "OCP Microscaling Formats (MX) Specification" has flaws:
* it requires clamping, which v_scalef32_pk_fp8_f32 does not do automatically;
* for fp4 it has high quantization error, e.g. max_abs = 0.99 is quantized to 6.0, which is 0.75 after dequantization (especially bad for softmax results in (0; 1]).
The previous implementation was better than the OCP one, but it did not use the whole fp4 range for some inputs, losing precision for small values in the block.
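The fp4 example can be reproduced with a small model of the OCP scale computation (a sketch; rounding is simplified to nearest representable value):

```python
import math

E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # fp4 magnitudes
FP4_EMAX = 2  # exponent of the largest fp4 value: 6.0 = 1.5 * 2**2

def ocp_quant_dequant(x, max_abs):
    # OCP MX per-block scale: 2^(floor(log2(max_abs)) - emax_elem)
    scale = 2.0 ** (math.floor(math.log2(max_abs)) - FP4_EMAX)
    q = max(min(x / scale, 6.0), -6.0)  # clamp (the HW op does not)
    sign = -1.0 if q < 0 else 1.0
    q = sign * min(E2M1, key=lambda m: abs(m - abs(q)))  # round to fp4
    return q * scale

# max_abs = 0.99 gives scale = 2**-3; 0.99 / 0.125 = 7.92 clamps to 6.0,
# which dequantizes to 6.0 * 0.125 = 0.75 instead of ~0.99.
```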
A vector of e8m0_t cannot be constructed from e8m0_t because it has two conversion operators: from float and from e8m0_t::type.
bartekxk
approved these changes
Feb 27, 2026
Contributor
Part from amd buffer load looks good to me
Contributor
Author
DDEle
approved these changes
Mar 10, 2026
Contributor
Otherwise looks good.
Triggered aiter tests:

http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/users%2Fex-rzr%2Fck%2Ffmha-mx-support-on-gfx950/6/pipeline
assistant-librarian bot
pushed a commit
to ROCm/composable_kernel
that referenced
this pull request
Mar 11, 2026
[CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on gfx950 (#4368)

## Motivation

Microscaling types (mxfp8 and mxfp4) for the fwd qr pipeline.

## Technical Details

The microscaling is used when the quant scale mode is `BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are fp8/bf8/fp4.

Supported features:
* only the "qr" pipeline is implemented
* hdim 128 and 256 (smaller hdim is not possible due to restrictions of the "qr" pipeline, but it can be computed using instances with padding)
* both 32x32x64 and 16x16x128 scale MFMAs are supported
* Q and K scales are applied in hdim, V scales in the seqlen dimension
* column-major V only
* batch and group mode
* bias, Alibi (tested but no instances by default, just like fp8)
* masking etc.

Aiter PR with new API args: ROCm/aiter#2008

## Test Plan

```
ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8
ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4
```

## Test Result

The tests must pass.

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Contributor
Author
My previous manual run of Aiter tests: pipeline ROCm/aiter#2008 has been updated to the latest CK commit, waiting for Aiter's CI...
kokolchin
pushed a commit
to kokolchin/rocm-libraries
that referenced
this pull request
Mar 11, 2026
arai713
pushed a commit
that referenced
this pull request
Mar 11, 2026
DDEle
added a commit
to ROCm/flash-attention
that referenced
this pull request
Mar 12, 2026
rocking5566
added a commit
to ROCm/flash-attention
that referenced
this pull request
Mar 12, 2026
tridao
pushed a commit
to Dao-AILab/flash-attention
that referenced
this pull request
Mar 18, 2026
…el API changes (#2363) * update ck * update ck * before gpt-oss sink * gpt-oss sink * Add missing parameter * Fix typo * Update to ROCm/composable_kernel@b09112b * add -Wno-unknown-warning-option * Update to ROCm/rocm-libraries#4368 (ROCm/rocm-libraries@17f7dfc) * Update to ROCm/rocm-libraries@a358a21 --------- Co-authored-by: Ding, Yi <yi.ding@amd.com> Co-authored-by: Yi DING <andy-ding@outlook.com>
jovanau
pushed a commit
to jovanau/rocm-libraries
that referenced
this pull request
Mar 19, 2026
johannes-graner
pushed a commit
that referenced
this pull request
Mar 20, 2026
zhuochenKIDD
pushed a commit
to zhuochenKIDD/flash-attention
that referenced
this pull request
Mar 25, 2026