Conversation

@nsingh-habana
No description provided.

@nsingh-habana nsingh-habana marked this pull request as draft October 24, 2025 02:48
@nsingh-habana nsingh-habana changed the title Integrate new mma_atoms and copy_atoms into bmg_grouped_gemm_fp8 New mma_atoms and copy_atoms in bmg_grouped_gemm_fp8 Oct 24, 2025
@nsingh-habana nsingh-habana force-pushed the grouped_gemm_fp8_new_atoms branch from 0959d75 to 1d6d2f7 Compare October 24, 2025 08:39
Comment on lines +217 to +218
Tensor tCrA_fp16 = make_fragment_like<half_t>(tCrA);
Tensor tCrB_fp16 = make_fragment_like<half_t>(tCrB);
@sanchitintel sanchitintel Oct 25, 2025

@rolandschulz & @petercad, please advise whether such a redesign would make sense -

For FP8xFP8 GEMM on Xe2, FP8 is converted to FP16.
Reorders in the new API let multiple dtypes share the same GEMM (MMA collectives) code, and they're no-ops when no dtype conversion is needed. So perhaps the FP8 path could reuse that shared code? I could be wrong, but this seems to be (at least part of) what @petercad had in mind regarding the rearchitecture.

While it's indeed possible to use if constexpr with compile-time-evaluated expressions to support multiple dtypes' GEMMs in the same file, reorders seem to make things even simpler. It was previously decided to favor readability and debuggability over reducing code duplication, which is why the legacy FP8xFP8 GEMM currently has a separate implementation with duplicated code.

@nsingh-habana, please also share your thoughts on it.

Thanks!
