Skip to content

[gfx906] Collection of fixes for MI50/MI60 (non-MFMA) GPUs#4289

Closed
assistant-librarian[bot] wants to merge 6 commits into
developfrom
import/develop/ROCm_composable_kernel/pr-3593
Closed

[gfx906] Collection of fixes for MI50/MI60 (non-MFMA) GPUs#4289
assistant-librarian[bot] wants to merge 6 commits into
developfrom
import/develop/ROCm_composable_kernel/pr-3593

Conversation

@assistant-librarian
Copy link
Copy Markdown
Contributor

Summary

This PR is an aggregation of fixes discovered while working with ComposableKernel on gfx906 (MI50/MI60) GPUs. These GPUs don't have MFMA instructions, so they rely on the DeviceGemmDl path which has some edge cases that aren't well-tested.

Note: This is a draft PR that will be updated as we discover more issues.


Fix 1: Buffer Load OOB Crash with Large K and Small M

Problem

DeviceGemmDl crashes on gfx906 when K >= 1472 with small M (e.g., M=1 decode case in LLM inference).

The crash occurs in gridwise_gemm_dl_v1r3.hpp during block_sync_lds() after an invalid buffer load.

Root Cause

CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK was disabled by default (set to 0).

Without the offset trick:

  1. Buffer loads execute unconditionally
  2. Bounds check happens after the load returns
  3. If the load address is unmapped, the GPU crashes before the bounds check matters

With the offset trick enabled:

  1. Invalid coordinates get 0x80000000 added to offset
  2. This flags the load as OOB to hardware
  3. Hardware safely returns zero instead of accessing unmapped memory

Solution

  1. include/ck/ck.hpp: Enable CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK by default
  2. include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_v5r1.hpp: Use coordinate_has_valid_offset() instead of coordinate_has_valid_offset_assuming_visible_index_is_valid() for full bounds validation

Verification

INT8 GEMM tests pass for:

  • M=1 (single-row decode)
  • K up to 14336
  • FFN projection dimensions for Qwen2.5 and Llama3

Fix 2: GridwiseGemmDlMultipleD Element Op Type Mismatch (FloatAcc != FloatC)

Problem

When FloatAcc differs from FloatC (e.g., INT8×INT8→INT32 accumulator with FP32 output scaling), the CDE element op is invoked with wrong storage types.

The element op contract is: (E& e, const C& c, const D& d...) where:

  • E = FloatC (final output type, e.g., float)
  • C = FloatAcc (accumulator type, e.g., int32_t)

Root Cause

Original code at lines 615-618 used generate_tie() returning the same c_thread_buf for both E& and C&:

auto dst_data_refs = generate_tie(
    [&](auto) -> auto& { return c_thread_buf(Number<c_offset>{}); },
    Number<2>{});

This causes:

  1. Type mismatch when FloatAcc != FloatC (element op expects float& for e, gets int32_t&)
  2. Compile errors with strictly-typed element ops
  3. Undefined behavior during ThreadwiseTensorSliceTransfer which type-puns FloatAcc bits as FloatC

This bug has existed since the file was created in December 2022 (PR #517).

Solution

include/ck/tensor_operation/gpu/grid/gridwise_gemm_dl_multiple_d.hpp:

  1. Introduce separate e_thread_buf<FloatC> for element op output
  2. Pass (E& e) from e_thread_buf and (const C& c) from c_thread_buf using tie()
  3. Transfer e_thread_buf (not c_thread_buf) to global memory

Minimal Repro

See original PR #3565 for compile-time repro that demonstrates the type mismatch.


Environment

  • GPU: gfx906 (MI50)
  • ROCm: 7.1.1
  • Use case: INT8×INT8→INT32 GEMM with FP32 output for LLM inference

🔁 Imported from ROCm/composable_kernel#3593
🧑‍💻 Originally authored by @dbsanfte

dbsanfte and others added 5 commits January 16, 2026 11:32
Problem:
DeviceGemmDl crashes on gfx906 when K >= 1472 with small M (M=1 decode case).

Root cause:
CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK was disabled by default.
Without this, invalid buffer loads execute and crash before bounds checking
can prevent them.

Solution:
1. Enable the OOB offset trick (0x80000000) so invalid coordinates safely
   return zero instead of accessing unmapped memory
2. Use full coordinate_has_valid_offset() check instead of the
   _assuming_visible_index_is_valid variant for proper K bounds validation

Verified with INT8 GEMM tests: M=1 decode, K=14336, FFN projections.
Problem:
When FloatAcc differs from FloatC (e.g., INT8×INT8→INT32 accumulator with
FP32 output scaling), the CDE element op is invoked with wrong storage types.

The element op contract is: (E& e, const C& c, const D& d...) where:
- E = FloatC (final output type, e.g., float)
- C = FloatAcc (accumulator type, e.g., int32_t)

Original code used generate_tie() returning the same c_thread_buf for both
E& and C&, which:
1. Violates the element op signature when types differ
2. Causes compile errors with strictly-typed element ops
3. Results in undefined behavior during ThreadwiseTensorSliceTransfer

Solution:
Introduce separate e_thread_buf<FloatC> for element op output, pass
(E& e) from e_thread_buf and (const C& c) from c_thread_buf, then
transfer e_thread_buf to global memory.

Bug has existed since the file was created in December 2022 (PR #517).
- Add CK_GFX906_DEBUG macro for conditional debug output
- Log GEMM parameters (M, N, K, strides) for gfx906 devices
- Track which device GEMM variants are being invoked
- Helps diagnose launch bounds and occupancy issues on older GCN
- Comment out always-on std::cout debug spam in device_gemm_multiple_d_dl.hpp
- Add optional CK_DEBUG_KERNEL-gated logging in gridwise_gemm_dl_v1r3.hpp
- Fixes console spam on every GEMM call for gfx906 devices
@assistant-librarian assistant-librarian Bot added the external contribution Code contribution from users community.. label Feb 3, 2026
@illsilin
Copy link
Copy Markdown
Contributor

Hey @dbsanfte, are you still working on this or can we close it?

@github-actions
Copy link
Copy Markdown
Contributor

This pull request has been inactive for 25 days and will be marked as stale.

If you would like to keep this PR open, please:

  • Add new commits
  • Add a comment explaining why it should remain open

This PR will be automatically closed in 5 days if no further activity occurs.

@github-actions github-actions Bot added the Stale PR has no activity for 25+ days label May 13, 2026
@github-actions
Copy link
Copy Markdown
Contributor

This pull request has been automatically closed due to inactivity (30 days with no updates).

If you'd like to continue working on this, feel free to reopen the PR or create a new one.

@github-actions github-actions Bot closed this May 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

external contribution Code contribution from users community.. imported pr project: composablekernel Stale PR has no activity for 25+ days

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants