Fix kernel map collision on MGPU context by Micky774 · Pull Request #2401 · ROCm/aiter

Micky774 · 2026-03-20T17:29:07Z

Motivation

Note: This is not necessary with #2221 so I do think ideally that gets merged instead. This is just required as long as we use the standard hip module load.

In TE's JAX integration, the MGPU tests fail because multiple devices register separate kernels in the shared impl_ptr_map. A device can then attempt to launch a kernel that was loaded by a different device, which results in a mismatched device ordinal launch error.

Technical Details

Adds the current HIP device ID to the impl_ptr_map key, so each device gets its own cached kernel entry.
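
As a rough illustration of the keying change, the sketch below builds a `"<device>:<kernel name>"` key and uses it in a cache. It is not the aiter implementation: `FakeKernel`, `make_cache_key`, `get_or_load`, and the kernel name used in the usage note are hypothetical stand-ins, and the device ordinal is passed in explicitly rather than read with `hipGetDevice()` so the example compiles without HIP.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a loaded kernel module; it records which device owns it.
struct FakeKernel {
    int device;
    std::string name;
};

using KernelMap = std::unordered_map<std::string, std::unique_ptr<FakeKernel>>;

// Build "<device>:<kernel name>", mirroring the dev_prefix + name key in the diff.
inline std::string make_cache_key(int device, const std::string& kernel_name) {
    return std::to_string(device) + ":" + kernel_name;
}

// Return the cached kernel for (device, kernel_name), creating it on first use.
// In real code the creation step would load the module on the current device.
inline FakeKernel* get_or_load(KernelMap& cache, int device, const std::string& kernel_name) {
    const std::string key = make_cache_key(device, kernel_name);
    auto it = cache.find(key);
    if (it == cache.end()) {
        it = cache.emplace(key, std::make_unique<FakeKernel>(FakeKernel{device, kernel_name})).first;
    }
    return it->second.get();
}
```

With a name-only key, a second device asking for the same kernel name would receive the entry created by the first device. With the device-prefixed key, `get_or_load(cache, 0, "fmha_fwd")` and `get_or_load(cache, 1, "fmha_fwd")` resolve to separate entries.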

Test Plan

Verify TE JAX integration tests pass

Test Result

TE JAX integration tests do indeed pass

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

github-actions · 2026-03-20T17:29:20Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
|---|---|
| `ci:sglang` | SGLang integration tests |
| `ci:atom` | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| `ci:vllm` | vLLM benchmark |
| `ci:all` | All of the above |

Add labels via the sidebar or `gh pr edit 2401 --add-label <label>`

Copilot

Pull request overview

Fixes multi-GPU (MGPU) kernel-cache collisions in the FMHA v3 ASM path by ensuring cached kernel modules are keyed per device, preventing cross-device module reuse that can trigger “mismatched device ordinal launch” failures (notably in TE’s JAX integration).

Changes:

Add current HIP device ordinal to the FMHA forward kernel cache key.
Add current HIP device ordinal to the FMHA backward kernel cache key (pre/dqdkdv/post kernels).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| `csrc/cpp_itfs/mha_fwd.cu` | Prefixes the forward ASM kernel cache key with the current device id to avoid cross-device cache collisions. |
| `csrc/cpp_itfs/mha_bwd.cu` | Prefixes the backward ASM kernel cache key with the current device id for the same MGPU safety. |


Copilot · 2026-03-20T17:34:22Z

+    // Include device ID in cache key so each GPU gets its own loaded module
+    int current_device;
+    HIP_CALL(hipGetDevice(&current_device));
+    std::string dev_prefix = std::to_string(current_device) + ":";
+
    const auto& cfg     = it->second;
    const char* name    = cfg.knl_name.c_str();
    std::string co_name = get_kernel_co_name(cfg.co_name, arch_id);
+    std::string key     = dev_prefix + name;


get_kernel_co_name() relies on is_mi308_device(), which (via get_pci_chip_id() in aiter_hip_common.h) caches the PCI chip id in a function-static value based on whichever device was current on first call. In a multi-GPU process with heterogeneous devices, this can pick the wrong hsaco subfolder (MI300 vs MI308) for subsequent devices. Consider making the MI308/MI300 selection device-specific (e.g., keyed by hipGetDevice() or by querying properties for the current device each time / caching per device id).

Suggested change

-    // Include device ID in cache key so each GPU gets its own loaded module
-    int current_device;
-    HIP_CALL(hipGetDevice(&current_device));
-    std::string dev_prefix = std::to_string(current_device) + ":";
-    const auto& cfg = it->second;
-    const char* name = cfg.knl_name.c_str();
-    std::string co_name = get_kernel_co_name(cfg.co_name, arch_id);
-    std::string key = dev_prefix + name;
+    // Include device ID and co_name in cache key so each GPU/variant gets its own loaded module
+    int current_device;
+    HIP_CALL(hipGetDevice(&current_device));
+    const auto& cfg = it->second;
+    const char* name = cfg.knl_name.c_str();
+    std::string co_name = get_kernel_co_name(cfg.co_name, arch_id);
+    std::string key = std::to_string(current_device) + ":" + co_name + ":" + name;
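
The per-device caching idea raised in the review comment on get_pci_chip_id() could look roughly like the sketch below. This is only an illustration under assumptions: `PerDeviceChipIdCache` and its injected `query` callback are hypothetical and not part of aiter_hip_common.h, and the callback stands in for whatever device-property query the real code would perform, so the example compiles without HIP.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <unordered_map>
#include <utility>

// Hypothetical cache that remembers one chip id per device ordinal, instead of a
// single function-static value captured from whichever device was current first.
class PerDeviceChipIdCache {
public:
    // `query` is an assumed callback that returns the chip id for a device ordinal.
    explicit PerDeviceChipIdCache(std::function<std::uint32_t(int)> query)
        : query_(std::move(query)) {}

    // Return the chip id for `device`, querying it only on the first request.
    std::uint32_t get(int device) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = ids_.find(device);
        if (it == ids_.end()) {
            it = ids_.emplace(device, query_(device)).first;
        }
        return it->second;
    }

private:
    std::function<std::uint32_t(int)> query_;
    std::mutex mu_;
    std::unordered_map<int, std::uint32_t> ids_;
};
```

A caller would pass the current device ordinal on each lookup, so a heterogeneous multi-GPU process gets the id that belongs to the device it is about to launch on.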

Copilot · 2026-03-20T17:34:23Z

    static std::unordered_map<std::string, std::unique_ptr<AiterAsmKernel>> impl_ptr_map;

+    // Include device ID in cache key so each GPU gets its own loaded module
+    int current_device;
+    HIP_CALL(hipGetDevice(&current_device));
+    std::string dev_prefix = std::to_string(current_device) + ":";


impl_ptr_map is a process-wide static std::unordered_map that is mutated via emplace() without any synchronization. In MGPU setups it’s common for different host threads to run on different devices concurrently, and concurrent inserts/reads on unordered_map are undefined behavior (can crash/corrupt the cache). Consider making this cache thread_local (as in fmha_fwd_v3) or protecting all accesses with a mutex (or another thread-safe cache).
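
One of the two options mentioned above, guarding the shared map with a mutex, might be sketched as follows. The names `StubKernel` and `LockedKernelCache` are hypothetical, and module loading is replaced by constructing a stub so the example builds without HIP; it is not the code in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a loaded ASM kernel.
struct StubKernel {
    int device;
};

// Process-wide cache whose lookups and inserts are serialized by one mutex.
class LockedKernelCache {
public:
    // Find or create the entry for (device, name) while holding the lock.
    StubKernel* get_or_load(int device, const std::string& name) {
        const std::string key = std::to_string(device) + ":" + name;
        std::lock_guard<std::mutex> lock(mu_);
        auto it = map_.find(key);
        if (it == map_.end()) {
            it = map_.emplace(key, std::make_unique<StubKernel>(StubKernel{device})).first;
        }
        // The pointer stays valid after unlock because the map owns the kernel
        // through a unique_ptr and entries are never erased in this sketch.
        return it->second.get();
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mu_);
        return map_.size();
    }

private:
    mutable std::mutex mu_;
    std::unordered_map<std::string, std::unique_ptr<StubKernel>> map_;
};
```

The other option, declaring the map `thread_local`, avoids the lock entirely but gives every host thread its own copy of each loaded kernel, so the choice trades lock contention against duplicate loads.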

Micky774 · 2026-03-24T17:30:09Z

Note: This is not necessary with #2221 so I do think ideally that gets merged instead. This is just required as long as we use the standard hip module load.

Fix mgpu kernel map collision

247995d

Micky774 requested review from a team and Copilot March 20, 2026 17:29

Copilot started reviewing on behalf of Micky774 March 20, 2026 17:30

Copilot AI reviewed Mar 20, 2026


Micky774 mentioned this pull request Mar 20, 2026

Update AITER subcommit and refactor internal AITER/CK FA API usage ROCm/TransformerEngine#446

Merged

13 tasks

valarLip requested a review from JaxChen29 March 21, 2026 04:59

Micky774 wants to merge 1 commit into main from zain/um-mgpu-collision


2 participants
