14 changes: 11 additions & 3 deletions csrc/cpp_itfs/mha_bwd.cu
@@ -371,15 +371,21 @@ float fmha_v3_bwd(mha_bwd_args a, const ck_tile::stream_config& s)
AiterAsmKernel* impl_ptr_post = nullptr;
static std::unordered_map<std::string, std::unique_ptr<AiterAsmKernel>> impl_ptr_map;

// Include device ID in cache key so each GPU gets its own loaded module
int current_device;
HIP_CALL(hipGetDevice(&current_device));
std::string dev_prefix = std::to_string(current_device) + ":";
Comment on lines 372 to +377
Copilot AI Mar 20, 2026

impl_ptr_map is a process-wide static std::unordered_map that is mutated via emplace() without any synchronization. In multi-GPU setups it is common for different host threads to drive different devices concurrently, and unsynchronized concurrent inserts and reads on an unordered_map are a data race, which is undefined behavior (it can crash or corrupt the cache). Consider making this cache thread_local (as in fmha_fwd_v3) or protecting all accesses with a mutex (or another thread-safe cache).
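
To make the mutex option concrete, here is a minimal host-only sketch of a process-wide cache whose lookups and inserts are serialized by a single lock. It is an illustration under stated assumptions, not the PR's code: `FakeKernel` is a hypothetical stand-in for AiterAsmKernel, `KernelCache` and `get_or_load` are invented names, and the device id is passed in explicitly rather than read with hipGetDevice() so the example compiles without HIP.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for AiterAsmKernel: it only records its constructor arguments.
struct FakeKernel
{
    std::string name;
    std::string co_name;
    FakeKernel(const char* n, const char* co) : name(n), co_name(co) {}
};

// Hypothetical helper (not part of the PR): a shared cache keyed by
// "<device>:<kernel name>" in which every access happens under one mutex.
class KernelCache
{
public:
    FakeKernel* get_or_load(int device, const char* name, const char* co_name)
    {
        assert(name != nullptr && co_name != nullptr);
        const std::string key = std::to_string(device) + ":" + name;

        std::lock_guard<std::mutex> lock(mutex_);
        auto result = map_.emplace(key, nullptr);
        if(result.second)
        {
            // First request for this (device, kernel) pair: construct while holding the lock.
            result.first->second = std::make_unique<FakeKernel>(name, co_name);
        }
        return result.first->second.get();
    }

    std::size_t size() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return map_.size();
    }

private:
    mutable std::mutex mutex_;
    std::unordered_map<std::string, std::unique_ptr<FakeKernel>> map_;
};
```

Constructing the kernel under the lock keeps the sketch simple but serializes module loading across threads; a thread_local map avoids the lock entirely at the cost of loading each module once per thread.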


auto it_pre = pre_cfgs->find(pre_kernel);
if(it_pre != pre_cfgs->end())
{
const auto& cfg = it_pre->second;
const char* name = cfg.knl_name.c_str();
const char* co_name = cfg.co_name.c_str();
ts_odo = cfg.ts;
std::string key = dev_prefix + name;

- auto result = impl_ptr_map.emplace(name, nullptr);
+ auto result = impl_ptr_map.emplace(key, nullptr);
if(result.second)
{
result.first->second = std::make_unique<AiterAsmKernel>(name, co_name);
@@ -399,8 +405,9 @@ float fmha_v3_bwd(mha_bwd_args a, const ck_tile::stream_config& s)
const char* name = cfg.knl_name.c_str();
const char* co_name = cfg.co_name.c_str();
ts_kv = cfg.ts;
std::string key = dev_prefix + name;

- auto result = impl_ptr_map.emplace(name, nullptr);
+ auto result = impl_ptr_map.emplace(key, nullptr);
if(result.second)
{
result.first->second = std::make_unique<AiterAsmKernel>(name, co_name);
@@ -422,8 +429,9 @@ float fmha_v3_bwd(mha_bwd_args a, const ck_tile::stream_config& s)
const char* name = cfg.knl_name.c_str();
const char* co_name = cfg.co_name.c_str();
ts_dq = cfg.ts;
std::string key = dev_prefix + name;

- auto result = impl_ptr_map.emplace(name, nullptr);
+ auto result = impl_ptr_map.emplace(key, nullptr);
if(result.second)
{
result.first->second = std::make_unique<AiterAsmKernel>(name, co_name);
8 changes: 7 additions & 1 deletion csrc/cpp_itfs/mha_fwd.cu
@@ -242,11 +242,17 @@ float fmha_fwd_v3(mha_fwd_args a, const ck_tile::stream_config& s)
static thread_local std::unordered_map<std::string, std::unique_ptr<AiterAsmKernel>>
impl_ptr_map;

// Include device ID in cache key so each GPU gets its own loaded module
int current_device;
HIP_CALL(hipGetDevice(&current_device));
std::string dev_prefix = std::to_string(current_device) + ":";

const auto& cfg = it->second;
const char* name = cfg.knl_name.c_str();
std::string co_name = get_kernel_co_name(cfg.co_name, arch_id);
std::string key = dev_prefix + name;
Comment on lines +245 to +253
Copilot AI Mar 20, 2026

get_kernel_co_name() relies on is_mi308_device(), which (via get_pci_chip_id() in aiter_hip_common.h) caches the PCI chip id in a function-static value based on whichever device was current on first call. In a multi-GPU process with heterogeneous devices, this can pick the wrong hsaco subfolder (MI300 vs MI308) for subsequent devices. Consider making the MI308/MI300 selection device-specific (e.g., keyed by hipGetDevice() or by querying properties for the current device each time / caching per device id).

Suggested change
- // Include device ID in cache key so each GPU gets its own loaded module
- int current_device;
- HIP_CALL(hipGetDevice(&current_device));
- std::string dev_prefix = std::to_string(current_device) + ":";
- const auto& cfg = it->second;
- const char* name = cfg.knl_name.c_str();
- std::string co_name = get_kernel_co_name(cfg.co_name, arch_id);
- std::string key = dev_prefix + name;
+ // Include device ID and co_name in cache key so each GPU/variant gets its own loaded module
+ int current_device;
+ HIP_CALL(hipGetDevice(&current_device));
+ const auto& cfg = it->second;
+ const char* name = cfg.knl_name.c_str();
+ std::string co_name = get_kernel_co_name(cfg.co_name, arch_id);
+ std::string key = std::to_string(current_device) + ":" + co_name + ":" + name;
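
The "caching per device id" alternative mentioned in the comment could look roughly like the sketch below. It is a host-only illustration, not the actual get_pci_chip_id() implementation: `PerDeviceChipId` is a hypothetical class, and `query_fn` stands in for whatever call reads the chip id of a given device, so the example compiles without HIP. The point is only that the cached value is keyed by device id instead of being stored in a single function-static.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <unordered_map>
#include <utility>

// Hypothetical per-device cache (assumption, not the aiter_hip_common.h code).
// query_fn is an injected callable that returns the chip id for a device id.
class PerDeviceChipId
{
public:
    explicit PerDeviceChipId(std::function<int(int)> query_fn) : query_(std::move(query_fn))
    {
        assert(static_cast<bool>(query_)); // an empty callable would throw on first use
    }

    // Returns the chip id for `device`, querying it at most once per device.
    int get(int device)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = ids_.find(device);
        if(it == ids_.end())
        {
            it = ids_.emplace(device, query_(device)).first;
        }
        return it->second;
    }

private:
    std::function<int(int)> query_;
    std::mutex mutex_;
    std::unordered_map<int, int> ids_;
};
```

With something like this in place, the MI300 versus MI308 decision would be made from the chip id of the device that is current at the time of the call, so a heterogeneous multi-GPU process would not reuse the first device's answer.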


- auto result = impl_ptr_map.emplace(name, nullptr);
+ auto result = impl_ptr_map.emplace(key, nullptr);
if(result.second)
{
result.first->second = std::make_unique<AiterAsmKernel>(name, co_name.c_str());