Conversation

@krzysz00
Contributor

Testing of the load to LDS PR discovered that we were setting a non-existent `__oclc_version` global instead of the intended `__oclc_ABI_version` when populating the constants needed for the device libraries.

@krzysz00 krzysz00 requested a review from kuhar as a code owner May 14, 2025 17:07
@benvanik benvanik enabled auto-merge (squash) May 14, 2025 17:25
@benvanik benvanik merged commit cab896d into iree-org:main May 14, 2025
40 of 41 checks passed
lialan added a commit that referenced this pull request May 28, 2025
## Summary
This PR sets the foundation for using the `global_load_lds` instruction to
load values from global to LDS memory. The pipeline is as follows:
* Only `linalg.copy` ops emitted by `GPUPromoteMatmulOperands` are converted.
When the pass sees fit, it attaches the `#iree_gpu.use_global_load_dma`
attribute to the `linalg.copy` to tag it through the rest of the pipeline
(a sketch of this gating condition follows the list).
* A tagged `linalg.copy` is not decomposed/tiled until bufferization.
* After distribution to threads and bufferization, the tagged `linalg.copy`
is lowered to a sequence of code built around the subgroup-coalesced loading
op `iree_gpu.global_load_dma`.
* `iree_gpu.global_load_dma` is then mapped to the `amdgpu.gather_to_lds` op,
which in turn is lowered to the corresponding ROCDL op.
* The pass that pads allocations to reduce bank conflicts is disabled,
because the destination workgroup memory has to be contiguous.
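
As a rough illustration of the gating described in the first bullet, here is a
minimal Python sketch of the decision as a predicate over a copy op. This is
not the actual pass code; every field name is hypothetical and stands in for
the corresponding check in the compiler.
```python
# Hypothetical sketch of the tagging condition summarized above -- not the
# actual IREE pass logic; the field names below are illustrative only.
def should_tag_for_global_load_dma(copy_op) -> bool:
    return (
        copy_op.emitted_by_gpu_promote_matmul_operands  # only copies created by operand promotion
        and copy_op.dest_in_workgroup_memory            # destination lives in LDS
        and copy_op.dest_is_contiguous                  # why the bank-conflict padding pass stays disabled
    )
```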

## Lowering `linalg.copy`
After bufferization and distribution to threads, the tagged `linalg.copy`
still exists in the IR:
```mlir
linalg.copy {lowering_config = #iree_gpu.use_global_load_dma}
  ins(%subview_12 : memref<64x128xi8, strided<[256, 1], offset: ?>, #amdgpu.address_space<fat_raw_buffer>>)
  outs(%alloc_4 : memref<64x128xi8, #gpu.address_space<workgroup>>)
```

Note that this `linalg.copy` is kept in each thread's code. The op is then
converted into an `scf.for` loop in which each subgroup of threads loads a
coalesced chunk of values. For example, assume there are `N` subgroups
loading from `tensor<a x b x c>`:
* The `i`-th subgroup loads a subtensor of size `[a/N, b, c]`, so each slice
is contiguous.
	* For now, assume a row-major layout and tile only the outermost dimension.
* We currently handle only `linalg.copy` ops emitted by
`GPUPromoteMatmulOperands` because we know their destinations are allocated
contiguously.
	* TODO: expand to arbitrary memref slices.
* Given `gpu.subgroup_id` and `gpu.lane_id`, each thread calculates the
contiguous data chunk that its subgroup is responsible for loading:
	* The chunk's indices are the delinearized indices of the input tensor,
ranging from
`affine.delinearize_index[gpu.subgroup_id * (num_elems_of(tensor) / num_subgroups)]`
to
`affine.delinearize_index[(gpu.subgroup_id + 1) * (num_elems_of(tensor) / num_subgroups) - 1]`.
* Assume each subgroup loads `n` values covering the linearized index range
`[N_f, N_b]`; then the thread with lane id `i` loads
`N_f + subgroup_size * iter + (i - 1)` for `iter = 0 to n` (see the Python
sketch after this list).
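
To make this arithmetic concrete, below is a minimal Python sketch (not the
pass implementation; the helper names are invented for illustration) of the
per-lane gather and per-subgroup store indices, using the constants from the
example that follows and assuming 0-based lane ids, which matches the affine
maps in the lowered IR.
```python
# Minimal model of the per-subgroup / per-lane index arithmetic described
# above. Constants follow the example below: workgroup size 256, subgroup
# size 64, an 8192-element i8 tile. Not the actual lowering code.
SUBGROUP_SIZE = 64
NUM_SUBGROUPS = 256 // SUBGROUP_SIZE                # 4 subgroups per workgroup
TOTAL_ELEMS = 128 * 64                              # 8192 elements in the copied tile
ELEMS_PER_SUBGROUP = TOTAL_ELEMS // NUM_SUBGROUPS   # 2048 contiguous elements per subgroup
ITERS = ELEMS_PER_SUBGROUP // SUBGROUP_SIZE         # 32 loop iterations per lane

def delinearize(linear, shape):
    """Row-major equivalent of affine.delinearize_index for a 2-D shape."""
    return linear // shape[1], linear % shape[1]

def lane_gather_plan(subgroup_id, lane_id, shape=(128, 64)):
    """Yield (source_indices, dest_base_indices) for each iteration of one lane."""
    for it in range(ITERS):
        # Linearized element this lane gathers in this iteration.
        src = subgroup_id * ELEMS_PER_SUBGROUP + it * SUBGROUP_SIZE + lane_id
        # Linearized base element the whole subgroup stores to in this iteration.
        dst_base = subgroup_id * ELEMS_PER_SUBGROUP + it * SUBGROUP_SIZE
        yield delinearize(src, shape), delinearize(dst_base, shape)

# Example: the first gather of lane 0 in subgroup 1 starts at row 32, column 0.
assert next(lane_gather_plan(subgroup_id=1, lane_id=0)) == ((32, 0), (32, 0))
```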
This is then lowered to something like the following (in this example, assume
`workgroup_size = 256`, `subgroup_size = 64`, loading a `64x128xi8` tile):
```mlir
scf.for %indvar = %c0 to %c32 step %c1 {
  ;; thread-specific gathering address from global address
  %17 = affine.apply affine_map<()[s0, s1, s2] -> (s0 + s1 * 2048 + s2 * 64)>()[%lane_id, %subgroup_id, %indvar]
  %18:2 = affine.delinearize_index %17 into (128, 64) : index, index
  ;; this iteration's base storing index
  %19 = affine.apply affine_map<()[s0, s1] -> (s0 * 2048 + s1 * 64)>()[%subgroup_id, %indvar]
  %20:2 = affine.delinearize_index %19 into (128, 64) : index, index 
  iree_gpu.global_load_dma %subview_13[%18#0, %18#1] -> %alloc_5[%20#0, %20#1] : memref<128x64xi8, strided<[256, 1], offset: ?>, #amdgpu.address_space<fat_raw_buffer>> -> memref<128x64xi8, #gpu.address_space<workgroup>>
}
;; if there are residual elements (subgroup_copy_region_size % subgroup_size != 0), copy residual elements here 
gpu.barrier
```
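
For reference, the constants in the affine maps above follow directly from the
example parameters: a 256-thread workgroup with subgroup size 64 gives 4
subgroups, so the 8192-element tile splits into 2048 contiguous elements per
subgroup (the `s1 * 2048` term); each loop iteration gathers
`subgroup_size = 64` elements (the `s2 * 64` term), so each lane runs
`2048 / 64 = 32` iterations (the `%c32` loop bound).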

## Dependent PRs:
* design doc: https://hackmd.io/N0RitxPzT9GPhM0jEPtOCg?view
* upstream changes required: 
  * llvm/llvm-project#133498
  * llvm/llvm-project#136405
  * llvm/llvm-project#137671
  * llvm/llvm-project#137425
  * #20800 (review)

---------

Signed-off-by: Alan Li <me@alanli.org>
AWoloszyn pushed a commit that referenced this pull request Dec 1, 2025
AWoloszyn pushed a commit that referenced this pull request Dec 1, 2025