Conversation

@krzysz00
Contributor

Testing of the load to LDS PR discovered that we were setting a non-existent `__oclc_version` global instead of the intended `__oclc_ABI_version` when populating the constants needed for the device libraries.

@krzysz00 krzysz00 requested a review from kuhar as a code owner May 14, 2025 17:07
@benvanik benvanik enabled auto-merge (squash) May 14, 2025 17:25
@benvanik benvanik merged commit cab896d into iree-org:main May 14, 2025
40 of 41 checks passed
lialan added a commit that referenced this pull request May 28, 2025
## Summary
This PR sets the foundation for using the `global_load_lds` instruction to
load values from global to LDS memory. The pipeline is as follows:
* Only `linalg.copy` ops emitted by `GPUPromoteMatmulOperands` are converted.
When the pass sees fit, it attaches the `#iree_gpu.use_global_load_dma`
attribute to the `linalg.copy` to tag it through the rest of the pipeline
(a sketch of this gating condition follows the list).
* A tagged `linalg.copy` is not decomposed/tiled until bufferization.
* After distribution to threads and bufferization, the tagged `linalg.copy`
is lowered to a sequence of code built around the subgroup-coalesced loading
op `iree_gpu.global_load_dma`.
* `iree_gpu.global_load_dma` is then mapped to the `amdgpu.gather_to_lds` op,
which in turn is lowered to the corresponding ROCDL op.
* The pass that pads allocations to reduce bank conflicts is disabled,
because the destination workgroup memory has to be contiguous.
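
As a rough illustration of the gating described in the first bullet, here is a
minimal Python sketch of the decision as a predicate over a copy op. This is
not the actual pass code; every field name is hypothetical and stands in for
the corresponding check in the compiler.
```python
# Hypothetical sketch of the tagging condition summarized above -- not the
# actual IREE pass logic; the field names below are illustrative only.
def should_tag_for_global_load_dma(copy_op) -> bool:
    return (
        copy_op.emitted_by_gpu_promote_matmul_operands  # only copies created by operand promotion
        and copy_op.dest_in_workgroup_memory            # destination lives in LDS
        and copy_op.dest_is_contiguous                  # why the bank-conflict padding pass stays disabled
    )
```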

## Lowering `linalg.copy`
After bufferization and distribution to threads, the tagged `linalg.copy`
still exists in the IR:
```mlir
linalg.copy {lowering_config = #iree_gpu.use_global_load_dma}
  ins(%subview_12 : memref<64x128xi8, strided<[256, 1], offset: ?>, #amdgpu.address_space<fat_raw_buffer>>)
  outs(%alloc_4 : memref<64x128xi8, #gpu.address_space<workgroup>>)
```

Note that this `linalg.copy` is kept in each thread's code. The op is then
converted into an `scf.for` loop in which each subgroup of threads loads a
coalesced chunk of values. For example, assume there are `N` subgroups
loading from `tensor<a x b x c>`:
* The `i`-th subgroup loads a subtensor of size `[a/N, b, c]`, so each slice
is contiguous.
	* For now, assume a row-major layout and tile only the outermost dimension.
* We currently handle only `linalg.copy` ops emitted by
`GPUPromoteMatmulOperands` because we know their destinations are allocated
contiguously.
	* TODO: expand to arbitrary memref slices.
* Given `gpu.subgroup_id` and `gpu.lane_id`, each thread calculates the
contiguous data chunk that its subgroup is responsible for loading:
	* The chunk's indices are the delinearized indices of the input tensor,
ranging from
`affine.delinearize_index[gpu.subgroup_id * (num_elems_of(tensor) / num_subgroups)]`
to
`affine.delinearize_index[(gpu.subgroup_id + 1) * (num_elems_of(tensor) / num_subgroups) - 1]`.
* Assume each subgroup loads `n` values covering the linearized index range
`[N_f, N_b]`; then the thread with lane id `i` loads
`N_f + subgroup_size * iter + (i - 1)` for `iter = 0 to n` (see the Python
sketch after this list).
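
To make this arithmetic concrete, below is a minimal Python sketch (not the
pass implementation; the helper names are invented for illustration) of the
per-lane gather and per-subgroup store indices, using the constants from the
example that follows and assuming 0-based lane ids, which matches the affine
maps in the lowered IR.
```python
# Minimal model of the per-subgroup / per-lane index arithmetic described
# above. Constants follow the example below: workgroup size 256, subgroup
# size 64, an 8192-element i8 tile. Not the actual lowering code.
SUBGROUP_SIZE = 64
NUM_SUBGROUPS = 256 // SUBGROUP_SIZE                # 4 subgroups per workgroup
TOTAL_ELEMS = 128 * 64                              # 8192 elements in the copied tile
ELEMS_PER_SUBGROUP = TOTAL_ELEMS // NUM_SUBGROUPS   # 2048 contiguous elements per subgroup
ITERS = ELEMS_PER_SUBGROUP // SUBGROUP_SIZE         # 32 loop iterations per lane

def delinearize(linear, shape):
    """Row-major equivalent of affine.delinearize_index for a 2-D shape."""
    return linear // shape[1], linear % shape[1]

def lane_gather_plan(subgroup_id, lane_id, shape=(128, 64)):
    """Yield (source_indices, dest_base_indices) for each iteration of one lane."""
    for it in range(ITERS):
        # Linearized element this lane gathers in this iteration.
        src = subgroup_id * ELEMS_PER_SUBGROUP + it * SUBGROUP_SIZE + lane_id
        # Linearized base element the whole subgroup stores to in this iteration.
        dst_base = subgroup_id * ELEMS_PER_SUBGROUP + it * SUBGROUP_SIZE
        yield delinearize(src, shape), delinearize(dst_base, shape)

# Example: the first gather of lane 0 in subgroup 1 starts at row 32, column 0.
assert next(lane_gather_plan(subgroup_id=1, lane_id=0)) == ((32, 0), (32, 0))
```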
This is then lowered to something like the following (in this example, assume
`workgroup_size = 256`, `subgroup_size = 64`, loading a `64x128xi8` tile):
```mlir
scf.for %indvar = %c0 to %c32 step %c1 {
  ;; thread-specific gathering address from global address
  %17 = affine.apply affine_map<()[s0, s1, s2] -> (s0 + s1 * 2048 + s2 * 64)>()[%lane_id, %subgroup_id, %indvar]
  %18:2 = affine.delinearize_index %17 into (128, 64) : index, index
  ;; this iteration's base storing index
  %19 = affine.apply affine_map<()[s0, s1] -> (s0 * 2048 + s1 * 64)>()[%subgroup_id, %indvar]
  %20:2 = affine.delinearize_index %19 into (128, 64) : index, index 
  iree_gpu.global_load_dma %subview_13[%18#0, %18#1] -> %alloc_5[%20#0, %20#1] : memref<128x64xi8, strided<[256, 1], offset: ?>, #amdgpu.address_space<fat_raw_buffer>> -> memref<128x64xi8, #gpu.address_space<workgroup>>
}
;; if there are residual elements (subgroup_copy_region_size % subgroup_size != 0), copy residual elements here 
gpu.barrier
```
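
For reference, the constants in the affine maps above follow directly from the
example parameters: a 256-thread workgroup with subgroup size 64 gives 4
subgroups, so the 8192-element tile splits into 2048 contiguous elements per
subgroup (the `s1 * 2048` term); each loop iteration gathers
`subgroup_size = 64` elements (the `s2 * 64` term), so each lane runs
`2048 / 64 = 32` iterations (the `%c32` loop bound).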

## Dependent PRs:
* design doc: https://hackmd.io/N0RitxPzT9GPhM0jEPtOCg?view
* upstream changes required: 
  * llvm/llvm-project#133498
  * llvm/llvm-project#136405
  * llvm/llvm-project#137671
  * llvm/llvm-project#137425
  * #20800 (review)

---------

Signed-off-by: Alan Li <me@alanli.org>
AWoloszyn pushed a commit that referenced this pull request Dec 1, 2025
AWoloszyn pushed a commit that referenced this pull request Dec 1, 2025