Lower linalg.copy to direct global load #20568
Conversation
Side note, I'm still poking at getting the buffer fat pointer to LDS intrinsic set up - it's caught up in bikeshedding on the compiler team.
krzysz00 left a comment:
High-level observation: at the point this is being called, shouldn't we know the subgroup size, so that we don't need the subgroup_id op?
Like, you can just look at the workgroup sizes to see which subgroup you're in
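For illustration, a minimal sketch of that suggestion, assuming a 1-D workgroup and a statically known `subgroupSize` (the variable names are hypothetical, and `rewriter`/`loc` are assumed in scope):

```cpp
// Derive the subgroup index from the flat thread id instead of emitting a
// gpu.subgroup_id op, since the subgroup size is already known here.
Value tid = rewriter.create<gpu::ThreadIdOp>(loc, gpu::Dimension::x);
Value size = rewriter.create<arith::ConstantIndexOp>(loc, subgroupSize);
Value subgroupId = rewriter.create<arith::DivUIOp>(loc, tid, size);
```

A multi-dimensional workgroup would first linearize the x/y/z thread ids before the division.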
```cpp
//===----------------------------------------------------------------------===//

SmallVector<int64_t>
UseGlobalLoadDMAAttr::getStaticTilingLevelSizes(unsigned level,
```
Why's there an unsigned in here?
This is how the LoweringConfigAttrInterface exposes tiling levels. It's up to the backend + lowering config to interpret the level consistently.
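For context, a hedged sketch of one way the backend side could interpret the level; the `TilingLevel` check and the helper are illustrative, not the actual IREE definitions:

```cpp
// Sketch: levels this attribute does not participate in yield empty tile
// sizes, so only the agreed-upon level drives tiling.
SmallVector<int64_t>
UseGlobalLoadDMAAttr::getStaticTilingLevelSizes(unsigned level,
                                                Operation *op) const {
  if (level != static_cast<unsigned>(IREE::GPU::TilingLevel::Thread))
    return {};
  return deriveThreadTileSizes(op); // hypothetical helper
}
```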
Changed the title: iree_gpu.global_load_dma op → linalg.copy to direct global load
krzysz00 left a comment:
A few notes, some of which I apparently failed to submit on Friday x.x
fixed.
```cpp
  return numElements;
}

static bool distributeLinalgCopyToThreads(RewriterBase &rewriter,
```
Also, this might want to be a LogicalResult?
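That is, something along these lines (a sketch of the suggested signature change, not the final code):

```cpp
// Returning LogicalResult lets the caller propagate failure through the
// rewriter instead of interpreting a bare bool.
static LogicalResult distributeLinalgCopyToThreads(RewriterBase &rewriter,
                                                   linalg::CopyOp copy) {
  if (!isEligibleForGlobalLoadDMA(copy)) // hypothetical predicate
    return rewriter.notifyMatchFailure(copy, "copy not eligible");
  // ... emit the distributed loop ...
  return success();
}
```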
This test is failing on Windows:
- https://github.com/iree-org/iree/actions/runs/15320695495/job/43103617380#step:10:338
- https://github.com/iree-org/iree/actions/runs/15343607885/job/43174932290#step:10:342
Logs snippet:

```
108/1604 Test #116: iree/compiler/Codegen/Common/GPU/test/gpu_lower_to_global_loads.mlir.test ........................................***Failed 1.59 sec
-- Testing: 1 tests, 1 workers --
FAIL: IREE :: src/iree/compiler/Codegen/Common/GPU/test/gpu_lower_to_global_loads.mlir (1 of 1)
******************** TEST 'IREE :: src/iree/compiler/Codegen/Common/GPU/test/gpu_lower_to_global_loads.mlir' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
iree-opt --split-input-file --pass-pipeline="builtin.module(func.func(iree-codegen-gpu-lower-to-global-loads))" C:/home/runner/_work/iree/iree/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_lower_to_global_loads.mlir | FileCheck C:/home/runner/_work/iree/iree/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_lower_to_global_loads.mlir # RUN: at line 1
+ iree-opt --split-input-file '--pass-pipeline=builtin.module(func.func(iree-codegen-gpu-lower-to-global-loads))' C:/home/runner/_work/iree/iree/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_lower_to_global_loads.mlir
+ FileCheck C:/home/runner/_work/iree/iree/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_lower_to_global_loads.mlir
C:/home/runner/_work/iree/iree/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_lower_to_global_loads.mlir:25:11: error: CHECK: expected string not found in input
// CHECK: %[[C4:.*]] = arith.constant 4 : index
^
<stdin>:6:32: note: scanning from here
%c1 = arith.constant 1 : index
^
<stdin>:8:2: note: possible intended match here
%1 = gpu.subgroup_id : index
^
C:/home/runner/_work/iree/iree/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_lower_to_global_loads.mlir:55:11: error: CHECK: expected string not found in input
// CHECK: %[[C4:.*]] = arith.constant 4 : index
^
<stdin>:28:32: note: scanning from here
%c1 = arith.constant 1 : index
^
<stdin>:30:2: note: possible intended match here
%1 = gpu.subgroup_id : index
^
```

Might need to use CHECK-DAG instead of CHECK or change the compiler code to be more deterministic across platforms.
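Concretely, the CHECK-DAG route would look something like this in the test file (a sketch; the surrounding check lines are abbreviated):

```mlir
// The two constants are created in a platform-dependent order, so check them
// order-independently; the structural check below stays ordered.
// CHECK-DAG: %[[C4:.*]] = arith.constant 4 : index
// CHECK-DAG: %[[C1:.*]] = arith.constant 1 : index
// CHECK: gpu.subgroup_id : index
```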
Will submit a fix for it soon.
The usual issue here is evaluation order of function arguments when you're nesting build() calls if that helps
you're probably right @krzysz00 - I suspect it is:

```cpp
scf::ForOp forOp = rewriter.create<scf::ForOp>(
    loc, /*lb=*/rewriter.create<arith::ConstantIndexOp>(loc, 0),
    /*ub=*/rewriter.create<arith::ConstantIndexOp>(loc, numCopiesPerThread),
    /*steps=*/rewriter.create<arith::ConstantIndexOp>(loc, 1));
```
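A minimal sketch of the corresponding fix, hoisting the operands into named locals so the op creation order no longer depends on C++ argument evaluation order:

```cpp
// Build the constants in a fixed, explicit order, then the loop.
Value lb = rewriter.create<arith::ConstantIndexOp>(loc, 0);
Value ub = rewriter.create<arith::ConstantIndexOp>(loc, numCopiesPerThread);
Value step = rewriter.create<arith::ConstantIndexOp>(loc, 1);
auto forOp = rewriter.create<scf::ForOp>(loc, lb, ub, step);
```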
## Summary
This PR sets the foundation for using the `global_load_lds` instruction to
load values from global to LDS memory. The pipeline is as follows:
* Only convert `linalg.copy` ops emitted by `GPUPromoteMatmulOperands`. When
it sees fit, that pass inserts a distinct attribute
(`#iree_gpu.use_global_load_dma`) on the `linalg.copy` to tag it through the
pipeline.
* A tagged `linalg.copy` is not decomposed/tiled until bufferization.
* After distribution to threads and bufferization, the tagged
`linalg.copy` is lowered to a sequence of code built around the
subgroup-coalesced loading op `iree_gpu.global_load_dma`.
* `iree_gpu.global_load_dma` is mapped to the `amdgpu.gather_to_lds`
op, which is in turn mapped to the corresponding ROCDL op.
* The pass that pads allocations to reduce bank conflicts is disabled,
because the destination workgroup memory has to be contiguous.
## Lowering `linalg.copy`
After bufferization and distribution to threads, the tagged `linalg.copy`
still exists in the IR:
```mlir
linalg.copy {lowering_config = #iree_gpu.use_global_load_dma}
ins(%subview_12 : memref<64x128xi8, strided<[256, 1], offset: ?>, #amdgpu.address_space<fat_raw_buffer>>)
outs(%alloc_4 : memref<64x128xi8, #gpu.address_space<workgroup>>)
```
Note that this `linalg.copy` is kept in the thread's code. The op itself
is then converted into a `for` loop, in which each subgroup of threads loads
a coalesced chunk of values. For example, assume there are N subgroups
loading from `tensor<a x b x c>`:
* The `i`-th subgroup will load a subtensor of size `[a/N, b, c]`, so
each slice is consecutive.
* For now, assume row-major layout and tile only the outermost dim.
* We currently handle only `linalg.copy` ops emitted by
`GPUPromoteMatmulOperands`, because for those we know the destination is
allocated contiguously.
* TODO: expand to arbitrary memref slices.
* Given `gpu.subgroup_id` and `gpu.lane_id`, each thread calculates the
consecutive data chunk its subgroup is responsible for loading:
* The chunk's indices are the delinearized indices of the input tensor,
ranging from
* `affine.delinearize_index[gpu.subgroup_id * (num_elems_of(tensor) /
num_subgroups)]` to
* `affine.delinearize_index[(gpu.subgroup_id + 1) *
(num_elems_of(tensor) / num_subgroups) - 1]`.
* Assuming each subgroup loads `n` values spanning linearized indices `[N_f,
N_b]`, the thread with lane id `i` loads `N_f + subgroup_size * iter +
(i - 1)` for `iter = 0 to n`.
Then it will be converted to something like the following (in the
example, assume `workgroup size = 256`, `subgroup_size = 64`, loading
`64x128xi8`):
```mlir
scf.for %indvar = %c0 to %c32 step %c1 {
  // thread-specific gathering address from the global address
  %17 = affine.apply affine_map<()[s0, s1, s2] -> (s0 + s1 * 2048 + s2 * 64)>()[%lane_id, %subgroup_id, %indvar]
  %18:2 = affine.delinearize_index %17 into (128, 64) : index, index
  // this iteration's base storing index
  %19 = affine.apply affine_map<()[s0, s1] -> (s0 * 2048 + s1 * 64)>()[%subgroup_id, %indvar]
  %20:2 = affine.delinearize_index %19 into (128, 64) : index, index
  iree_gpu.global_load_dma %subview_13[%18#0, %18#1] -> %alloc_5[%20#0, %20#1] : memref<128x64xi8, strided<[256, 1], offset: ?>, #amdgpu.address_space<fat_raw_buffer>> -> memref<128x64xi8, #gpu.address_space<workgroup>>
}
// if there are residual elements (subgroup_copy_region_size % subgroup_size != 0), copy them here
gpu.barrier
```
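As a sanity check, the constants in this example follow directly from the stated sizes:

```latex
\[
N = \tfrac{256}{64} = 4 \text{ subgroups}, \qquad
\tfrac{64 \cdot 128}{4} = 2048 \text{ elements per subgroup}, \qquad
\tfrac{2048}{64} = 32 \text{ iterations},
\]
```

which matches the `s1 * 2048` (per-subgroup base) and `s2 * 64` (per-iteration stride) terms in the affine maps, and the loop bound `%c32`.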
## Dependent PRs:
* design doc: https://hackmd.io/N0RitxPzT9GPhM0jEPtOCg?view
* upstream changes required:
* llvm/llvm-project#133498
* llvm/llvm-project#136405
* llvm/llvm-project#137671
* llvm/llvm-project#137425
* #20800 (review)
---------
Signed-off-by: Alan Li <me@alanli.org>