[MLIR][XeGPU] Update XeGPU create_tdesc, update_offset, load, store and prefetch. #154653
@@ -70,28 +70,32 @@ def XeGPU_CreateNdDescOp: XeGPU_Op<"create_nd_tdesc", [Pure, ViewLikeOpInterface
     future). Elements in the subview are contiguous in each dimension. It encodes the
     following important information for supporting Intel hardware features:

-    * source: an object representing (starting address/pointer of) a memory region.
+    Arguments:
+    - `source`: an object representing (starting address/pointer of) a memory region.
      It can be either a memref object, or simply a pointer represented by uint64_t type.
      For the case of dynamic memrefs or a pointer, the shape and layout information of the
      memory region should be explicitly passed via `shape` and `strides` parameters.

-    * offsets: index values represents offsets from the "source" at the each dimension
+    - `offsets`: index values representing offsets from the "source" in each dimension
      at which the subview of the target memory will be created. It is encoded via
      "offsets" and "const_offsets", such that it can accept various forms, such as,
      operands (e.g., [%c0, %c]) and attributes (e.g., [2, 4]).

-    * shape: the shape information of the memory region pointed by the "source". It is
+    - `shape`: the shape information of the memory region pointed to by the "source". It is
      typically encoded via the MemRefType of the source, e.g., memref<4096x4096xf16>.
      But if "source" is simply a pointer represented as uint64_t type, or a memref
      type without shape information e.g., memref<?x?xf16>, the shape information has
      to be explicitly passed via the "shape" and "const_shape" arguments.

-    * strides: the strides of the memory region pointed by the "source". Similar to shape,
+    - `strides`: the strides of the memory region pointed to by the "source". Similar to shape,
      it is typically encoded via the MemRefType of the source too. But if "source" is
      simply a pointer represented as uint64_t type, or a memref type without shape
      information e.g., memref<?x?xf16>, the strides information has to be explicitly
      passed via the "strides" and "const_strides" argument.

+    Results:
+    - `res`: the nd tensor descriptor

     Example 1 (suppose the tensor shape inferred by the compiler is 8x16):
     ```mlir
     %0 = memref.alloc() : memref<1024x1024xf32>
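    // A hedged sketch of how this example plausibly continues: the descriptor
    // creation below and its result type are assumptions consistent with the
    // 8x16 shape stated above. Shape and strides come from the static memref type.
    %1 = xegpu.create_nd_tdesc %0[0, 0]
        : memref<1024x1024xf32> -> !xegpu.tensor_desc<8x16xf32>
    ```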
@@ -500,12 +504,17 @@ def XeGPU_CreateDescOp: XeGPU_Op<"create_tdesc", [Pure, ViewLikeOpInterface]> {
     (scattered) subviews, allowing each work-item in a subgroup to specify its own offset.
     It accepts the following parameters:

-    * source: a 1D memref or pointer (uint64_t) represents the flattened memory object.
-    * offsets: a vector containing offsets of each access point. Its size
+    Arguments:
+    - `source`: a 1D memref or pointer (i64, i32, ui64, ui32) representing the flattened
+      memory object.
+    - `offsets`: a vector containing offsets of each access point. Its size
      is fixed to the hardware supported subgroup size, e.g., 16 on PVC,
      implying each element in the vector corresponds to a work-item (SIMT lane)
      in the subgroup.

+    Results:
+    - `res`: the scattered tensor descriptor

     The first dimension of the result TensorDesc corresponds to work-items, so it should
     match the dimension of offsets. It may also have a second dimension corresponding to
     the chunk_size if the chunk size is larger than 1.
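A hedged sketch of the work-item/chunk_size relationship just described, assuming a subgroup size of 16 and a chunk size of 8 (`%src`, the offsets constant, and the scatter attribute spelling are illustrative assumptions):

```mlir
// 16 offsets -> 16 work-items; chunk_size = 8 adds a second result dimension,
// so the descriptor type is 16x8.
%off = arith.constant dense<[0, 8, 16, 24, 32, 40, 48, 56,
                             64, 72, 80, 88, 96, 104, 112, 120]> : vector<16xindex>
%td = xegpu.create_tdesc %src, %off : memref<1024xf32>, vector<16xindex>
    -> !xegpu.tensor_desc<16x8xf32, #xegpu.scatter_tdesc_attr<chunk_size = 8>>
```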
@@ -536,8 +545,8 @@ def XeGPU_CreateDescOp: XeGPU_Op<"create_tdesc", [Pure, ViewLikeOpInterface]> {
     ```
   }];

-  let arguments = (ins XeGPU_BaseAddrType: $source,
-                       XeGPU_OffsetType: $offsets);
+  let arguments = (ins XeGPU_GatherScatterBaseAddrType:$source,
+                       XeGPU_OffsetType:$offsets);
   let results = (outs XeGPU_TensorDesc:$TensorDesc);

   let builders = [
@@ -595,6 +604,16 @@ def XeGPU_PrefetchOp : XeGPU_Op<"prefetch", []> {
     As compared to prefetch_nd, which works on non-scattered TensorDesc,
     it works on scattered TensorDesc instead.

+    Arguments:
+    - `source`: represents the memory region to be prefetched, which can be either a
+      tensor_desc or a 1D memref or pointer (ui64, ui32, i64 or i32).
+      In case of tensor_desc, offsets come from the producer create_tdesc op.
+      tensor_desc cannot be used in SIMT mode.
+    - `offsets`: represents offsets from source. Required if `source` is not a TensorDescType.
+      The offsets value is a vector of `index` type whose length is either the subgroup size,
+      or 1 in SIMT mode. A scalar offset is also valid in SIMT mode.
+    - `l1_hint`, `l2_hint`, `l3_hint`: optional cache hints for each level of cache.

     Example 1:
     ```mlir
     xegpu.prefetch %tdesc {l1_hint = #xegpu.cache_hint<cached>,
@@ -606,7 +625,7 @@ def XeGPU_PrefetchOp : XeGPU_Op<"prefetch", []> {
     Example 2:
     A variant accepts a memref as base pointer and an offset instead of a scattered TensorDesc.
     It combines "create scattered TensorDesc" and "prefetch with scattered TensorDesc".
-    The source operand could be a raw pointer (uint64_t).
+    The source operand could be a raw pointer (ui64, ui32, i64, i32).
     Please refer to create_tdesc for the restriction of memref.
     ```mlir
     %a = memref.alloc() : memref<1024xf32>
@@ -617,13 +636,22 @@ def XeGPU_PrefetchOp : XeGPU_Op<"prefetch", []> {
       : memref<1024xf32>, vector<4xindex>
     ```

+    Example 3 (SIMT mode):
+    SIMT mode only accepts the offsets variant.
+    ```mlir
+    xegpu.prefetch %0[%1] {l1_hint = #xegpu.cache_hint<cached>,
+                           l2_hint = #xegpu.cache_hint<cached>,
+                           l3_hint = #xegpu.cache_hint<cached>}
+          : memref<256xf32>, vector<1xindex>
+    ```

   }];
-  let arguments = (ins XeGPU_GatherScatterSourceType: $source,
-                       Optional<XeGPU_OffsetType>: $offsets,
-                       OptionalAttr<XeGPU_CacheHintAttr>: $l1_hint,
-                       OptionalAttr<XeGPU_CacheHintAttr>: $l2_hint,
-                       OptionalAttr<XeGPU_CacheHintAttr>: $l3_hint);
+  let arguments = (ins XeGPU_GatherScatterSourceType:$source,
+                       Optional<AnyTypeOf<[XeGPU_OffsetType, Index]>>:$offsets,
Contributor: why not move Index to XeGPU_OffsetType?

Author: Scalar offset is provided for simplifying layout distribution to SIMT.

Contributor: Will supporting both
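As a hedged illustration of the scalar-offset SIMT form under discussion (the memref, the offset value, and the exact assembly of the scalar variant are assumptions, modeled on the vector<1xindex> example above):

```mlir
// SIMT mode: a single lane prefetches at one scalar index instead of
// carrying a 1-element offset vector.
%c64 = arith.constant 64 : index
xegpu.prefetch %src[%c64] {l1_hint = #xegpu.cache_hint<cached>}
    : memref<256xf32>, index
```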
+                       OptionalAttr<XeGPU_CacheHintAttr>:$l1_hint,
+                       OptionalAttr<XeGPU_CacheHintAttr>:$l2_hint,
+                       OptionalAttr<XeGPU_CacheHintAttr>:$l3_hint);

   let extraClassDeclaration = extraBaseClassDeclaration # [{
     Type getSourceType() {

Contributor: Could you help to get rid of getTensorDesc() and getTensorDescType() methods if XeGPU_GatherScatterSourceType doesn't support TensorDesc.

Author: As mentioned above,
@@ -670,8 +698,26 @@ def XeGPU_LoadGatherOp : XeGPU_Op<"load", [MemoryEffects<[MemRead]>]> {
     The mask operand masks out memory access so that it is safe to pass out-of-boundary
     addresses/offsets as long as they are masked. It applies to slots of SIMD lanes.

-    In SIMT mode, the result vector represents the data to be loaded by each work-item.
-    Each work-item recieves a `chunk_size` number of elements.
+    In SIMT mode, the result is a 1D vector that represents the data to be loaded by
+    each work-item. If its size is not 1, the size should be equal to the chunk size.

+    Arguments:
+    - `source`: represents the memory region to be loaded from, which can be either a
+      tensor_desc or a 1D memref or pointer (ui64, ui32, i64 or i32).
+      In case of tensor_desc, offsets come from the producer create_tdesc op.
+      tensor_desc cannot be used in SIMT mode.
+    - `offsets`: represents offsets from source. Required if `source` is not a TensorDescType.
Contributor: Suggested change

Author: vector<1xindex> can auto convert to index by materialization cast.
+      The offsets value is a vector of `index` type whose length is either the subgroup size,
+      or 1 in SIMT mode. A scalar offset is also valid in SIMT mode.
Contributor: why is the scalar offset case needed? I would imagine the offsets will always be a vector if coming from SG/WG level. So I don't think this is required (but nice to have if it does not complicate our logic too much). I think a vector<1xindex> would also do the exact same thing (DCE, canonicalize kick in) with much less maintenance effort on our side.

Author: Keeping single element vector and relying on clean up passes to remove
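A minimal sketch of the folding being discussed, assuming standard vector-dialect ops (the value names are illustrative):

```mlir
// A 1-element offset vector can be reduced to a scalar index; after the
// extract, DCE/canonicalization can remove the wrapper vector entirely.
%off_vec = vector.broadcast %off : index to vector<1xindex>
%off_scalar = vector.extract %off_vec[0] : index from vector<1xindex>
```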
+    - `mask`: a vector of `i1` type, used to mask out the memory access. Its size equals
+      the subgroup size, or 1 in SIMT mode. A scalar mask is also valid in SIMT mode.
+    - `chunk_size`: (optional) the number of contiguous elements to load per work-item.
+    - `l1_hint`, `l2_hint`, `l3_hint`: optional cache hints for each level of cache.

+    Results:
+    - `res`: represents the loaded data.

     Example 1:
     ```mlir
@@ -691,19 +737,10 @@ def XeGPU_LoadGatherOp : XeGPU_Op<"load", [MemoryEffects<[MemRead]>]> {
       vector<16xi1> -> vector<16x8xf32>
     ```

-    Example 3 (SIMT mode):
-    ```mlir
-    %2 = xegpu.load %1, %0 <{l1_hint = #xegpu.cache_hint<cached>,
-                             l2_hint = #xegpu.cache_hint<uncached>,
-                             l3_hint = #xegpu.cache_hint<uncached>}>
-          : !xegpu.tensor_desc<16x8xf32, #xegpu.scatter_tdesc_attr<memory_space=global, chunk_size=8>>
-            vector<16xi1> -> vector<8xf32>
-    ```
-
-    Example 4:
+    Example 3:
     A variant accepts a memref as base pointer and an offset instead of a scattered TensorDesc.
     It combines "create scattered TensorDesc" and "load with scattered TensorDesc".
-    The source operand could be a raw pointer (uint64_t). Please refer to create_tdesc
+    The source operand could be a raw pointer (ui64, ui32, i64, i32). Please refer to create_tdesc
     for the restriction of memref.
     ```mlir
     %a = memref.alloc() : memref<1024xf32>
@@ -715,15 +752,24 @@ def XeGPU_LoadGatherOp : XeGPU_Op<"load", [MemoryEffects<[MemRead]>]> {
       : memref<1024xf32>, vector<16xi1>, vector<16xindex> -> vector<16xf32>
     ```

+    Example 4 (SIMT mode):
+    SIMT mode only accepts the offsets variant. chunk_size can be inferred from the result
+    type. In this example, chunk_size is 8.
+    ```mlir
+    %3 = xegpu.load %1[%2], %0 <{l1_hint = #xegpu.cache_hint<cached>,
+                                 l2_hint = #xegpu.cache_hint<uncached>,
+                                 l3_hint = #xegpu.cache_hint<uncached>}>
+          : memref<128xf32>, vector<1xindex>, vector<1xi1> -> vector<8xf32>
+    ```

   }];

-  let arguments = (ins XeGPU_GatherScatterSourceType: $source,
-                       Optional<XeGPU_OffsetType>: $offsets,
-                       XeGPU_MaskType: $mask,
-                       OptionalAttr<I64Attr>: $chunk_size,
-                       OptionalAttr<XeGPU_CacheHintAttr>: $l1_hint,
-                       OptionalAttr<XeGPU_CacheHintAttr>: $l2_hint,
-                       OptionalAttr<XeGPU_CacheHintAttr>: $l3_hint);
+  let arguments = (ins XeGPU_GatherScatterSourceType:$source,
+                       Optional<AnyTypeOf<[XeGPU_OffsetType, Index]>>:$offsets,
+                       AnyTypeOf<[XeGPU_MaskType, I1]>:$mask, OptionalAttr<I64Attr>:$chunk_size,
Contributor: same comment as above, consider moving I1 inside MarkTy?

Author: I think keeping it separate is better for now. Same reason as stated above for offsets.
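A hedged sketch of the separate scalar forms this thread discusses (the memref, the values, and the exact assembly of the scalar variant are assumptions; chunk_size is inferred from the result type as described above):

```mlir
// SIMT mode with a scalar offset and a scalar i1 mask instead of
// 1-element vectors; the vector<8xf32> result implies chunk_size = 8.
%off = arith.constant 96 : index
%m = arith.constant true
%v = xegpu.load %src[%off], %m : memref<128xf32>, index, i1 -> vector<8xf32>
```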
+                       OptionalAttr<XeGPU_CacheHintAttr>:$l1_hint,
+                       OptionalAttr<XeGPU_CacheHintAttr>:$l2_hint,
+                       OptionalAttr<XeGPU_CacheHintAttr>:$l3_hint);
   let results = (outs XeGPU_ValueType: $value);

   let extraClassDeclaration = extraBaseClassDeclaration # [{
@@ -777,15 +823,31 @@ def XeGPU_LoadGatherOp : XeGPU_Op<"load", [MemoryEffects<[MemRead]>]> {

 def XeGPU_StoreScatterOp : XeGPU_Op<"store", [MemoryEffects<[MemWrite]>]> {
   let summary = "store data to scattered memory locations.";
-  let description = [{ It (aka. store) stores data to scattered memory locations. The value is
+  let description =
+    [{ It (aka. store) stores data to scattered memory locations. The value is
     typically a 1D vector. But when the chunk size of the TensorDesc is larger than 1, it will be
     a 2D vector instead. For the latter case, dim-1 of the value corresponds to the simd lanes
     and the dim-0 of the value corresponds to the chunk size stored per lane. So `store_scatter`
     has a transpose effect, which is similar to `load_gather`. Therefore, a transpose attribute is
     introduced on purpose, making sure users are aware of this implicit transformation.

-    In SIMT mode, the input vector represents the data to be stored by each work-item.
-    Each work-item stores a `chunk_size` number of elements.
+    In SIMT mode, the value is a 1D vector that represents the data to be stored by
+    each work-item. If its size is not 1, the size should be equal to the chunk size.

+    Arguments:
+    - `value`: represents the data to be stored.
+    - `dest`: represents the memory region to be stored to, which can be either a
+      tensor_desc or a 1D memref or pointer (ui64, ui32, i64 or i32).
+      In case of tensor_desc, offsets come from the producer create_tdesc op.
+      tensor_desc cannot be used in SIMT mode.
+    - `offsets`: represents offsets from dest. Required if `dest` is not a TensorDescType.
+      The offsets value is a vector of `index` type whose length is either the subgroup size,
+      or 1 in SIMT mode. A scalar offset is also valid in SIMT mode.
+    - `mask`: a vector of `i1` type, used to mask out the memory access. Its size equals
+      the subgroup size, or 1 in SIMT mode. A scalar mask is also valid in SIMT mode.
+    - `chunk_size`: (optional) the number of contiguous elements to store per work-item.
+    - `l1_hint`, `l2_hint`, `l3_hint`: optional cache hints for each level of cache.

     Example 1:
     ```mlir
@@ -803,15 +865,7 @@ def XeGPU_StoreScatterOp : XeGPU_Op<"store", [MemoryEffects<[MemWrite]>]> {
       : vector<16x8xf32>, !xegpu.tensor_desc<16x8xf32, #xegpu.scattered_tdesc_attr<chunk_size=8>>, vector<16xi1>
     ```

-    Example 3 (SIMT mode):
-    ```mlir
-    xegpu.store %0, %1, %2 <{l1_hint = #xegpu.cache_hint<uncached>,
-                             l2_hint = #xegpu.cache_hint<write_back>,
-                             l3_hint = #xegpu.cache_hint<write_through>}>
-          : vector<8xf32>, !xegpu.tensor_desc<16x8xf32, #xegpu.scattered_tdesc_attr<chunk_size=8>> vector<16xi1>
-    ```
-
-    Example 4:
+    Example 3:
     A variant accepts a memref as base pointer and an offset instead of a scattered TensorDesc.
     It combines "create scattered TensorDesc" and "store with scattered TensorDesc".
     The dest operand could be a raw pointer (uint64_t).
Contributor: Suggested change

Author: Thank! Missed that one.
@@ -827,17 +881,25 @@ def XeGPU_StoreScatterOp : XeGPU_Op<"store", [MemoryEffects<[MemWrite]>]> {
       : memref<1024xf32>, vector<16xi1>, vector<16xindex> -> vector<16xf32>
     ```

+    Example 4 (SIMT mode):
+    SIMT mode only accepts the offsets variant. chunk_size can be inferred from the value
+    type. In this example, chunk_size is 8.
+    ```mlir
+    xegpu.store %0, %1[%2], %3 <{l1_hint = #xegpu.cache_hint<uncached>,
+                                 l2_hint = #xegpu.cache_hint<write_back>,
+                                 l3_hint = #xegpu.cache_hint<write_through>}>
+          : vector<8xf32>, memref<256xf32>, vector<1xindex>, vector<1xi1>
+    ```

   }];

-  let arguments = (ins
-    XeGPU_ValueType: $value,
-    XeGPU_GatherScatterSourceType: $dest,
-    Optional<XeGPU_OffsetType>: $offsets,
-    XeGPU_MaskType: $mask,
-    OptionalAttr<I64Attr>: $chunk_size,
-    OptionalAttr<XeGPU_CacheHintAttr>: $l1_hint,
-    OptionalAttr<XeGPU_CacheHintAttr>: $l2_hint,
-    OptionalAttr<XeGPU_CacheHintAttr>: $l3_hint);
+  let arguments = (ins XeGPU_ValueType:$value,
+                       XeGPU_GatherScatterSourceType:$dest,
+                       Optional<AnyTypeOf<[XeGPU_OffsetType, Index]>>:$offsets,
+                       AnyTypeOf<[XeGPU_MaskType, I1]>:$mask, OptionalAttr<I64Attr>:$chunk_size,
+                       OptionalAttr<XeGPU_CacheHintAttr>:$l1_hint,
+                       OptionalAttr<XeGPU_CacheHintAttr>:$l2_hint,
+                       OptionalAttr<XeGPU_CacheHintAttr>:$l3_hint);

   let extraClassDeclaration = extraBaseClassDeclaration # [{
     Type getDestType() {
Contributor: Nit: could you help to update the doc here? It seems the definition has switched to use memref/pointer instead of TensorDesc. I didn't see XeGPU_GatherScatterSourceType contains TensorDesc, or maybe I miss it. But anyway, we are retiring the support for scattered TensorDesc.

Author: XeGPU_GatherScatterSourceType does contain XeGPU_TensorDesc. Check here:
https://github.com/silee2/llvm-project/blob/updateScatterDesc/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUTypes.td#L196-#L197

Contributor: Ah, my bad. Just realized that XeGPU_GatherScatterSourceType is different from XeGPU_GatherScatterBaseSourceType.