Merged
Changes from 5 commits
90 changes: 65 additions & 25 deletions mlir/include/mlir/Dialect/XeGPU/IR/XeGPUOps.td
@@ -500,7 +500,8 @@ def XeGPU_CreateDescOp: XeGPU_Op<"create_tdesc", [Pure, ViewLikeOpInterface]> {
(scattered) subviews, allowing each work-item in a subgroup to specify its own offset.
It accepts the following parameters:

* source: a 1D memref or pointer (uint64_t) represents the flattened memory object.
* source: a 1D memref or pointer (i64, i32, ui64, ui32) representing the flattened
memory object.
* offsets: a vector containing offsets of each access point. Its size
is fixed to the hardware supported subgroup size, e.g., 16 on PVC,
implying each element in the vector corresponds to a work-item (SIMT lane)
@@ -510,6 +511,8 @@ def XeGPU_CreateDescOp: XeGPU_Op<"create_tdesc", [Pure, ViewLikeOpInterface]> {
match the dimension of offsets. It may also have a second dimension corresponding to
the chunk_size if the chunk size is larger than 1.

This op is not available in SIMT mode.

Example 1: It assumes subgroup size is 4, and accesses a[0], a[16], a[32], a[64]
```mlir
%a = memref.alloc() : memref<1024xf32>
@@ -536,7 +539,7 @@ def XeGPU_CreateDescOp: XeGPU_Op<"create_tdesc", [Pure, ViewLikeOpInterface]> {
```
}];

let arguments = (ins XeGPU_BaseAddrType: $source,
let arguments = (ins XeGPU_GatherScatterBaseAddrType: $source,
XeGPU_OffsetType: $offsets);
let results = (outs XeGPU_TensorDesc:$TensorDesc);

@@ -617,6 +620,15 @@ def XeGPU_PrefetchOp : XeGPU_Op<"prefetch", []> {
: memref<1024xf32>, vector<4xindex>
```

Example 3 (SIMT mode):
SIMT mode only accepts the offsets variant.
```mlir
xegpu.prefetch %0[%1] {l1_hint = #xegpu.cache_hint<cached>,
l2_hint = #xegpu.cache_hint<cached>,
l3_hint = #xegpu.cache_hint<cached>}
: memref<256xf32>, vector<1xindex>
```

}];

let arguments = (ins XeGPU_GatherScatterSourceType: $source,
@@ -670,8 +682,19 @@ def XeGPU_LoadGatherOp : XeGPU_Op<"load", [MemoryEffects<[MemRead]>]> {
The mask operand masks out memory access so that it is safe to pass out-of-boundary
addresses/offsets as long as they are masked. It applies to slots of SIMD lanes.

In SIMT mode, the result vector represents the data to be loaded by each work-item.
Each work-item recieves a `chunk_size` number of elements.
In SIMT mode, the result is a 1D vector that represents the data to be loaded by
each work-item. If its size is not 1, the size must equal the chunk size.

`source` represents the memory region to be loaded from, which can be either a
tensor_desc or a 1D memref or pointer (ui64, ui32, i64 or i32).
In case of tensor_desc, the offsets come from the producer create_tdesc op.
tensor_desc cannot be used in SIMT mode.
`offsets` represents the offsets from source. It is required if `source` is not a
TensorDescType. It is a vector of `index` type whose length is either the subgroup
size or, in SIMT mode, 1.
`mask` is a vector of `i1` type that masks out memory accesses. Its size equals
the subgroup size, or 1 in SIMT mode.
`chunk_size` (optional) is the number of contiguous elements to load per work-item.

Example 1:
```mlir
@@ -691,16 +714,7 @@ def XeGPU_LoadGatherOp : XeGPU_Op<"load", [MemoryEffects<[MemRead]>]> {
vector<16xi1> -> vector<16x8xf32>
```

Example 3 (SIMT mode):
```mlir
%2 = xegpu.load %1, %0 <{l1_hint = #xegpu.cache_hint<cached>,
l2_hint = #xegpu.cache_hint<uncached>,
l3_hint = #xegpu.cache_hint<uncached>}>
: !xegpu.tensor_desc<16x8xf32, #xegpu.scatter_tdesc_attr<memory_space=global, chunk_size=8>>
vector<16xi1> -> vector<8xf32>
```

Example 4:
Example 3:
A variant accepts a memref as the base address and offsets instead of a scattered TensorDesc.
It combines "create scattered TensorDesc" and "load with scattered TensorDesc".
The source operand could be a raw pointer (uint64_t). Please refer to create_tdesc
Contributor
Suggested change:
-The source operand could be a raw pointer (uint64_t). Please refer to create_tdesc
+The source operand could be a raw pointer (ui64, ui32, i64 or i32). Please refer to create_tdesc

@@ -715,6 +729,16 @@ def XeGPU_LoadGatherOp : XeGPU_Op<"load", [MemoryEffects<[MemRead]>]> {
: memref<1024xf32>, vector<16xindex>, vector<16xi1> -> vector<16xf32>
```

Example 4 (SIMT mode):
SIMT mode only accepts the offsets variant. chunk_size can be inferred from result
type. In this example, chunk_size is 8.
```mlir
%3 = xegpu.load %1[%2], %0 <{l1_hint = #xegpu.cache_hint<cached>,
l2_hint = #xegpu.cache_hint<uncached>,
l3_hint = #xegpu.cache_hint<uncached>}>
: memref<128xf32>, vector<1xindex>, vector<1xi1> -> vector<8xf32>
```

}];

let arguments = (ins XeGPU_GatherScatterSourceType: $source,
@@ -784,8 +808,20 @@ def XeGPU_StoreScatterOp : XeGPU_Op<"store", [MemoryEffects<[MemWrite]>]> {
has transpose effect, which is similar to `load_gather`. Therefore, a transpose attribute is
introduced on purpose, making sure users are aware of this implicit transformation.

In SIMT mode, the input vector represents the data to be stored by each work-item.
Each work-item stores a `chunk_size` number of elements.
In SIMT mode, the value is a 1D vector that represents the data to be stored by
each work-item. If its size is not 1, the size must equal the chunk size.

`value` represents the data to be stored.
`dest` represents the memory region to be stored to, which can be either a
tensor_desc or a 1D memref or pointer (ui64, ui32, i64 or i32).
In case of tensor_desc, the offsets come from the producer create_tdesc op.
tensor_desc cannot be used in SIMT mode.
`offsets` represents the offsets from dest. It is required if `dest` is not a
TensorDescType. It is a vector of `index` type whose length is either the subgroup
size or, in SIMT mode, 1.
`mask` is a vector of `i1` type that masks out memory accesses. Its size equals
the subgroup size, or 1 in SIMT mode.
`chunk_size` (optional) is the number of contiguous elements to store per work-item.

Example 1:
```mlir
@@ -803,15 +839,7 @@ def XeGPU_StoreScatterOp : XeGPU_Op<"store", [MemoryEffects<[MemWrite]>]> {
: vector<16x8xf32>, !xegpu.tensor_desc<16x8xf32, #xegpu.scattered_tdesc_attr<chunk_size=8>>, vector<16xi1>
```

Example 3 (SIMT mode):
```mlir
xegpu.store %0, %1, %2 <{l1_hint = #xegpu.cache_hint<uncached>,
l2_hint = #xegpu.cache_hint<write_back>,
l3_hint = #xegpu.cache_hint<write_through>}>
: vector<8xf32>, !xegpu.tensor_desc<16x8xf32, #xegpu.scattered_tdesc_attr<chunk_size=8>> vector<16xi1>
```

Example 4:
Example 3:
A variant accepts a memref as the base address and offsets instead of a scattered TensorDesc.
It combines "create scattered TensorDesc" and "store with scattered TensorDesc".
The dest operand could be a raw pointer (uint64_t).
Contributor
Suggested change:
-The dest operand could be a raw pointer (uint64_t).
+The dest operand could be a raw pointer (ui64, ui32, i64 or i32).

Contributor Author
Thanks! Missed that one.

@@ -827,6 +855,16 @@ def XeGPU_StoreScatterOp : XeGPU_Op<"store", [MemoryEffects<[MemWrite]>]> {
: vector<16xf32>, memref<1024xf32>, vector<16xindex>, vector<16xi1>
```

Example 4 (SIMT mode):
SIMT mode only accepts the offsets variant. chunk_size can be inferred from value
type. In this example, chunk_size is 8.
```mlir
xegpu.store %0, %1[%2], %3 <{l1_hint = #xegpu.cache_hint<uncached>,
l2_hint = #xegpu.cache_hint<write_back>,
l3_hint = #xegpu.cache_hint<write_through>}>
: vector<8xf32>, memref<256xf32>, vector<1xindex>, vector<1xi1>
```

}];

let arguments = (ins
@@ -895,6 +933,8 @@ def XeGPU_UpdateOffsetOp: XeGPU_Op<"update_offset",
update the offset per work-item, so its offsets contain values representing
shifts for each work-item.

This op is not available in SIMT mode.

Example:
```mlir
%off = arith.constant dense<[32, 32, 32, 32]> : vector<4xindex>
6 changes: 4 additions & 2 deletions mlir/include/mlir/Dialect/XeGPU/IR/XeGPUTypes.td
@@ -16,13 +16,15 @@ include "mlir/IR/BuiltinTypes.td"
def XeGPU_IntType: AnyTypeOf<[I1, I8, I16, I32, I64, SI1, SI8, SI16, SI32, SI64, UI1, UI8, UI16, UI32, UI64]>;
def XeGPU_FloatType: AnyTypeOf<[F16, F32, F64, BF16, TF32]>;
def XeGPU_ScalarType: AnyTypeOf<[XeGPU_IntType, XeGPU_FloatType]>;
def XeGPU_BaseAddrType: AnyTypeOf<[Non0RankedMemRefOf<[XeGPU_ScalarType]>, UI64, UI32, I64, I32]>;
def XeGPU_PointerType: AnyTypeOf<[UI64, UI32, I64, I32]>;
def XeGPU_BaseAddrType: AnyTypeOf<[Non0RankedMemRefOf<[XeGPU_ScalarType]>, XeGPU_PointerType]>;
def XeGPU_DpasOprType: FixedVectorOfRankAndType<[1, 2, 3], [XeGPU_ScalarType]>;
def XeGPU_DpasResType: FixedVectorOfRankAndType<[1, 2], [XeGPU_ScalarType]>;
def XeGPU_OffsetType: FixedVectorOfNonZeroRankOf<[Index]>;
def XeGPU_MaskType: FixedVectorOfNonZeroRankOf<[I1]>;
def XeGPU_ValueType: FixedVectorOfNonZeroRankOf<[XeGPU_ScalarType]>;
def XeGPU_VectorType: VectorOfRankAndType<[1,2,3,4,5,6], [XeGPU_ScalarType]>;
def XeGPU_GatherScatterBaseAddrType: AnyTypeOf<[MemRefRankOf<[XeGPU_ScalarType], [1]>, XeGPU_PointerType]>;
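// Unlike XeGPU_BaseAddrType above, this constraint restricts memrefs to rank 1,
// matching the flattened memory views consumed by the gather/scatter ops.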

// common base class for types in XeGPU dialect
class XeGPUTypeDef<string name, string typeMnemonic, list<Trait> traits = [],
@@ -189,7 +191,7 @@ def XeGPU_TensorDesc: XeGPUTypeDef<"TensorDesc", "tensor_desc",
let genVerifyDecl = 1;
}

def XeGPU_GatherScatterSourceType : AnyTypeOf<[XeGPU_TensorDesc,Non0RankedMemRefOf<[XeGPU_ScalarType]>, UI64]>;
def XeGPU_GatherScatterSourceType : AnyTypeOf<[XeGPU_TensorDesc, XeGPU_GatherScatterBaseAddrType]>;

def XeGPU_Nbarrier: XeGPUTypeDef<"Nbarrier", "nbarrier", [], "mlir::Type"> {
let summary = "!xegpu.nbarrier a custom XeGPU type representing a barrier.";
41 changes: 18 additions & 23 deletions mlir/lib/Dialect/XeGPU/IR/XeGPUOps.cpp
@@ -58,13 +58,6 @@ static SmallVector<int64_t> getShapeOf(Type type) {
return shape;
}

static int64_t getRankOf(Value val) {
auto type = val.getType();
if (auto ty = llvm::dyn_cast<ShapedType>(type))
return ty.getRank();
return 0;
}

static bool isReadHintOrNone(const CachePolicyAttr &attr) {
if (!attr)
return true;
@@ -685,10 +678,6 @@ void CreateDescOp::build(OpBuilder &builder, OperationState &state,
LogicalResult CreateDescOp::verify() {
auto tdescTy = getTensorDescType();

if (getRankOf(getSource()) > 1)
return emitOpError(
"Expecting the source is a 1D memref or pointer (uint64_t).");

if (!tdescTy.isScattered())
return emitOpError("Expects a scattered TensorDesc.\n");

@@ -723,13 +712,15 @@ LogicalResult CreateDescOp::verify() {
LogicalResult PrefetchOp::verify() {
auto tdescTy = getTensorDescType();

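  // Offsets are given either explicitly (raw pointer/memref source) or
  // implicitly via a scattered tensor_desc producer, never both at once.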
if (!tdescTy && !getOffsets())
return emitOpError("Expects offsets.");

if (tdescTy && getOffsets())
return emitOpError("offsets not allowed.");

if (tdescTy && !tdescTy.isScattered())
return emitOpError("Expects a scattered TensorDesc.");

if (!tdescTy && getRankOf(getSource()) > 1)
return emitOpError(
"Expecting the source is a 1D memref or pointer (uint64_t).");

if (!isReadHintOrNone(getL1HintAttr()))
return emitOpError("invalid l1_hint: ") << getL1HintAttr();

@@ -757,13 +748,15 @@ LogicalResult LoadGatherOp::verify() {
auto maskTy = getMaskType();
auto valueTy = getValueType();

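  // Same exclusivity rule as in PrefetchOp::verify: explicit offsets are
  // required for the raw base-address form and disallowed with a tensor_desc.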
if (!tdescTy && !getOffsets())
return emitOpError("Expects offsets.");

if (tdescTy && getOffsets())
return emitOpError("offsets not allowed.");

if (tdescTy && !tdescTy.isScattered())
return emitOpError("Expects a scattered TensorDesc.");

if (!tdescTy && getRankOf(getSource()) > 1)
return emitOpError(
"Expecting the source is a 1D memref or pointer (uint64_t).");

if (!isReadHintOrNone(getL1HintAttr()))
return emitOpError("invalid l1_hint: ") << getL1HintAttr();

@@ -804,13 +797,15 @@ LogicalResult StoreScatterOp::verify() {
auto maskTy = getMaskType();
auto valueTy = getValueType();

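  // Offsets and tensor_desc remain mutually exclusive (see PrefetchOp::verify).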
if (!tdescTy && !getOffsets())
return emitOpError("Expects offsets.");

if (tdescTy && getOffsets())
return emitOpError("offsets not allowed.");

if (tdescTy && !tdescTy.isScattered())
return emitOpError("Expects a scattered TensorDesc.");

if (!tdescTy && getRankOf(getDest()) > 1)
return emitOpError(
"Expecting the dest is a 1D memref or pointer (uint64_t).");

if (!isWriteHintOrNone(getL1HintAttr()))
return emitOpError("invalid l1_hint: ") << getL1HintAttr();

61 changes: 58 additions & 3 deletions mlir/test/Dialect/XeGPU/invalid.mlir
@@ -387,11 +387,28 @@ func.func @load_gather_vc_3(%src: ui64) {
// -----
func.func @prefetch_offset_wi_1(%src: memref<4x4xf32>) {
%offsets = arith.constant dense<[0]> : vector<1xindex>
// expected-error@+1 {{Expecting the source is a 1D memref or pointer}}
// expected-error@+1 {{op operand #0 must be TensorDesc describing regions of interested data}}
xegpu.prefetch %src[%offsets]: memref<4x4xf32>, vector<1xindex>
return
}

// -----
func.func @prefetch_offset_wi_2(%src: memref<16xf32>) {
%offsets = arith.constant dense<[0]> : vector<1xindex>
%1 = xegpu.create_tdesc %src, %offsets : memref<16xf32>, vector<1xindex>
-> !xegpu.tensor_desc<1x3xf32, #xegpu.scatter_tdesc_attr<chunk_size = 3>>
// expected-error@+1 {{offsets not allowed}}
xegpu.prefetch %1[%offsets]: !xegpu.tensor_desc<1x3xf32, #xegpu.scatter_tdesc_attr<chunk_size = 3>>, vector<1xindex>
return
}

// -----
func.func @prefetch_offset_wi_3(%src: memref<16xf32>) {
// expected-error@+1 {{Expects offsets}}
xegpu.prefetch %src: memref<16xf32>
return
}

// -----
func.func @load_gather_offset_sg(%src: memref<?xf16>) {
%offsets = arith.constant dense<[0, 8, 16, 24]> : vector<4xindex>
@@ -428,12 +445,50 @@ func.func @store_scatter_offset_wi_2(%src: memref<4x4xf16>) {
%val = arith.constant dense<2.9>: vector<4xf16>
%offsets = arith.constant dense<[0]> : vector<1xindex>
%mask = arith.constant dense<1>: vector<1xi1>
// expected-error@+1 {{Expecting the dest is a 1D memref or pointer}}
// expected-error@+1 {{op operand #1 must be TensorDesc describing regions of interested data}}
xegpu.store %val, %src[%offsets], %mask
: vector<4xf16>, memref<4x4xf16>, vector<1xindex>, vector<1xi1>
return
}

// -----
func.func @store_scatter_offset_wi_3(%src: memref<16xf16>) {
%val = arith.constant dense<2.9>: vector<1xf16>
%mask = arith.constant dense<1>: vector<1xi1>
// expected-error@+1 {{Expects offsets}}
xegpu.store %val, %src, %mask
: vector<1xf16>, memref<16xf16>, vector<1xi1>
return
}

// -----
func.func @store_scatter_offset_wi_4(%src: !xegpu.tensor_desc<1x1xf32, #xegpu.scatter_tdesc_attr<>>) {
%val = arith.constant dense<2.9>: vector<1xf16>
%offsets = arith.constant dense<[0]> : vector<1xindex>
%mask = arith.constant dense<1>: vector<1xi1>
// expected-error@+1 {{offsets not allowed}}
xegpu.store %val, %src[%offsets], %mask
: vector<1xf16>, !xegpu.tensor_desc<1x1xf32, #xegpu.scatter_tdesc_attr<>>, vector<1xindex>, vector<1xi1>
return
}

// -----
func.func @load_gather_offset_wi_4(%src: !xegpu.tensor_desc<1x2xf16, #xegpu.scatter_tdesc_attr<>>) {
%mask = arith.constant dense<1>: vector<1xi1>
%offsets = arith.constant dense<[0]> : vector<1xindex>
// expected-error@+1 {{offsets not allowed}}
%2 = xegpu.load %src[%offsets], %mask <{chunk_size = 2}> : !xegpu.tensor_desc<1x2xf16, #xegpu.scatter_tdesc_attr<>>, vector<1xindex>, vector<1xi1> -> vector<2xf16>
return
}

// -----
func.func @load_gather_offset_wi_3(%src: ui64) {
%mask = arith.constant dense<1>: vector<1xi1>
// expected-error@+1 {{Expects offsets}}
%2 = xegpu.load %src, %mask <{chunk_size = 2}> : ui64, vector<1xi1> -> vector<2xf16>
return
}

// -----
func.func @load_gather_offset_wi_2(%src: ui64) {
%mask = arith.constant dense<1>: vector<1xi1>
@@ -447,7 +502,7 @@ func.func @load_gather_offset_wi_2(%src: ui64) {
func.func @load_gather_offset_wi_1(%src: memref<4x4xf32>) {
%mask = arith.constant dense<1>: vector<1xi1>
%offsets = arith.constant dense<[0]> : vector<1xindex>
// expected-error@+1 {{Expecting the source is a 1D memref or pointer}}
// expected-error@+1 {{op operand #0 must be TensorDesc describing regions of interested data}}
%2 = xegpu.load %src[%offsets], %mask <{chunk_size = 2}> : memref<4x4xf32>, vector<1xindex>, vector<1xi1> -> vector<2xf32>
return
}