[DT][SVE] DT support for scalable tiles - encoding materialization for mmt4d #21304
Conversation
Thanks for the PR, it does remind me to review llvm/llvm-project#146531; I'll take a look today.
I'm not an expert here, but I feel that it could be off if the host and device are not the same machine. How does vscale work in practice? My understanding is that it is a runtime constant that depends on the execution target. I think the host side needs to query the value somehow; we likely need a new mechanism for propagating the vscale value. I have a few ideas here, and other ideas are welcome:
If (1) is doable, you can always replace the vscale with a specific number, and we'll be able to do const-eval (not ready yet). If (1) is not doable and (2) is doable, we may only do it in the initializer, but not const-eval.
I think the padding size depends on vscale, so maybe the easiest way to unblock the prototype is using the maximal value for vscale.
Thanks for the feedback!
Technically speaking, every target that implements SVE implements a specific vector length between 128 and 2048 bits, and vscale is the multiple of 128 bits. Therefore, this value is known and fixed at runtime. That said, the point of having scalable vectors in the first place, IIUC, is vector-length-agnostic programming. So although getting and propagating the specific vscale value for a specific target would be a step forward, the goal should IMO be to keep this as a runtime value. In that regard:
seems more on point to me :)
I think if we keep the value of vscale a runtime constant, we cannot do const-eval. But I had a question: is the initializer run on the "host" side or the "device" side? Nevertheless, I think the vscale op is illegal in the initializer as of now, so that's an action item :) I'll mark it on the issue I'll be creating.
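As a rough sketch of the vector-length arithmetic discussed above (the helper names below are made up for illustration; this is not IREE code): the SVE vector length is vscale * 128 bits, with vscale between 1 and 16, and a scalable inner tile written as [8] resolves to 8 * vscale elements once vscale is fixed on a concrete target.

```cpp
#include <cassert>

// Illustrative only, not IREE code: SVE defines the vector length as a
// runtime constant vscale * 128 bits, with vscale in [1, 16]
// (i.e. 128-bit to 2048-bit vectors).
constexpr unsigned kSveGranuleBits = 128;

unsigned sveVectorLengthBits(unsigned vscale) {
  return vscale * kSveGranuleBits;
}

// A scalable inner tile size such as [8] resolves to 8 * vscale
// elements on a concrete target.
unsigned scalableTileElements(unsigned baseTile, unsigned vscale) {
  return baseTile * vscale;
}
```

For example, on a 512-bit SVE implementation (vscale = 4), the tile [8] holds 32 elements.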
Thank you for starting the discussion @egebeysel and for helping with this super exciting project! :)
Yes.
Indeed. However, there could be an intermediate step that would work well for now - we make the host query itself. Could that work? This would only make sense when "host" and "device" are physically identical - ATM that's the case for the devices that we are targeting. However, it wouldn't work for e.g. M4 (there's SME, but no SVE).
Yes, but this would mean that we assume that there's 64
+100. Ideally, IREE would allow 3 options:
- Query the "device" to get the value of vscale. IREE would do it either at compile or at run time - the latter would be preferable, but then we would definitely lose const-eval, right?
- Provide a compile-time flag/option/mechanism to specify vscale. This goes against the Vector Length Agnostic approach (we've avoided specifying vscale at compile time thus far), but would unblock us and allow us to experiment/iterate. This would allow us to keep const-eval, right?
- Conservatively assume that vectors are 2048 bits wide. While this would mean over-allocating in most cases known today (the widest SVE implementation that I am aware of is 512 bits), it would be safe, easy to implement, and keep access to const-eval, right?
To me, [1] feels like the most natural approach, but it is also the most involved, and we may see perf degradation. [2] and [3] are good stepping stones. It's hard to pick a "favourite" without more perf data.
IIUC, there are two things: storage size calculation, and the actual layout of the data resulting from the value of vscale, e.g. if the layout resulting from the packing looks like 16x16 -> 2x?x8x[8]. On the other hand, storage size calculation decides how big the buffer allocated for the dispatch is, but does not impose anything regarding how much of that buffer the dispatch actually uses or how the data is laid out inside it - meaning that we can overestimate and allocate a bigger buffer but then use only a portion of it. So as long as we want to pack the data into the more efficient layout, we lose const-eval anyway - unless the scalable dimension corresponds not to the weights/constants (RHS in most cases) but to the other, non-constant input of mmt4d. That's my current impression; please correct me if I'm mistaken or overlooking something.
I also think this would be good to have, to unblock further progress. Also FYI: I'm drafting the issue for this and will ping you on it to move the discussion there.
banach-space
left a comment
Hey @egebeysel, from the SVE perspective, this encoding materialisation makes sense to me. Your approach, i.e. the implementation, also looks good to me, but I will defer to the experts to confirm/review, CC @hanhanW.
I have left a number of comments, but that's mostly to provide the wider SVE context and to set the direction.
IIUC, we can proceed with this while discussing the other issues separately - as in, I don't see anything in those issues that would block this. Please let me know if I missed something. Thanks!
// TODO(egebeysel): I think this isScalable*Enabled flag should be temporary
// and the temporary SME flag should probably come next to it.
I think this isScalable*Enabled flag should be temporary
Two things need to happen before we can enable "scalable vectorization" by default.
- Perf parity between NEON and SVE. We might actually achieve that once data tiling supports SVE.
- Stress testing of SVE. We already do that to a degree downstream, but we need to do more :)
In the meantime, we do need to keep isScalableVectorizationEnabled. Even when SVE reaches maturity, we may still want to keep a flag to fall back to NEON (e.g. for experimentation or to prototype new features).
the temporary SME flag should probably come next to it.
We enable SME when:
- Scalable vectorization is "ON".
- +sme is present as a target feature.
// TODO(egebeysel): I think this isScalable*Enabled flag should be temporary
// and the temporary SME flag should probably come next to it.
if (!isAArch64(config) || !isScalableVectorizationEnabled())
  return failure();
Perhaps ...
- return failure();
+ return notifyMatchFailure("Pre-conditions to enable scalable tiling are not met");
Put this in a debug print :) Hope that also works!
You can define a LDBG macro which simplifies the code. E.g., it can be LDBG("Pre-conditions to enable scalable tiling are not met!")
#define DEBUG_TYPE "iree-codegen-materialize-encoding"
#define DBGS() (llvm::dbgs() << "[" DEBUG_TYPE "]: ")
#define LDBG(X) LLVM_DEBUG(DBGS() << X << "\n")
// TODO(egebeysel): Choosing between SME
// & SVE should also probably reside around here. I guess we could also just
// choose SME and then have the fallback option implemented with
// 2DScalableTo1D pass and also even from 1D to non-scalable. Currently,
// hard-coded to pick SVE since SME with DT is not yet supported.
I would trim this comment:
- // TODO(egebeysel): Choosing between SME
- // & SVE should also probably reside around here. I guess we could also just
- // choose SME and then have the fallback option implemented with
- // 2DScalableTo1D pass and also even from 1D to non-scalable. Currently,
- // hard-coded to pick SVE since SME with DT is not yet supported.
+ // TODO(egebeysel): Add logic for SME.
Basically, the following means that SME is on and should be used instead of SVE:
- --iree-global-opt-enable-early-materialization=false --iree-llvmcpu-target-cpu-features="+sve,+sme"
We shouldn't require any other logic. With this in mind, since we don't support SME yet, it would be good to bail out if +sme is added on the command line.
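A minimal sketch of the suggested bail-out (the function names and the string-based feature lookup below are stand-ins for however IREE actually inspects the target config, not the real API): since SME with data tiling is not supported yet, refuse scalable tiles whenever +sme appears among the target CPU features.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the real target-config lookup: checks
// whether a feature token appears in the comma-separated feature list.
bool hasFeature(const std::string &features, const std::string &feature) {
  return features.find(feature) != std::string::npos;
}

// Sketch of the suggested guard: bail out on +sme (unsupported with
// data tiling for now), otherwise require +sve and the global
// scalable-vectorization toggle.
bool shouldUseScalableTiles(const std::string &cpuFeatures,
                            bool scalableVectorizationEnabled) {
  if (!scalableVectorizationEnabled)
    return false;
  if (hasFeature(cpuFeatures, "+sme"))
    return false;
  return hasFeature(cpuFeatures, "+sve");
}
```

This mirrors the decision described above; the real implementation would query the executable target attributes rather than a raw string.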
Currently doing this. One thing though - do we need the +sve flag for SME, or is +sme enough by itself? Because ... Apple M4
Generally speaking, when naming individual CPU target features, name them all / do not rely on implications between them. In principle we could propagate implied features, but in practice I don't believe that is currently, consistently happening.
Generally speaking, when naming individual CPU target features, name them all / do not rely on implications between them. In principle we could propagate implied features, but in practice I don't believe that is currently, consistently happening.
Oh thanks! So you're saying that if we need both flags for SME (codegen) to be on, then I should name them both. That makes sense. The next question is whether there is such an implication between SME & SVE, i.e. whether we need +sve for SME to be on. Because AFAIK, M4 happens to support only SME and not SVE :)
Assuming that M4 does support both SVE and SME, the answer is yes - you should specify both, in the same way as it is useful to mention +i8mm,+dotprod and not just +i8mm. On x86 we have a worse version of this, where we mention not only all the supported AVX-512 features but also AVX, AVX2, and FMA below that.
I don't believe it is even possible for a CPU to support SME but not SVE. SME instructions have register operands that are SVE registers Z0-Z31, used for the LHS and RHS tiles, separately from the ZA registers introduced by SME for the accumulator. You could check the Armv9 architecture reference manual; I would expect it to say that SME implies SVE.
One way out of that is, instead of passing --iree-llvmcpu-target-cpu-features=..., you can pass --iree-llvmcpu-target-cpu=apple-m4. It is up to each CPU architecture to implement the semantics of that flag, so on some CPU architectures this will fail to populate CPU features. But on AArch64 this has been implemented in #19881, so you should be able to just pass --iree-llvmcpu-target-cpu=apple-m4.
On top of enabling CPU features, this enables LLVM scheduling models for the given CPU, so in theory this results in better codegen. In practice it's up to each LLVM target backend.
I don't believe it to be even possible for a CPU to support SME but not SVE
It is :)
The Z0-Z31 registers from SME may (but don't have to) overlap with the Z0-Z31 registers from SVE. And M4 is one example where SME was added while SVE was left out (see the Wikipedia article on the Apple M4).
On top of enabling CPU features, this enables LLVM scheduling models for the given CPU, so in theory this results in better codegen.
Side note, I am told that for big OoO cores, LLVM scheduling models are not really used/relevant.
I appreciate that I am not providing any "official" links here, but most of these things are not so visibly documented 😅
do we need the +sve flag for SME or is +sme enough by itself?
Given the above, on M4 you would only use +sme. Or, even better, what @bjacob suggested:
But on Aarch64, this has been implemented in #19881 so you should be able to just pass --iree-llvmcpu-target-cpu=apple-m4.
Thanks @banach-space , I stand corrected!
@@ -1,4 +1,5 @@
// RUN: iree-opt --split-input-file --pass-pipeline='builtin.module(iree-stream-specialize-encodings)' --verify-diagnostics %s | FileCheck %s
// RUN: iree-opt --split-input-file --pass-pipeline='builtin.module(iree-stream-specialize-encodings)' --iree-llvmcpu-enable-scalable-vectorization=true --verify-diagnostics %s | FileCheck %s --check-prefix=VSCALE
[nit] Here and in other places, avoid VSCALE as a check-prefix; it clashes with vector.vscale and the VSCALE lit variable. Perhaps SCALABLE_VEC?
done
#encoding_result = #iree_encoding.encoding<operand_index = 2, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], iteration_sizes = [?, ?, ?]>
func.func @matmul_lowering_f32f32f32_aarch64() attributes {
-  hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="aarch64-xyz-xyz", ukernels = "all", iree.encoding.resolver = #iree_cpu.cpu_encoding_resolver<>}>
+  hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {cpu_features="+sve", target_triple="aarch64-xyz-xyz", ukernels = "all", iree.encoding.resolver = #iree_cpu.cpu_encoding_resolver<>}>
OK, so with this change and with the new RUN line, you demonstrate that:
+sve alone is not sufficient to get scalable tile sizes.
Makes sense.
hanhanW
left a comment
The change itself on materialization looks okay to me. How to lower the vscale op, or query its value, would be tricky for host compilation.
- Query the "device" to get the value of vscale. IREE would do it either at compile or at run-time - the latter would be preferable, but then we would definitely lose const-eval, right?
If we can't query it at compile time, we can only do it in the initialization phase.
- Provide a compile-time flag/option/mechanism to specify vscale. This goes against the Vector Length Agnostic approach (we've avoided specifying vscale at compile time thus far), but would unblock us and allow us to experiment/iterate. This would allow us to keep const-eval, right?
Yes, it should allow us to keep const-eval, because you can interpret the scalable flags in terms of concrete layouts.
- Conservatively assume that vectors are 2048 bits wide. While this would mean over-allocating in most cases known today (the widest SVE implementation that I am aware of is 512 bits), it would be safe, easy to implement and keep access to const-eval, right?
IIUC, there are two things: storage size calculation, and the actual layout of the data resulting from the value of vscale, e.g. if the layout resulting from the packing looks like 16x16 -> 2x?x8x[8]. Essentially, the part that decides whether we can do const-eval is the layout that the data actually has to be in; packing the data as if we had vscale=16 regardless of the actual vscale value (and padding the rest) would give you const-eval, e.g. if the above example results in the "static" packing 16x16 -> 2x1x8x64. Although I have a hunch that this would actually give us worse performance than no data tiling.
On the other hand, storage size calculation decides how big the buffer allocated for the dispatch is, but does not impose anything regarding how much of that buffer the dispatch actually uses or how the data is laid out inside it - meaning that we can overestimate and allocate a bigger buffer but then use only a portion of it, e.g. ---DATA----******************* where * denotes unused memory should be valid. This lets us avoid making the "host side" aware of the actual value of vscale while still allocating a definitely-big-enough buffer.
So as long as we want to pack the data into the more efficient layout, we lose const-eval anyway - unless the scalable dimension corresponds not to the weights/constants (RHS in most cases) but to the other, non-constant input of mmt4d.
That's my current impression; please correct me if I'm mistaken or overlooking something.
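A minimal sketch of the over-allocation idea discussed above (the assumed vscale bound and helper names are mine, not IREE's actual storage-size logic): size the buffer for the architectural maximum vscale = 16, so that any runtime vscale fits and the dispatch simply uses a prefix of the allocation.

```cpp
#include <cassert>

// Illustrative sketch, not IREE's storage-size computation: pad the
// scalable dimension to the widest architecturally possible tile,
// i.e. assume vscale = 16 (2048-bit SVE vectors). The result is a
// safe upper bound for any actual vscale; the tail stays unused.
constexpr unsigned kMaxVscale = 16; // 2048 bits / 128-bit granule

unsigned conservativePaddedSize(unsigned dimSize, unsigned baseTile) {
  unsigned maxTile = baseTile * kMaxVscale;           // worst-case inner tile
  return (dimSize + maxTile - 1) / maxTile * maxTile; // round up to a multiple
}
```

E.g. with a base tile of 8 the worst-case tile is 128 elements, so a dimension of size 1000 would be budgeted as 1024 elements regardless of the actual vscale.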
Yeah, I think const-eval is still not available if vscale can only be queried at runtime. E.g., if you don't have the target device, how do you get the vscale value at compile time?
A potential solution may be to duplicate the weights several times (with different vscale values) and pick the corresponding copy based on a vscale query, but that makes memory usage larger; we are far from experimenting with that.
Can you add a test to https://github.com/iree-org/iree/blob/main/compiler/src/iree/compiler/Codegen/Dialect/Codegen/Utils/unittests/UtilsTest.cpp?
Generally, there are limited unit tests in LLVM/MLIR-based projects. However, it is tricky to separate the tests from stream, because that is the intention of encoding specialization. Perhaps we can create a testing pass in the future. For now, I'd add a few unit tests for it because it is a generic util/struct.
Added one, thanks for the pointer. I didn't realise I had forgotten to implement some operator overloads for the scalable tiles; added them as well!
// For shape inference, mark scalable tiles as dynamic sizes for the
// shape inference. Note, scalable tiles that are represented with
// static inner tile sizes.
nit: remove dup words. (please format the comment. I can't format it easily because I edited it on web UI.)
- // For shape inference, mark scalable tiles as dynamic sizes for the
+ // Mark scalable tiles as dynamic sizes for the
  // shape inference. Note, scalable tiles that are represented with
  // static inner tile sizes.
done
Thanks for the review and for your comments! I opened an issue for the storage size calculation and const-evaluation, which IIUC entail the issues you mention in your other comments as well. I think we should keep this PR to the materialisation only and discuss those on the issues. If you're talking about another issue about host compilation, I'd be happy to discuss it offline or in an issue as well :) For now, this should not change the behaviour of data tiling without explicitly enabling SVE, or of SVE without data tiling. To make this work end-to-end, I'll split the remaining issues and address them in separate PRs. Therefore, I'd humbly ask for one more review after addressing the comments :) Thanks! Also JFYI, I unfortunately cannot trigger the CI runs myself, so feel free to turn on automatic merging if you approve the PR :)
hanhanW
left a comment
Just two nits, I'll land the PR after they are fixed.
auto extractedScalableTiles = llvm::to_vector(llvm::map_range(
    cast<ArrayAttr>(dictAttr.getNamed("scalableTiles")->getValue()),
    [](Attribute a) { return cast<BoolAttr>(a).getValue(); }));
I think what you're looking for is llvm::map_to_vector.
Thanks, exactly :)
// Mark scalable tiles as dynamic sizes for the
// shape inference. Note, scalable tiles that are represented with
// static inner tile sizes.
Can you format it? I think "shape inference" can fit on the first line?
done, sorry I forgot!
Head branch was pushed to by a user without write access
banach-space
left a comment
Thanks!
Thanks for the reviews, just a quick ping since I can't merge :) @hanhanW @banach-space
Thanks, I landed the change after removing the CC and adding the two tracking issue IDs to the commit description.
hanhanW
left a comment
Sorry that I did not see the issue in the first place. I think we need a few more tests that materialize set_encoding/unset_encoding into pack/unpack with scalable variants.
Can you help add test cases in a follow-up? I think we can do something similar to this x86 test, but switched to the aarch64 version.
iree/compiler/src/iree/compiler/Codegen/Common/test/materialize_encoding_x86_64.mlir
Lines 3 to 18 in 8f61580
#pipeline_layout = #hal.pipeline.layout<bindings = [
  #hal.pipeline.binding<storage_buffer>,
  #hal.pipeline.binding<storage_buffer>
]>
#encoding = #iree_encoding.encoding<operand_index = 0, op_type = matmul, element_types = [bf16, bf16, bf16], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iteration_sizes = [1, 1000, ?]>
func.func @set_encoding_with_padding_semantics_bf16_x86_64_avx512f() attributes {
  hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f", iree.encoding.resolver = #iree_cpu.cpu_encoding_resolver<>}>
} {
  %c0 = arith.constant 0 : index
  %0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) alignment(64) offset(%c0) flags(ReadOnly) : !iree_tensor_ext.dispatch.tensor<readonly:tensor<1x1000xbf16>>
  %1 = hal.interface.binding.subspan layout(#pipeline_layout) binding(1) alignment(64) offset(%c0) : !iree_tensor_ext.dispatch.tensor<writeonly:tensor<1x1000xbf16, #encoding>>
  %2 = iree_tensor_ext.dispatch.tensor.load %0, offsets = [0, 0], sizes = [1, 1000], strides = [1, 1] : !iree_tensor_ext.dispatch.tensor<readonly:tensor<1x1000xbf16>> -> tensor<1x1000xbf16>
  %3 = iree_encoding.set_encoding %2 : tensor<1x1000xbf16> -> tensor<1x1000xbf16, #encoding>
  iree_tensor_ext.dispatch.tensor.store %3, %1, offsets = [0, 0], sizes = [1, 1000], strides = [1, 1] : tensor<1x1000xbf16, #encoding> -> !iree_tensor_ext.dispatch.tensor<writeonly:tensor<1x1000xbf16, #encoding>>
  return
}
Note: we can't return the result of set_encoding directly, because we don't have a pattern to materialize encodings for func ops. It is on my TODO list, but I never got cycles for it: #20825
Oh yeah, I see what's missing. Let me take care of this. EDIT: #21560
Since iree-org#21304, encoding materialisation supports scalable tile sizes. This patch adds tests for that logic that we missed in the original revision. The newly added tests mirror similar tests for X86, with the added emphasis on scalable tile sizes. Note, ATM we only make the "N" dimension scalable, which matches SVE. For SME (TODO), we will make both "N" and "M" dimensions scalable.

This PR enables encoding materialization/specialization for mmt4d. It is a step towards #16162. This PR is the first out of N to follow in that regard.
Currently, materialization and specialization of mmt4d encodings are supported, although this does not yet work e2e. The current tiling setup does not yet support scalable inner tiles out of the box. The inner tile sizes that we choose here are supposed to be scalability-agnostic, although we should definitely revisit this in another issue and verify it empirically.
Generally, this works for device encoding materialization, but I'm not entirely sure how this would do for host encodings. I'd appreciate your opinions/guidance in that regard. I'm not entirely sure we want to allow vscale ops to reside in host-side code.
I've also done some investigation on enabling this e2e; however, there are some issues, and I will create separate issues with reproducers for them ASAIC. But a quick sneak peek:
- Storage size calculation of tensor types with scalable tiles: here, we need the vscale op to correctly estimate the size of the packed tensor, which AFAICS ends up on the host side. Tracking issue: #21317
- Since we're making the N dim scalable, we lose the ability to const-eval the weight constants, and vscale ops end up in initializers, which AFAICS is not supported right now. I'm not sure whether this is a problem; I just need opinions on how to deal with it :) Tracking issue: #21319
- Tile size and scalable tile flag propagation in dispatches with pack/unpack/mmt4d: the tile sizes and (more importantly) the scalable tile flag information should be correctly propagated to other ops in the same dispatch.
cc @banach-space