Conversation

@egebeysel
Contributor

This PR enables encoding materialization/specialization for mmt4d. It is a step towards #16162, and is the first of N PRs to follow in that direction.

Currently, materialization and specialization of mmt4d encodings are supported, although this does not yet work e2e. The current tiling setup does not yet support scalable inner tiles out of the box. The inner tile sizes chosen here are meant to be scalability-agnostic, although we should definitely revisit this in a separate issue and validate the choice empirically.
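To illustrate the idea (a rough sketch, not the exact IR this PR produces; names, shapes and tile sizes are made up), a scalable inner tile can be expressed as a vscale-dependent dynamic tile size when materializing the pack:

// Illustrative sketch: pack a 16x16 RHS with an M tile of 8 and a scalable
// N tile of 8 x vscale, i.e. an inner tile of [8] on the N dimension.
func.func @pack_rhs_scalable(%rhs: tensor<16x16xf32>, %dest: tensor<2x?x8x?xf32>) -> tensor<2x?x8x?xf32> {
  %pad = arith.constant 0.0 : f32
  %c8 = arith.constant 8 : index
  // Runtime query of the hardware vector-length multiple.
  %vscale = vector.vscale
  %n_tile = arith.muli %c8, %vscale : index
  // The scalable tile becomes a dynamic inner tile size of the pack.
  %packed = linalg.pack %rhs padding_value(%pad : f32)
      inner_dims_pos = [0, 1] inner_tiles = [8, %n_tile]
      into %dest : tensor<16x16xf32> -> tensor<2x?x8x?xf32>
  return %packed : tensor<2x?x8x?xf32>
}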

Generally, this works for device encoding materialization, but I'm not entirely sure how this would do for host encodings. I'd appreciate your opinions/guidance in that regard. In particular, I'm not sure whether we want to allow vscale ops to reside in host-side code.

I've also done some investigation into enabling this e2e; however, there are some issues, and I will create separate issues with reproducers for them as soon as I can. A quick sneak peek:

  • Storage size calculation of tensor types with scalable tiles: here, we need the vscale op to correctly estimate the size of the packed tensor - which AFAICS ends up on the host side.
  • Since we're making the N dim scalable, we lose the ability to const-eval the weight constants, and vscale ops end up in initializers, which AFAICS is not supported right now. I'm not sure if this would be a problem or not; I just need opinions on how to deal with this :)
  • Tile size and scalable tile flag propagation in dispatches with pack/unpack/mmt4d: the tile sizes and (more importantly) scalable tile flag information should be correctly propagated to other ops in the same dispatch.

cc @banach-space

@egebeysel egebeysel force-pushed the egb.sve_dt_materialization_pt1 branch from a7f6352 to 53e254d Compare July 8, 2025 15:05
@hanhanW hanhanW requested a review from bjacob July 8, 2025 16:15
@hanhanW
Contributor

hanhanW commented Jul 8, 2025

Thanks for the PR, it does remind me to review llvm/llvm-project#146531; I'll take a look today.

Storage size calculation of tensor types with scalable tiles: here, we need the vscale op to correctly estimate the size of the packed tensor - which AFAICS ends up on the host side.

I'm not an expert here, but I feel that it could be off if the host and device are not the same physical device. How does vscale work in practice? My understanding is that it is a runtime constant depending on execution target. I think the host side needs to query the value somehow; we likely need a new mechanism for propagating vscale value. I have a few ideas here, and other ideas are welcome:

  1. We describe the CPU features in executable_target; do we know the vscale for the given target at compile time? If so, we can probably attach such information to the config and use it in specialization on the host side.
  2. Maybe we need a mechanism to propagate such a runtime constant to the host (and vice versa if we want to provide the number of available threads to the device for distribution, but that is a different topic), so the host can use the value to do the allocation.

Since we're making the N dim scalable, we lose the ability to const-eval the weight constants, and vscale ops end up in initializers, which AFAICS is not supported right now. I'm not sure if this would be a problem or not; I just need opinions on how to deal with this :)

If (1) is doable, you can always replace the vscale with a specific number, and we'll be able to do const-eval (not ready yet). If (1) is not doable and (2) is doable, we may only do it in an initializer, but not const-eval.
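Roughly, in IR terms (an illustrative sketch, not actual IREE code):

// Option (2): vscale stays a runtime query.
%vscale = vector.vscale
// Option (1): if the target's vscale is known at compile time (say 2), the
// query can be folded to a constant and the packed sizes become static.
%vscale_const = arith.constant 2 : index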

@hanhanW
Contributor

hanhanW commented Jul 8, 2025

I think the padding size depends on vscale, so maybe the easiest way to unblock the prototype is to use the maximal value for vscale on the host side.

@egebeysel
Contributor Author

Thanks for the feedback!

  1. We describe the CPU features in executable_target; do we know the vscale for the given target at compile time? If so, we can probably attach such information to the config and use it in specialization on the host side.

Technically speaking, every target that implements SVE implements a specific vector length between 128 and 2048 bits, and vscale is that length's multiple of 128 bits (e.g. a 512-bit implementation has vscale = 4). Therefore, this value is known and fixed at runtime. However, the point of having scalable vectors in the first place, IIUC, is vector-length-agnostic programming. So although getting and propagating the specific vscale value for a specific target would be a step forward, the goal should IMO be to keep this as a runtime value. In that regard:

2. Maybe we need a mechanism to propagate such a runtime constant to the host (and vice versa if we want to provide the number of available threads to the device for distribution, but that is a different topic), so the host can use the value to do the allocation.

seems more on point to me :)

(2) is doable, we may only do it in an initializer, but not const-eval.

I think if we were to keep the value of vscale a runtime constant, we cannot do const-eval. But I had a question: is the initializer run on the "host" side or the "device" side? Nevertheless, I think the vscale op is illegal in the initializer as of now, so that's an action item :) I'll mark that on the issue I'll be creating.

@banach-space
Collaborator

Thank you for starting the discussion @egebeysel and for helping with this super exciting project! :)

My understanding is that it is a runtime constant depending on execution target.

Yes.

I think the host side needs to query the value somehow; we likely need a new mechanism for propagating vscale value.

Indeed. However, there could be an intermediate step that would work well for now - we make the host query itself. Could that work?

This would only make sense when "host" and "device" are physically identical - ATM that's the case for the devices that we are targeting. However, it wouldn't work for e.g. M4 (there's SME, but no SVE).

I think the padding size depends on vscale, so maybe the easiest way to unblock the prototype is to use the maximal value for vscale on the host side.

Yes, but this would mean that we assume there are 64 i32 values per vector register, instead of the 4 that you get for implementations whose vectors are 128 bits wide. So we would potentially be allocating much more than needed in practice. IMO this would be OK as a stepping stone, but not as a long-term solution.
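(For concreteness: 2048 bits / 32 bits = 64 i32 lanes at the architectural maximum, vs. 128 / 32 = 4 lanes for a 128-bit implementation - i.e. up to a 16x over-allocation.)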

So although getting and propagating the specific vscale value for a specific target would be a step forward, the goal should IMO be to keep this as a runtime value.

+100

Ideally, IREE would allow 3 options:

  1. Query the "device" to get the value of vscale. IREE would do it either at compile time or at run time - the latter would be preferable, but then we would definitely lose const-eval, right?
  2. Provide a compile-time flag/option/mechanism to specify vscale. This goes against the Vector Length Agnostic approach (we've avoided specifying vscale at compile time thus far), but would unblock us and allow us to experiment/iterate. This would allow us to keep const-eval, right?
  3. Conservatively assume that vectors are 2048 bits wide. While this would mean over-allocating in most cases known today (the widest SVE implementation that I am aware of is 512 bits), it would be safe, easy to implement and keep access to const-eval, right?

To me, [1] feels like the most natural approach, but it is also the most involved, and we may see perf degradation. [2] and [3] are good stepping stones. It's hard to pick a "favourite" without more perf data.

@egebeysel
Contributor Author

egebeysel commented Jul 9, 2025

3. Conservatively assume that vectors are 2048 bits wide. While this would mean over-allocating in most cases known today (the widest SVE implementation that I am aware of is 512 bits), it would be safe, easy to implement and keep access to const-eval, right?

IIUC, there are two separate things: storage size calculation, and the actual layout of the data resulting from the value of vscale, e.g. if the resulting layout from the packing looks like 16x16 -> 2x?x8x[8]. Essentially, what decides whether we can do const-eval is the layout the data actually has to be in. Packing the data as if we had vscale=16 regardless of the actual vscale value (and padding the rest) would give you const-eval, e.g. if the above example results in the "static" packing of 16x16 -> 2x1x8x64. Although I have a hunch that this would actually give us worse performance than no data-tiling.

On the other hand, storage size calculation decides how big the buffer allocated for the dispatch is, but does not impose anything regarding how much of that buffer the dispatch actually uses, how the data is laid out inside it, etc. This means we can overestimate and allocate a bigger buffer but then only use a portion of it, e.g. ---DATA----******************* where * denotes unused memory. This lets us avoid making the "host side" aware of the actual value of vscale while still allocating a buffer that is definitely big enough.
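To make that concrete (rough arithmetic, assuming an M tile of 8 and an N tile of 8 x vscale for the 16x16 example above): the packed buffer holds 2 * ceil(16 / (8 * vscale)) * 8 * (8 * vscale) elements, i.e. 2*2*8*8 = 256 elements at vscale = 1 and 2*1*8*16 = 256 elements at vscale = 2. The host can allocate for whatever upper bound on vscale it assumes, and the dispatch simply uses a prefix of that buffer.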

So as long as we want to pack the data into the more efficient layout, we lose const-eval anyway - unless the scalable dimension corresponds not to the weights/constants (the RHS in most cases) but to the other, non-constant input of mmt4d.

That's my current impression, please correct me if I'm mistaken or overlooking something.

Indeed. However, there could be an intermediate step that would work well for now - we make the host query itself. Could that work?

I also think that this would be good to have to unblock further progression.

Also FYI: I'm drafting the issue for this and will ping you on it to move the discussion there.

Collaborator

@banach-space banach-space left a comment

Hey @egebeysel, from the SVE perspective, this encoding materialisation makes sense to me. Your approach, i.e. the implementation, also looks good to me, but I will defer to the experts to confirm/review, CC @hanhanW.

I have left a number of comments, but those are mostly to add the wider SVE context and to set the direction.

IIUC, we can proceed with this while discussing the other issues separately:

as in, I don't see anything in those issues that would block this. Please let me know if I missed something. Thanks!

Comment on lines +70 to +73
// TODO(egebeysel): I think this isScalable*Enabled flag should be temporary
// and the temporary SME flag should probably come next to it.
Collaborator

I think this isScalable*Enabled flag should be temporary

Two things need to happen before we can enable "scalable vectorization" by default.

  • Perf parity between NEON and SVE. We might actually achieve that once data tiling supports SVE.
  • Stress testing of SVE. We already do that to a degree downstream, but we need to do more :)

In the meantime, we do need to keep isScalableVectorizationEnabled. Even when SVE reaches maturity, we may still want to keep a flag to fall back to NEON (e.g. for experimentation or to prototype new features).

the temporary SME flag should probably come next to it.

We enable SME when:

  • Scalable vectorization is "ON".
  • +sme is present as a target feature.

// TODO(egebeysel): I think this isScalable*Enabled flag should be temporary
// and the temporary SME flag should probably come next to it.
if (!isAArch64(config) || !isScalableVectorizationEnabled())
return failure();
Collaborator

Perhaps ...

Suggested change
return failure();
return notifyMatchFailure("Pre-conditions to enable scalable tiling are not met");

Contributor Author

Put this in a debug print :) Hope that also works!

Contributor

You can define a LDBG macro which simplifies the code. E.g., it can be LDBG("Pre-conditions to enable scalable tiling are not met!")

#define DEBUG_TYPE "iree-codegen-materialize-encoding"
#define DBGS() (llvm::dbgs() << "[" DEBUG_TYPE "]: ")
#define LDBG(X) LLVM_DEBUG(DBGS() << X << "\n")

Comment on lines 82 to 86
// TODO(egebeysel): Choosing between SME
// & SVE should also probably reside around here. I guess we could also just
// choose SME and then have the fallback option implemented with
// 2DScalableTo1D pass and also even from 1D to non-scalable. Currently,
// hard-coded to pick SVE since SME with DT is not yet supported.
Collaborator

I would trim this comment:

Suggested change
// TODO(egebeysel): Choosing between SME
// & SVE should also probably reside around here. I guess we could also just
// choose SME and then have the fallback option implemented with
// 2DScalableTo1D pass and also even from 1D to non-scalable. Currently,
// hard-coded to pick SVE since SME with DT is not yet supported.
// TODO(egebeysel): Add logic for SME.

Basically, the following means that SME is on and should be used instead of SVE:

  • `--iree-global-opt-enable-early-materialization=false --iree-llvmcpu-target-cpu-features="+sve,+sme"`.

We shouldn't require any other logic. With this in mind, since we don't support SME yet, it would be good to bail out if +sme is added on the command line.

Contributor Author

Currently doing this. One thing though - do we need the +sve flag for SME, or is +sme enough by itself? Because ... Apple M4

Collaborator

Generally speaking, when naming individual CPU target features, name them all / do not rely on implications between them. In principle we could propagate implied features, but in practice I don't believe that is currently happening consistently.

Contributor Author

Generally speaking, when naming individual CPU target features, name them all / do not rely on implications between them. In principle we could propagate implied features, but in practice I don't believe that is currently happening consistently.

Oh thanks! So you're saying that if we need both of the flags for SME (codegen) to be on, then I should name them both. That makes sense. The next question would be whether we have such an implication between SME & SVE, i.e. whether we need +sve for SME to be on. Because AFAIK, M4 only happens to support SME and not SVE :)

Collaborator

@bjacob bjacob Jul 17, 2025

Assuming that M4 does support both SVE and SME, the answer is that yes you should specify both, in the same way as it is useful to mention +i8mm,+dotprod and not just +i8mm. On x86 we have a worse version of this where we mention not only all AVX-512 supported features but also AVX, AVX2, FMA below that.

  • I don't believe it to be even possible for a CPU to support SME but not SVE. SME instructions have register operands that are SVE registers Z0-Z31, used for LHS and RHS tiles, separately from the ZA registers introduced by SME for the accumulator. You could check the armv9 architecture reference manual and I would expect it to say that SME implies SVE.

One way out of that is, instead of passing --iree-llvmcpu-target-cpu-features=..., you can pass --iree-llvmcpu-target-cpu=apple-m4. It is up to each CPU architecture to implement the semantics of that flag so on some CPU architectures this will fail to populate CPU features. But on Aarch64, this has been implemented in #19881 so you should be able to just pass --iree-llvmcpu-target-cpu=apple-m4.

On top of enabling CPU features, this enables LLVM scheduling models for the given CPU, so in theory this results in better codegen. In practice it's up to each LLVM target backend.

Collaborator

I don't believe it to be even possible for a CPU to support SME but not SVE

It is :)

Z0-Z31 registers from SME may (but don't have to) overlap with Z0-Z31 registers from SVE. And M4 is one example where SME was added while SVE was left out (see Wikipedia for Apple_M4).

On top of enabling CPU features, this enables LLVM scheduling models for the given CPU, so in theory this results in better codegen.

Side note, I am told that for big OoO cores, LLVM scheduling models are not really used/relevant.

I appreciate that I am not providing any "official" links here, but most of these things are not so visibly documented 😅

do we need the +sve flag for SME or is +sme enough by itself?

Given the above, on M4 you would only use +sme. Or, even better, what @bjacob suggested:

But on Aarch64, this has been implemented in #19881 so you should be able to just pass --iree-llvmcpu-target-cpu=apple-m4.

Collaborator

Thanks @banach-space , I stand corrected!

@@ -1,4 +1,5 @@
// RUN: iree-opt --split-input-file --pass-pipeline='builtin.module(iree-stream-specialize-encodings)' --verify-diagnostics %s | FileCheck %s
// RUN: iree-opt --split-input-file --pass-pipeline='builtin.module(iree-stream-specialize-encodings)' --iree-llvmcpu-enable-scalable-vectorization=true --verify-diagnostics %s | FileCheck %s --check-prefix=VSCALE
Collaborator

[nit] Here and in other places, avoid VSCALE as a check prefix; it clashes with vector.vscale and the VSCALE LIT variable. Perhaps SCALABLE_VEC?

Contributor Author

done

#encoding_result = #iree_encoding.encoding<operand_index = 2, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], iteration_sizes = [?, ?, ?]>
func.func @matmul_lowering_f32f32f32_aarch64() attributes {
hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="aarch64-xyz-xyz", ukernels = "all", iree.encoding.resolver = #iree_cpu.cpu_encoding_resolver<>}>
hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {cpu_features="+sve", target_triple="aarch64-xyz-xyz", ukernels = "all", iree.encoding.resolver = #iree_cpu.cpu_encoding_resolver<>}>
Collaborator

OK, so with this change and with the new RUN line, you demonstrate that:

  • +sve alone is not sufficient to get scalable tile sizes.

Makes sense.

@egebeysel egebeysel force-pushed the egb.sve_dt_materialization_pt1 branch 3 times, most recently from d996995 to a10d800 Compare July 17, 2025 17:33
@egebeysel
Contributor Author

egebeysel commented Jul 17, 2025

For reviewers, it would be nice if you could toggle the "Hide Whitespace" option for the tests.

Contributor

@hanhanW hanhanW left a comment

The change itself on materialization looks okay to me. How to lower the vscale op or query the value would be tricky in host compilation.

  1. Query the "device" to get the value of vscale. IREE would do it either at compile time or at run time - the latter would be preferable, but then we would definitely lose const-eval, right?

If we can't query it at compile time, we can only do it in the initialization phase.

  2. Provide a compile-time flag/option/mechanism to specify vscale. This goes against the Vector Length Agnostic approach (we've avoided specifying vscale at compile time thus far), but would unblock us and allow us to experiment/iterate. This would allow us to keep const-eval, right?

Yes, it should allow us to keep const-eval, because you can interpret the scalable flags in terms of concrete layouts.
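(E.g., assuming vscale were fixed at 2, an inner tile of [8] would just become a plain 16, so the packed layout is fully static.)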

  3. Conservatively assume that vectors are 2048 bits wide. While this would mean over-allocating in most cases known today (the widest SVE implementation that I am aware of is 512 bits), it would be safe, easy to implement and keep access to const-eval, right?

IIUC, there are two separate things: storage size calculation, and the actual layout of the data resulting from the value of vscale, e.g. if the resulting layout from the packing looks like 16x16 -> 2x?x8x[8]. Essentially, what decides whether we can do const-eval is the layout the data actually has to be in. Packing the data as if we had vscale=16 regardless of the actual vscale value (and padding the rest) would give you const-eval, e.g. if the above example results in the "static" packing of 16x16 -> 2x1x8x64. Although I have a hunch that this would actually give us worse performance than no data-tiling.

On the other hand, storage size calculation decides how big the buffer allocated for the dispatch is, but does not impose anything regarding how much of that buffer the dispatch actually uses, how the data is laid out inside it, etc. This means we can overestimate and allocate a bigger buffer but then only use a portion of it, e.g. ---DATA----******************* where * denotes unused memory. This lets us avoid making the "host side" aware of the actual value of vscale while still allocating a buffer that is definitely big enough.

So as long as we want to pack the data into the more efficient layout, we lose const-eval anyway - unless the scalable dimension corresponds not to the weights/constants (the RHS in most cases) but to the other, non-constant input of mmt4d.

That's my current impression, please correct me if I'm mistaken or overlooking something.

Yeah, I think const-eval is still not available if vscale can only be queried at runtime. E.g., if you don't have the target device, how do you get the vscale value at compile time?

A potential solution may be to duplicate the weights several times (with different vscale values) and pick the corresponding copy based on a vscale query, but that makes memory usage larger; we are far from experimenting with that.

Contributor

Can you add a test to https://github.com/iree-org/iree/blob/main/compiler/src/iree/compiler/Codegen/Dialect/Codegen/Utils/unittests/UtilsTest.cpp?

Generally, there are limited unit tests in LLVM/MLIR-based projects. However, it is tricky to separate these tests from stream, because that is the intent of encoding specialization. Perhaps we can create a testing pass in the future. For now, I'd add a few unit tests for it because it is a generic util/struct.

https://llvm.org/docs/TestingGuide.html#unit-tests

Contributor Author

Added one, thanks for the pointer. I didn't realise I had forgotten to implement some operator overloads for the scalable tiles; added them as well!


Comment on lines 77 to 79
// For shape inference, mark scalable tiles as dynamic sizes for the
// shape inference. Note, scalable tiles that are represented with
// static inner tile sizes.
Contributor

nit: remove dup words. (please format the comment. I can't format it easily because I edited it on web UI.)

Suggested change
// For shape inference, mark scalable tiles as dynamic sizes for the
// shape inference. Note, scalable tiles that are represented with
// static inner tile sizes.
// Mark scalable tiles as dynamic sizes for the
// shape inference. Note, scalable tiles that are represented with
// static inner tile sizes.

Contributor Author

done

@egebeysel egebeysel force-pushed the egb.sve_dt_materialization_pt1 branch from a10d800 to 94d3d53 Compare July 18, 2025 11:57
@egebeysel
Contributor Author

egebeysel commented Jul 18, 2025

How to lower vscale op or query the value would be tricky on host compilation.

Thanks for the review and for your comments! I opened an issue for the storage size calculation and const-evaluation, which IIUC covers the issues you mention in your other comments as well. I think we should keep this PR for the materialisation only and discuss those on the issues.

If you're talking about another issue regarding host compilation, I'd be happy to discuss it offline or in an issue as well :)

For now, this should not change the behaviour for data-tiling without SVE explicitly enabled, nor for SVE without data-tiling. To make this work end-to-end, I'll split the remaining issues and address them in separate PRs.

Therefore, I'd humbly ask for one more review after addressing the comments :) Thanks!

Also JFYI, I unfortunately cannot trigger the CI runs myself, so feel free to turn on automatic merging if you approve the PR :)

@egebeysel egebeysel force-pushed the egb.sve_dt_materialization_pt1 branch 2 times, most recently from cf5f593 to 23de490 Compare July 18, 2025 12:23
@egebeysel egebeysel force-pushed the egb.sve_dt_materialization_pt1 branch from 23de490 to 3d6d425 Compare July 18, 2025 12:27
Contributor

@hanhanW hanhanW left a comment

Just two nits, I'll land the PR after they are fixed.

Comment on lines 169 to 171
auto extractedScalableTiles = llvm::to_vector(llvm::map_range(
cast<ArrayAttr>(dictAttr.getNamed("scalableTiles")->getValue()),
[](Attribute a) { return cast<BoolAttr>(a).getValue(); }));
Contributor

I think what you're looking for is llvm::map_to_vector.

Contributor Author

Thanks, exactly :)

Comment on lines 77 to 79
// Mark scalable tiles as dynamic sizes for the
// shape inference. Note, scalable tiles that are represented with
// static inner tile sizes.
Contributor

Can you format it? I think "shape inference" can be on the first line.

Contributor Author

done, sorry I forgot!

@egebeysel egebeysel force-pushed the egb.sve_dt_materialization_pt1 branch from 3d6d425 to 10d9d20 Compare July 18, 2025 19:05
@hanhanW hanhanW enabled auto-merge (squash) July 18, 2025 20:33
auto-merge was automatically disabled July 19, 2025 10:22

Head branch was pushed to by a user without write access

@egebeysel egebeysel force-pushed the egb.sve_dt_materialization_pt1 branch from 10d9d20 to e14df68 Compare July 19, 2025 10:22
Collaborator

@banach-space banach-space left a comment

Thanks!

@egebeysel
Contributor Author

Thanks for the reviews, just a quick ping since I can't merge :) @hanhanW @banach-space

@hanhanW hanhanW merged commit 00eac3a into iree-org:main Jul 21, 2025
44 checks passed
@hanhanW
Contributor

hanhanW commented Jul 21, 2025

Thanks, I landed the change after removing the CC and adding the two tracking issue IDs to the commit description.

Contributor

@hanhanW hanhanW left a comment

Sorry that I did not see the issue in the first place. I think we need a few more tests that materialize set_encoding/unset_encoding into pack/unpack with scalable variants.

Can you help add test cases in a follow-up? I think we can do something similar to the example below, but switched to an aarch64 version; a rough sketch follows at the end of this comment.

#pipeline_layout = #hal.pipeline.layout<bindings = [
#hal.pipeline.binding<storage_buffer>,
#hal.pipeline.binding<storage_buffer>
]>
#encoding = #iree_encoding.encoding<operand_index = 0, op_type = matmul, element_types = [bf16, bf16, bf16], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iteration_sizes = [1, 1000, ?]>
func.func @set_encoding_with_padding_semantics_bf16_x86_64_avx512f() attributes {
hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f", iree.encoding.resolver = #iree_cpu.cpu_encoding_resolver<>}>
}{
%c0 = arith.constant 0 : index
%0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) alignment(64) offset(%c0) flags(ReadOnly) : !iree_tensor_ext.dispatch.tensor<readonly:tensor<1x1000xbf16>>
%1 = hal.interface.binding.subspan layout(#pipeline_layout) binding(1) alignment(64) offset(%c0) : !iree_tensor_ext.dispatch.tensor<writeonly:tensor<1x1000xbf16, #encoding>>
%2 = iree_tensor_ext.dispatch.tensor.load %0, offsets = [0, 0], sizes = [1, 1000], strides = [1, 1] : !iree_tensor_ext.dispatch.tensor<readonly:tensor<1x1000xbf16>> -> tensor<1x1000xbf16>
%3 = iree_encoding.set_encoding %2 : tensor<1x1000xbf16> -> tensor<1x1000xbf16, #encoding>
iree_tensor_ext.dispatch.tensor.store %3, %1, offsets = [0, 0], sizes = [1, 1000], strides = [1, 1] : tensor<1x1000xbf16, #encoding> -> !iree_tensor_ext.dispatch.tensor<writeonly:tensor<1x1000xbf16, #encoding>>
return
}

Note: we can't return the result of set_encoding directly, because we don't have a pattern to materialize encodings for func ops. It is on my TODO list, but I never get cycles for it: #20825
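For reference, a rough sketch of what the aarch64/SVE variant might look like (untested; I switched the element type to f32, and the target attributes are my guesses - the RUN line would also need --iree-llvmcpu-enable-scalable-vectorization=true):

#pipeline_layout = #hal.pipeline.layout<bindings = [
#hal.pipeline.binding<storage_buffer>,
#hal.pipeline.binding<storage_buffer>
]>
#encoding = #iree_encoding.encoding<operand_index = 0, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iteration_sizes = [1, 1000, ?]>
func.func @set_encoding_with_padding_semantics_f32_aarch64_sve() attributes {
hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="aarch64-xyz-xyz", cpu_features="+sve", iree.encoding.resolver = #iree_cpu.cpu_encoding_resolver<>}>
}{
%c0 = arith.constant 0 : index
%0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) alignment(64) offset(%c0) flags(ReadOnly) : !iree_tensor_ext.dispatch.tensor<readonly:tensor<1x1000xf32>>
%1 = hal.interface.binding.subspan layout(#pipeline_layout) binding(1) alignment(64) offset(%c0) : !iree_tensor_ext.dispatch.tensor<writeonly:tensor<1x1000xf32, #encoding>>
%2 = iree_tensor_ext.dispatch.tensor.load %0, offsets = [0, 0], sizes = [1, 1000], strides = [1, 1] : !iree_tensor_ext.dispatch.tensor<readonly:tensor<1x1000xf32>> -> tensor<1x1000xf32>
%3 = iree_encoding.set_encoding %2 : tensor<1x1000xf32> -> tensor<1x1000xf32, #encoding>
iree_tensor_ext.dispatch.tensor.store %3, %1, offsets = [0, 0], sizes = [1, 1000], strides = [1, 1] : tensor<1x1000xf32, #encoding> -> !iree_tensor_ext.dispatch.tensor<writeonly:tensor<1x1000xf32, #encoding>>
return
}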

@banach-space
Copy link
Collaborator

banach-space commented Jul 31, 2025

Can you help add test cases in a follow-up?

Oh yeah, I see what's missing. Let me take care of this.

EDIT: #21560

banach-space added a commit to banach-space/iree that referenced this pull request Aug 1, 2025
Since iree-org#21304, encoding materialisation supports scalable tile sizes.
This patch adds tests for that logic that we missed in the original
revision.

The newly added tests mirror similar tests for X86, with the added
emphasis on scalable tile sizes. Note, ATM we only make the "N"
dimension scalable, which matches SVE. For SME (TODO), we will make both
"N" and "M" dimensions scalable.
banach-space added a commit to banach-space/iree that referenced this pull request Aug 1, 2025
Since iree-org#21304, encoding materialisation supports scalable tile sizes.
This patch adds tests for that logic that we missed in the original
revision.

The newly added tests mirror similar tests for X86, with the added
emphasis on scalable tile sizes. Note, ATM we only make the "N"
dimension scalable, which matches SVE. For SME (TODO), we will make both
"N" and "M" dimensions scalable.

Signed-off-by: Andrzej Warzynski <[email protected]>
banach-space added a commit to banach-space/iree that referenced this pull request Aug 1, 2025
Since iree-org#21304, encoding materialisation supports scalable tile sizes.
This patch adds tests for that logic that we missed in the original
revision.

The newly added tests mirror similar tests for X86, with the added
emphasis on scalable tile sizes. Note, ATM we only make the "N"
dimension scalable, which matches SVE. For SME (TODO), we will make both
"N" and "M" dimensions scalable.

Signed-off-by: Andrzej Warzynski <[email protected]>
banach-space added a commit that referenced this pull request Aug 5, 2025
)

Since #21304, encoding materialisation supports scalable tile sizes.
This patch adds tests for that logic that we missed in the original
revision.

The newly added tests mirror similar tests for X86, with the added
emphasis on scalable tile sizes. Note, ATM we only make the "N"
dimension scalable, which matches SVE. For SME (TODO), we will make both
"N" and "M" dimensions scalable.

---------

Signed-off-by: Andrzej Warzynski <[email protected]>
hhkit pushed a commit to opencompl/iree that referenced this pull request Aug 8, 2025
…e-org#21560)

Since iree-org#21304, encoding materialisation supports scalable tile sizes.
This patch adds tests for that logic that we missed in the original
revision.

The newly added tests mirror similar tests for X86, with the added
emphasis on scalable tile sizes. Note, ATM we only make the "N"
dimension scalable, which matches SVE. For SME (TODO), we will make both
"N" and "M" dimensions scalable.

---------

Signed-off-by: Andrzej Warzynski <[email protected]>
keshavvinayak01 pushed a commit to keshavvinayak01/iree that referenced this pull request Sep 4, 2025
…r mmt4d (iree-org#21304)

This PR enables encoding materialization/specialization for mmt4d. It is
a step towards iree-org#16162. This PR is
the first out of N to follow on that regard.

Currently, materialization and specialization of mmt4d encodings are
supported, although this does not yet work e2e. The current tiling setup
does not yet out-of-the box support scalable inner tiles. The inner tile
sizes that we choose here are supposed to be scalability-agnostic,
although we should definitely go back to this on another issue and
empirically support this.

Generally, this works for device encoding materialization, but I'm not
entirely sure how this would do for host encodings. I'd appreciate your
opinions/guidance on that regard. I'm not entirely sure if we want to
allow vscale ops to reside in host side code.

I've also done some investigation on enabling this e2e, however there
are some issues and I will create separate issues with reproducers for
them ASAIC. But a quick sneak peek:

- Storage size calculation of tensor types with scalable tiles: here, we
need the vscale op to correctly estimate the size of the packed tensor -
which AFAICS ends up in the host side. Tracking issue: iree-org#21317
- Since we're making the N dim scalable, we lose the ability to
const-eval the weights constants and vscale ops end up in initializers,
which AFAICS is not supported right now. I'm not sure if this would be a
problem or not, just need opinions on how to deal with this :) Tracking issue: iree-org#21319
- Tile size and scalable tile flag propagation in dispatches with
pack/unpack/mmt4d: the tile sizes and (more importantly) scalable tile
flag information should be correctly propagated to other ops in the same
dispatch.

---------

Signed-off-by: Ege Beysel <[email protected]>
Signed-off-by: keshavvinayak01 <[email protected]>
keshavvinayak01 pushed a commit to keshavvinayak01/iree that referenced this pull request Sep 4, 2025
…e-org#21560)

Since iree-org#21304, encoding materialisation supports scalable tile sizes.
This patch adds tests for that logic that we missed in the original
revision.

The newly added tests mirror similar tests for X86, with the added
emphasis on scalable tile sizes. Note, ATM we only make the "N"
dimension scalable, which matches SVE. For SME (TODO), we will make both
"N" and "M" dimensions scalable.

---------

Signed-off-by: Andrzej Warzynski <[email protected]>
Signed-off-by: keshavvinayak01 <[email protected]>