Conversation

@egebeysel
Contributor

This PR enables encoding materialization/specialization for mmt4d. It is a step towards #16162, and is the first of N PRs to follow in that direction.

Currently, materialization and specialization of mmt4d encodings are supported, although this does not yet work e2e. The current tiling setup does not yet support scalable inner tiles out of the box. The inner tile sizes chosen here are meant to be scalability-agnostic, although we should definitely revisit this in a separate issue and validate the choice empirically.
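To illustrate the idea (a rough sketch, not the exact IR this PR produces; names, shapes and tile sizes are made up), a scalable inner tile can be expressed as a vscale-dependent dynamic tile size when materializing the pack:

// Illustrative sketch: pack a 16x16 RHS with an M tile of 8 and a scalable
// N tile of 8 x vscale, i.e. an inner tile of [8] on the N dimension.
func.func @pack_rhs_scalable(%rhs: tensor<16x16xf32>, %dest: tensor<2x?x8x?xf32>) -> tensor<2x?x8x?xf32> {
  %pad = arith.constant 0.0 : f32
  %c8 = arith.constant 8 : index
  // Runtime query of the hardware vector-length multiple.
  %vscale = vector.vscale
  %n_tile = arith.muli %c8, %vscale : index
  // The scalable tile becomes a dynamic inner tile size of the pack.
  %packed = linalg.pack %rhs padding_value(%pad : f32)
      inner_dims_pos = [0, 1] inner_tiles = [8, %n_tile]
      into %dest : tensor<16x16xf32> -> tensor<2x?x8x?xf32>
  return %packed : tensor<2x?x8x?xf32>
}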

Generally, this works for device encoding materialization, but I'm not entirely sure how this would do for host encodings. I'd appreciate your opinions/guidance in that regard. In particular, I'm not sure whether we want to allow vscale ops to reside in host-side code.

I've also done some investigation into enabling this e2e; however, there are some issues, and I will create separate issues with reproducers for them as soon as I can. A quick sneak peek:

  • Storage size calculation of tensor types with scalable tiles: here, we need the vscale op to correctly estimate the size of the packed tensor - which AFAICS ends up on the host side.
  • Since we're making the N dim scalable, we lose the ability to const-eval the weight constants, and vscale ops end up in initializers, which AFAICS is not supported right now. I'm not sure if this would be a problem or not; I just need opinions on how to deal with this :)
  • Tile size and scalable tile flag propagation in dispatches with pack/unpack/mmt4d: the tile sizes and (more importantly) scalable tile flag information should be correctly propagated to other ops in the same dispatch.

cc @banach-space

@egebeysel egebeysel force-pushed the egb.sve_dt_materialization_pt1 branch from a7f6352 to 53e254d Compare July 8, 2025 15:05
@hanhanW hanhanW requested a review from bjacob July 8, 2025 16:15
@hanhanW
Contributor

hanhanW commented Jul 8, 2025

Thanks for the PR, it does remind me to review llvm/llvm-project#146531; I'll take a look today.

Storage size calculation of tensor types with scalable tiles: here, we need the vscale op to correctly estimate the size of the packed tensor - which AFAICS ends up on the host side.

I'm not an expert here, but I feel that it could be off if the host and device are not the same physical device. How does vscale work in practice? My understanding is that it is a runtime constant depending on execution target. I think the host side needs to query the value somehow; we likely need a new mechanism for propagating vscale value. I have a few ideas here, and other ideas are welcome:

  1. We describe the CPU features in executable_target; do we know the vscale for the given target at compile time? If so, we can probably attach such information to the config and use it in specialization on the host side.
  2. Maybe we need a mechanism to propagate such a runtime constant to the host (and vice versa if we want to provide the number of available threads to the device for distribution, but that is a different topic), so the host can use the value to do the allocation.

Since we're making the N dim scalable, we lose the ability to const-eval the weight constants, and vscale ops end up in initializers, which AFAICS is not supported right now. I'm not sure if this would be a problem or not; I just need opinions on how to deal with this :)

If (1) is doable, you can always replace the vscale with a specific number, and we'll be able to do const-eval (not ready yet). If (1) is not doable and (2) is doable, we may only do it in an initializer, but not const-eval.
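Roughly, in IR terms (an illustrative sketch, not actual IREE code):

// Option (2): vscale stays a runtime query.
%vscale = vector.vscale
// Option (1): if the target's vscale is known at compile time (say 2), the
// query can be folded to a constant and the packed sizes become static.
%vscale_const = arith.constant 2 : index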

@hanhanW
Contributor

hanhanW commented Jul 8, 2025

I think the padding size depends on vscale, so maybe the easiest way to unblock the prototype is to use the maximal value for vscale on the host side.

@egebeysel
Contributor Author

Thanks for the feedback!

  1. We describe the CPU features in executable_target; do we know the vscale for the given target at compile time? If so, we can probably attach such information to the config and use it in specialization on the host side.

Technically speaking, every target that implements SVE implements a specific vector length between 128 and 2048 bits, and vscale is that length's multiple of 128 bits (e.g. a 512-bit implementation has vscale = 4). Therefore, this value is known and fixed at runtime. However, the point of having scalable vectors in the first place, IIUC, is vector-length-agnostic programming. So although getting and propagating the specific vscale value for a specific target would be a step forward, the goal should IMO be to keep this as a runtime value. In that regard:

2. Maybe we need a mechanism to propagate such a runtime constant to the host (and vice versa if we want to provide the number of available threads to the device for distribution, but that is a different topic), so the host can use the value to do the allocation.

seems more on point to me :)

(2) is doable, we may only do it in an initializer, but not const-eval.

I think if we were to keep the value of vscale a runtime constant, we cannot do const-eval. But I had a question: is the initializer run on the "host" side or the "device" side? Nevertheless, I think the vscale op is illegal in the initializer as of now, so that's an action item :) I'll mark that on the issue I'll be creating.

@banach-space
Collaborator

Thank you for starting the discussion @egebeysel and for helping with this super exciting project! :)

My understanding is that it is a runtime constant depending on execution target.

Yes.

I think the host side needs to query the value somehow; we likely need a new mechanism for propagating vscale value.

Indeed. However, there could be an intermediate step that would work well for now - we make the host query itself. Could that work?

This would only make sense when "host" and "device" are physically identical - ATM that's the case for the devices that we are targeting. However, it wouldn't work for e.g. M4 (there's SME, but no SVE).

I think the padding size depends on vscale, so maybe the easiest way to unblock the prototype is to use the maximal value for vscale on the host side.

Yes, but this would mean that we assume there are 64 i32 values per vector register, instead of the 4 that you get for implementations whose vectors are 128 bits wide. So we would potentially be allocating much more than needed in practice. IMO this would be OK as a stepping stone, but not as a long-term solution.
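(For concreteness: 2048 bits / 32 bits = 64 i32 lanes at the architectural maximum, vs. 128 / 32 = 4 lanes for a 128-bit implementation - i.e. up to a 16x over-allocation.)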

So although getting and propagating the specific vscale value for a specific target would be a step forward, the goal should IMO be to keep this as a runtime value.

+100

Ideally, IREE would allow 3 options:

  1. Query the "device" to get the value of vscale. IREE would do it either at compile time or at run time - the latter would be preferable, but then we would definitely lose const-eval, right?
  2. Provide a compile-time flag/option/mechanism to specify vscale. This goes against the Vector Length Agnostic approach (we've avoided specifying vscale at compile time thus far), but would unblock us and allow us to experiment/iterate. This would allow us to keep const-eval, right?
  3. Conservatively assume that vectors are 2048 bits wide. While this would mean over-allocating in most cases known today (the widest SVE implementation that I am aware of is 512 bits), it would be safe, easy to implement and keep access to const-eval, right?

To me, [1] feels like the most natural approach, but it is also the most involved, and we may see perf degradation. [2] and [3] are good stepping stones. It's hard to pick a "favourite" without more perf data.

@egebeysel
Contributor Author

egebeysel commented Jul 9, 2025

3. Conservatively assume that vectors are 2048 bits wide. While this would mean over-allocating in most cases known today (the widest SVE implementation that I am aware of is 512 bits), it would be safe, easy to implement and keep access to const-eval, right?

IIUC, there are two separate things: storage size calculation, and the actual layout of the data resulting from the value of vscale, e.g. if the resulting layout from the packing looks like 16x16 -> 2x?x8x[8]. Essentially, what decides whether we can do const-eval is the layout the data actually has to be in. Packing the data as if we had vscale=16 regardless of the actual vscale value (and padding the rest) would give you const-eval, e.g. if the above example results in the "static" packing of 16x16 -> 2x1x8x64. Although I have a hunch that this would actually give us worse performance than no data-tiling.

On the other hand, storage size calculation decides how big the buffer allocated for the dispatch is, but does not impose anything regarding how much of that buffer the dispatch actually uses, how the data is laid out inside it, etc. This means we can overestimate and allocate a bigger buffer but then only use a portion of it, e.g. ---DATA----******************* where * denotes unused memory. This lets us avoid making the "host side" aware of the actual value of vscale while still allocating a buffer that is definitely big enough.
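To make that concrete (rough arithmetic, assuming an M tile of 8 and an N tile of 8 x vscale for the 16x16 example above): the packed buffer holds 2 * ceil(16 / (8 * vscale)) * 8 * (8 * vscale) elements, i.e. 2*2*8*8 = 256 elements at vscale = 1 and 2*1*8*16 = 256 elements at vscale = 2. The host can allocate for whatever upper bound on vscale it assumes, and the dispatch simply uses a prefix of that buffer.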

So as long as we want to pack the data into the more efficient layout, we lose const-eval anyway - unless the scalable dimension corresponds not to the weights/constants (the RHS in most cases) but to the other, non-constant input of mmt4d.

That's my current impression, please correct me if I'm mistaken or overlooking something.

Indeed. However, there could be an intermediate step that would work well for now - we make the host query itself. Could that work?

I also think that this would be good to have to unblock further progression.

Also FYI: I'm drafting the issue for this and will ping you on it to move the discussion there.

Collaborator

@banach-space banach-space left a comment

Hey @egebeysel, from the SVE perspective, this encoding materialisation makes sense to me. Your approach, i.e. the implementation, also looks good to me, but I will defer to the experts to confirm/review, CC @hanhanW.

I have left a number of comments, but those are mostly to add the wider SVE context and to set the direction.

IIUC, we can proceed with this while discussing the other issues separately:

as in, I don't see anything in those issues that would block this. Please let me know if I missed something. Thanks!

Comment on lines +70 to +73
// TODO(egebeysel): I think this isScalable*Enabled flag should be temporary
// and the temporary SME flag should probably come next to it.
Collaborator

I think this isScalable*Enabled flag should be temporary

Two things need to happen before we can enable "scalable vectorization" by default.

  • Perf parity between NEON and SVE. We might actually achieve that once data tiling supports SVE.
  • Stress testing of SVE. We already do that to a degree downstream, but we need to do more :)

In the meantime, we do need to keep isScalableVectorizationEnabled. Even when SVE reaches maturity, we may still want to keep a flag to fall back to NEON (e.g. for experimentation or to prototype new features).

the temporary SME flag should probably come next to it.

We enable SME when:

  • Scalable vectorization is "ON".
  • +sme is present as a target feature.

// TODO(egebeysel): I think this isScalable*Enabled flag should be temporary
// and the temporary SME flag should probably come next to it.
if (!isAArch64(config) || !isScalableVectorizationEnabled())
return failure();
Collaborator

Perhaps ...

Suggested change
return failure();
return notifyMatchFailure("Pre-conditions to enable scalable tiling are not met");

Contributor Author

Put this in a debug print :) Hope that also works!

Contributor

You can define a LDBG macro which simplifies the code. E.g., it can be LDBG("Pre-conditions to enable scalable tiling are not met!")

#define DEBUG_TYPE "iree-codegen-materialize-encoding"
#define DBGS() (llvm::dbgs() << "[" DEBUG_TYPE "]: ")
#define LDBG(X) LLVM_DEBUG(DBGS() << X << "\n")

Comment on lines 82 to 86
// TODO(egebeysel): Choosing between SME
// & SVE should also probably reside around here. I guess we could also just
// choose SME and then have the fallback option implemented with
// 2DScalableTo1D pass and also even from 1D to non-scalable. Currently,
// hard-coded to pick SVE since SME with DT is not yet supported.
Collaborator

I would trim this comment:

Suggested change
// TODO(egebeysel): Choosing between SME
// & SVE should also probably reside around here. I guess we could also just
// choose SME and then have the fallback option implemented with
// 2DScalableTo1D pass and also even from 1D to non-scalable. Currently,
// hard-coded to pick SVE since SME with DT is not yet supported.
// TODO(egebeysel): Add logic for SME.

Basically, the following means that SME is on and should be used instead of SVE:

  • `--iree-global-opt-enable-early-materialization=false --iree-llvmcpu-target-cpu-features="+sve,+sme"`.

We shouldn't require any other logic. With this in mind, since we don't support SME yet, it would be good to bail out if +sme is added on the command line.

Contributor Author

Currently doing this. One thing though - do we need the +sve flag for SME, or is +sme enough by itself? Because ... Apple M4

Collaborator

Generally speaking, when naming individual CPU target features, name them all / do not rely on implications between them. In principle we could propagate implied features, but in practice I don't believe that is currently happening consistently.

Contributor Author

Generally speaking, when naming individual CPU target features, name them all / do not rely on implications between them. In principle we could propagate implied features, but in practice I don't believe that is currently happening consistently.

Oh thanks! So you're saying that if we need both of the flags for SME (codegen) to be on, then I should name them both. That makes sense. The next question would be whether we have such an implication between SME & SVE, i.e. whether we need +sve for SME to be on. Because AFAIK, M4 only happens to support SME and not SVE :)

Collaborator

@bjacob bjacob Jul 17, 2025

Assuming that M4 does support both SVE and SME, the answer is that yes you should specify both, in the same way as it is useful to mention +i8mm,+dotprod and not just +i8mm. On x86 we have a worse version of this where we mention not only all AVX-512 supported features but also AVX, AVX2, FMA below that.

  • I don't believe it to be even possible for a CPU to support SME but not SVE. SME instructions have register operands that are SVE registers Z0-Z31, used for LHS and RHS tiles, separately from the ZA registers introduced by SME for the accumulator. You could check the armv9 architecture reference manual and I would expect it to say that SME implies SVE.

One way out of that is, instead of passing --iree-llvmcpu-target-cpu-features=..., you can pass --iree-llvmcpu-target-cpu=apple-m4. It is up to each CPU architecture to implement the semantics of that flag so on some CPU architectures this will fail to populate CPU features. But on Aarch64, this has been implemented in #19881 so you should be able to just pass --iree-llvmcpu-target-cpu=apple-m4.

On top of enabling CPU features, this enables LLVM scheduling models for the given CPU, so in theory this results in better codegen. In practice it's up to each LLVM target backend.

Collaborator

I don't believe it to be even possible for a CPU to support SME but not SVE

It is :)

Z0-Z31 registers from SME may (but don't have to) overlap with Z0-Z31 registers from SVE. And M4 is one example where SME was added while SVE was left out (see Wikipedia for Apple_M4).

On top of enabling CPU features, this enables LLVM scheduling models for the given CPU, so in theory this results in better codegen.

Side note, I am told that for big OoO cores, LLVM scheduling models are not really used/relevant.

I appreciate that I am not providing any "official" links here, but most of these things are not so visibly documented 😅

do we need the +sve flag for SME or is +sme enough by itself?

Given the above, on M4 you would only use +sme. Or, even better, what @bjacob suggested:

But on Aarch64, this has been implemented in #19881 so you should be able to just pass --iree-llvmcpu-target-cpu=apple-m4.

Collaborator

Thanks @banach-space , I stand corrected!

@@ -1,4 +1,5 @@
// RUN: iree-opt --split-input-file --pass-pipeline='builtin.module(iree-stream-specialize-encodings)' --verify-diagnostics %s | FileCheck %s
// RUN: iree-opt --split-input-file --pass-pipeline='builtin.module(iree-stream-specialize-encodings)' --iree-llvmcpu-enable-scalable-vectorization=true --verify-diagnostics %s | FileCheck %s --check-prefix=VSCALE
Collaborator

[nit] Here and in other places, avoid VSCALE as a check prefix; it clashes with vector.vscale and the VSCALE LIT variable. Perhaps SCALABLE_VEC?

Contributor Author

done

#encoding_result = #iree_encoding.encoding<operand_index = 2, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], iteration_sizes = [?, ?, ?]>
func.func @matmul_lowering_f32f32f32_aarch64() attributes {
hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="aarch64-xyz-xyz", ukernels = "all", iree.encoding.resolver = #iree_cpu.cpu_encoding_resolver<>}>
hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {cpu_features="+sve", target_triple="aarch64-xyz-xyz", ukernels = "all", iree.encoding.resolver = #iree_cpu.cpu_encoding_resolver<>}>
Collaborator

OK, so with this change and with the new RUN line, you demonstrate that:

  • +sve alone is not sufficient to get scalable tile sizes.

Makes sense.

@egebeysel egebeysel force-pushed the egb.sve_dt_materialization_pt1 branch 3 times, most recently from d996995 to a10d800 Compare July 17, 2025 17:33
@egebeysel
Contributor Author

egebeysel commented Jul 17, 2025

For reviewers, it would be nice if you could toggle the "Hide Whitespace" option for the tests.

Contributor

@hanhanW hanhanW left a comment

The change itself on materialization looks okay to me. How to lower the vscale op or query the value would be tricky in host compilation.

  1. Query the "device" to get the value of vscale. IREE would do it either at compile time or at run time - the latter would be preferable, but then we would definitely lose const-eval, right?

If we can't query it at compile time, we can only do it in the initialization phase.

  2. Provide a compile-time flag/option/mechanism to specify vscale. This goes against the Vector Length Agnostic approach (we've avoided specifying vscale at compile time thus far), but would unblock us and allow us to experiment/iterate. This would allow us to keep const-eval, right?

Yes, it should allow us to keep const-eval, because you can interpret the scalable flags in terms of concrete layouts.
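(E.g., assuming vscale were fixed at 2, an inner tile of [8] would just become a plain 16, so the packed layout is fully static.)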

  3. Conservatively assume that vectors are 2048 bits wide. While this would mean over-allocating in most cases known today (the widest SVE implementation that I am aware of is 512 bits), it would be safe, easy to implement and keep access to const-eval, right?

IIUC, there are two separate things: storage size calculation, and the actual layout of the data resulting from the value of vscale, e.g. if the resulting layout from the packing looks like 16x16 -> 2x?x8x[8]. Essentially, what decides whether we can do const-eval is the layout the data actually has to be in. Packing the data as if we had vscale=16 regardless of the actual vscale value (and padding the rest) would give you const-eval, e.g. if the above example results in the "static" packing of 16x16 -> 2x1x8x64. Although I have a hunch that this would actually give us worse performance than no data-tiling.

On the other hand, storage size calculation decides how big the buffer allocated for the dispatch is, but does not impose anything regarding how much of that buffer the dispatch actually uses, how the data is laid out inside it, etc. This means we can overestimate and allocate a bigger buffer but then only use a portion of it, e.g. ---DATA----******************* where * denotes unused memory. This lets us avoid making the "host side" aware of the actual value of vscale while still allocating a buffer that is definitely big enough.

So as long as we want to pack the data into the more efficient layout, we lose const-eval anyway - unless the scalable dimension corresponds not to the weights/constants (the RHS in most cases) but to the other, non-constant input of mmt4d.

That's my current impression, please correct me if I'm mistaken or overlooking something.

Yeah, I think const-eval is still not available if vscale can only be queried at runtime. E.g., if you don't have the target device, how do you get the vscale value at compile time?

A potential solution may be to duplicate the weights several times (with different vscale values) and pick the corresponding copy based on a vscale query, but that makes memory usage larger; we are far from experimenting with that.

Contributor

Can you add a test to https://github.com/iree-org/iree/blob/main/compiler/src/iree/compiler/Codegen/Dialect/Codegen/Utils/unittests/UtilsTest.cpp?

Generally, there are limited unit tests in LLVM/MLIR-based projects. However, it is tricky to separate these tests from stream, because that is the intent of encoding specialization. Perhaps we can create a testing pass in the future. For now, I'd add a few unit tests for it because it is a generic util/struct.

https://llvm.org/docs/TestingGuide.html#unit-tests

Contributor Author

Added one, thanks for the pointer. I didn't realise I had forgotten to implement some operator overloads for the scalable tiles; added them as well!


Comment on lines 77 to 79
// For shape inference, mark scalable tiles as dynamic sizes for the
// shape inference. Note, scalable tiles that are represented with
// static inner tile sizes.
Contributor

nit: remove dup words. (please format the comment. I can't format it easily because I edited it on web UI.)

Suggested change
// For shape inference, mark scalable tiles as dynamic sizes for the
// shape inference. Note, scalable tiles that are represented with
// static inner tile sizes.
// Mark scalable tiles as dynamic sizes for the
// shape inference. Note, scalable tiles that are represented with
// static inner tile sizes.

Contributor Author

done

@egebeysel egebeysel force-pushed the egb.sve_dt_materialization_pt1 branch from a10d800 to 94d3d53 Compare July 18, 2025 11:57
@egebeysel
Contributor Author

egebeysel commented Jul 18, 2025

How to lower vscale op or query the value would be tricky on host compilation.

Thanks for the review and for your comments! I opened an issue for the storage size calculation and const-evaluation, which IIUC covers the issues you mention in your other comments as well. I think we should keep this PR for the materialisation only and discuss those on the issues.

If you're talking about another issue regarding host compilation, I'd be happy to discuss it offline or in an issue as well :)

For now, this should not change the behaviour for data-tiling without SVE explicitly enabled, nor for SVE without data-tiling. To make this work end-to-end, I'll split the remaining issues and address them in separate PRs.

Therefore, I'd humbly ask for one more review after addressing the comments :) Thanks!

Also JFYI, I unfortunately cannot trigger the CI runs myself, so feel free to turn on automatic merging if you approve the PR :)

@egebeysel egebeysel force-pushed the egb.sve_dt_materialization_pt1 branch 2 times, most recently from cf5f593 to 23de490 Compare July 18, 2025 12:23
@egebeysel egebeysel force-pushed the egb.sve_dt_materialization_pt1 branch from 23de490 to 3d6d425 Compare July 18, 2025 12:27
Contributor

@hanhanW hanhanW left a comment

Just two nits, I'll land the PR after they are fixed.

Comment on lines 169 to 171
auto extractedScalableTiles = llvm::to_vector(llvm::map_range(
cast<ArrayAttr>(dictAttr.getNamed("scalableTiles")->getValue()),
[](Attribute a) { return cast<BoolAttr>(a).getValue(); }));
Contributor

I think what you're looking for is llvm::map_to_vector.

Contributor Author

Thanks, exactly :)

Comment on lines 77 to 79
// Mark scalable tiles as dynamic sizes for the
// shape inference. Note, scalable tiles that are represented with
// static inner tile sizes.
Contributor

Can you format it? I think "shape inference" can be on the first line.

Contributor Author

done, sorry I forgot!

@egebeysel egebeysel force-pushed the egb.sve_dt_materialization_pt1 branch from 3d6d425 to 10d9d20 Compare July 18, 2025 19:05
@hanhanW hanhanW enabled auto-merge (squash) July 18, 2025 20:33
auto-merge was automatically disabled July 19, 2025 10:22

Head branch was pushed to by a user without write access

@egebeysel egebeysel force-pushed the egb.sve_dt_materialization_pt1 branch from 10d9d20 to e14df68 Compare July 19, 2025 10:22
Collaborator

@banach-space banach-space left a comment

Thanks!

@egebeysel
Contributor Author

Thanks for the reviews, just a quick ping since I can't merge :) @hanhanW @banach-space

@hanhanW hanhanW merged commit 00eac3a into iree-org:main Jul 21, 2025
44 checks passed
@hanhanW
Contributor

hanhanW commented Jul 21, 2025

Thanks, I landed the change after removing the CC and adding the two tracking issue IDs to the commit description.

Contributor

@hanhanW hanhanW left a comment

Sorry that I did not see the issue in the first place. I think we need a few more tests that materialize set_encoding/unset_encoding into pack/unpack with scalable variants.

Can you help add test cases in a follow-up? I think we can do something similar to the example below, but switched to an aarch64 version; a rough sketch follows at the end of this comment.

#pipeline_layout = #hal.pipeline.layout<bindings = [
#hal.pipeline.binding<storage_buffer>,
#hal.pipeline.binding<storage_buffer>
]>
#encoding = #iree_encoding.encoding<operand_index = 0, op_type = matmul, element_types = [bf16, bf16, bf16], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iteration_sizes = [1, 1000, ?]>
func.func @set_encoding_with_padding_semantics_bf16_x86_64_avx512f() attributes {
hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f", iree.encoding.resolver = #iree_cpu.cpu_encoding_resolver<>}>
}{
%c0 = arith.constant 0 : index
%0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) alignment(64) offset(%c0) flags(ReadOnly) : !iree_tensor_ext.dispatch.tensor<readonly:tensor<1x1000xbf16>>
%1 = hal.interface.binding.subspan layout(#pipeline_layout) binding(1) alignment(64) offset(%c0) : !iree_tensor_ext.dispatch.tensor<writeonly:tensor<1x1000xbf16, #encoding>>
%2 = iree_tensor_ext.dispatch.tensor.load %0, offsets = [0, 0], sizes = [1, 1000], strides = [1, 1] : !iree_tensor_ext.dispatch.tensor<readonly:tensor<1x1000xbf16>> -> tensor<1x1000xbf16>
%3 = iree_encoding.set_encoding %2 : tensor<1x1000xbf16> -> tensor<1x1000xbf16, #encoding>
iree_tensor_ext.dispatch.tensor.store %3, %1, offsets = [0, 0], sizes = [1, 1000], strides = [1, 1] : tensor<1x1000xbf16, #encoding> -> !iree_tensor_ext.dispatch.tensor<writeonly:tensor<1x1000xbf16, #encoding>>
return
}

Note: we can't return the result of set_encoding directly, because we don't have a pattern to materialize encodings for func ops. It is on my TODO list, but I never get cycles for it: #20825
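For reference, a rough sketch of what the aarch64/SVE variant might look like (untested; I switched the element type to f32, and the target attributes are my guesses - the RUN line would also need --iree-llvmcpu-enable-scalable-vectorization=true):

#pipeline_layout = #hal.pipeline.layout<bindings = [
#hal.pipeline.binding<storage_buffer>,
#hal.pipeline.binding<storage_buffer>
]>
#encoding = #iree_encoding.encoding<operand_index = 0, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iteration_sizes = [1, 1000, ?]>
func.func @set_encoding_with_padding_semantics_f32_aarch64_sve() attributes {
hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="aarch64-xyz-xyz", cpu_features="+sve", iree.encoding.resolver = #iree_cpu.cpu_encoding_resolver<>}>
}{
%c0 = arith.constant 0 : index
%0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) alignment(64) offset(%c0) flags(ReadOnly) : !iree_tensor_ext.dispatch.tensor<readonly:tensor<1x1000xf32>>
%1 = hal.interface.binding.subspan layout(#pipeline_layout) binding(1) alignment(64) offset(%c0) : !iree_tensor_ext.dispatch.tensor<writeonly:tensor<1x1000xf32, #encoding>>
%2 = iree_tensor_ext.dispatch.tensor.load %0, offsets = [0, 0], sizes = [1, 1000], strides = [1, 1] : !iree_tensor_ext.dispatch.tensor<readonly:tensor<1x1000xf32>> -> tensor<1x1000xf32>
%3 = iree_encoding.set_encoding %2 : tensor<1x1000xf32> -> tensor<1x1000xf32, #encoding>
iree_tensor_ext.dispatch.tensor.store %3, %1, offsets = [0, 0], sizes = [1, 1000], strides = [1, 1] : tensor<1x1000xf32, #encoding> -> !iree_tensor_ext.dispatch.tensor<writeonly:tensor<1x1000xf32, #encoding>>
return
}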

@banach-space
Copy link
Collaborator

banach-space commented Jul 31, 2025

Can you help add test cases in a follow-up?

Oh yeah, I see what's missing. Let me take care of this.

EDIT: #21560

banach-space added a commit to banach-space/iree that referenced this pull request Aug 1, 2025
Since iree-org#21304, encoding materialisation supports scalable tile sizes.
This patch adds tests for that logic that we missed in the original
revision.

The newly added tests mirror similar tests for X86, with the added
emphasis on scalable tile sizes. Note, ATM we only make the "N"
dimension scalable, which matches SVE. For SME (TODO), we will make both
"N" and "M" dimensions scalable.
banach-space added a commit to banach-space/iree that referenced this pull request Aug 1, 2025
Since iree-org#21304, encoding materialisation supports scalable tile sizes.
This patch adds tests for that logic that we missed in the original
revision.

The newly added tests mirror similar tests for X86, with the added
emphasis on scalable tile sizes. Note, ATM we only make the "N"
dimension scalable, which matches SVE. For SME (TODO), we will make both
"N" and "M" dimensions scalable.

Signed-off-by: Andrzej Warzynski <[email protected]>
banach-space added a commit to banach-space/iree that referenced this pull request Aug 1, 2025
Since iree-org#21304, encoding materialisation supports scalable tile sizes.
This patch adds tests for that logic that we missed in the original
revision.

The newly added tests mirror similar tests for X86, with the added
emphasis on scalable tile sizes. Note, ATM we only make the "N"
dimension scalable, which matches SVE. For SME (TODO), we will make both
"N" and "M" dimensions scalable.

Signed-off-by: Andrzej Warzynski <[email protected]>
banach-space added a commit that referenced this pull request Aug 5, 2025
)

Since #21304, encoding materialisation supports scalable tile sizes.
This patch adds tests for that logic that we missed in the original
revision.

The newly added tests mirror similar tests for X86, with the added
emphasis on scalable tile sizes. Note, ATM we only make the "N"
dimension scalable, which matches SVE. For SME (TODO), we will make both
"N" and "M" dimensions scalable.

---------

Signed-off-by: Andrzej Warzynski <[email protected]>
hhkit pushed a commit to opencompl/iree that referenced this pull request Aug 8, 2025
…e-org#21560)

Since iree-org#21304, encoding materialisation supports scalable tile sizes.
This patch adds tests for that logic that we missed in the original
revision.

The newly added tests mirror similar tests for X86, with the added
emphasis on scalable tile sizes. Note, ATM we only make the "N"
dimension scalable, which matches SVE. For SME (TODO), we will make both
"N" and "M" dimensions scalable.

---------

Signed-off-by: Andrzej Warzynski <[email protected]>
keshavvinayak01 pushed a commit to keshavvinayak01/iree that referenced this pull request Sep 4, 2025
…r mmt4d (iree-org#21304)

This PR enables encoding materialization/specialization for mmt4d. It is
a step towards iree-org#16162. This PR is
the first out of N to follow on that regard.

Currently, materialization and specialization of mmt4d encodings are
supported, although this does not yet work e2e. The current tiling setup
does not yet out-of-the box support scalable inner tiles. The inner tile
sizes that we choose here are supposed to be scalability-agnostic,
although we should definitely go back to this on another issue and
empirically support this.

Generally, this works for device encoding materialization, but I'm not
entirely sure how this would do for host encodings. I'd appreciate your
opinions/guidance on that regard. I'm not entirely sure if we want to
allow vscale ops to reside in host side code.

I've also done some investigation on enabling this e2e, however there
are some issues and I will create separate issues with reproducers for
them ASAIC. But a quick sneak peek:

- Storage size calculation of tensor types with scalable tiles: here, we
need the vscale op to correctly estimate the size of the packed tensor -
which AFAICS ends up in the host side. Tracking issue: iree-org#21317
- Since we're making the N dim scalable, we lose the ability to
const-eval the weights constants and vscale ops end up in initializers,
which AFAICS is not supported right now. I'm not sure if this would be a
problem or not, just need opinions on how to deal with this :) Tracking issue: iree-org#21319
- Tile size and scalable tile flag propagation in dispatches with
pack/unpack/mmt4d: the tile sizes and (more importantly) scalable tile
flag information should be correctly propagated to other ops in the same
dispatch.

---------

Signed-off-by: Ege Beysel <[email protected]>
Signed-off-by: keshavvinayak01 <[email protected]>
keshavvinayak01 pushed a commit to keshavvinayak01/iree that referenced this pull request Sep 4, 2025
…e-org#21560)

Since iree-org#21304, encoding materialisation supports scalable tile sizes.
This patch adds tests for that logic that we missed in the original
revision.

The newly added tests mirror similar tests for X86, with the added
emphasis on scalable tile sizes. Note, ATM we only make the "N"
dimension scalable, which matches SVE. For SME (TODO), we will make both
"N" and "M" dimensions scalable.

---------

Signed-off-by: Andrzej Warzynski <[email protected]>
Signed-off-by: keshavvinayak01 <[email protected]>