[CPU] Switch CPUDoubleTilingExpert pipeline to use IREE::CPU::LoweringConfigAttr. #21354

hanhanW · 2025-07-12T02:06:11Z

The revision switches all the dispatches that use CPUDoubleTilingExpert to IREE::CPU::LoweringConfigAttr, which is a root-based tiling approach. There are two commits in the revision:

- The switch for linalg.matmul dispatches: it mainly focuses on the lowering config changes.
- The switch for linalg.generic dispatches:
  - Changes for lowering configs.
  - Update the pipeline that enumerates all the tiling levels.
  - Update LLVMCPU2DScalableTo1DScalable pass to use IREE::CPU::LoweringConfigAttr.

The pipeline and lowering config verification now applies only to the root op, because CPUDoubleTilingExpert expects exactly four levels of tiling and, after the switch, only the root op has four levels of tiling. The verification logic is legacy code and needs a refresh anyway; a better implementation is easier to write today, so it will be refreshed in a follow-up.

New known issues:

Trailing vector unit dims are not folded away in the root-op based pipeline. They are unrolled instead, which results in larger binary sizes: [CPU] Vector trailing unit dims are not dropped in root-based tiling pipeline (elem + pack). #21420

hanhanW · 2025-07-12T02:08:03Z

It's ready for review; it depends on

hanhanW · 2025-07-12T02:09:16Z

cc @banach-space @egebeysel since there are many SVE/SME changes. They are NFC in terms of e2e execution. It just switches to the new lowering config.

hanhanW · 2025-07-14T20:14:27Z

I know what's happening. The tilingLevel from TilingConfig is an index within [0, number_of_levels), while the pass should apply a TilingLevel enum value. E.g., if the cache tile sizes are not set, tilingLevel 1 should map to TilingLevel::VectorCommonParallelTiles, which is 3.

We can either drop the support for cache-level tiling, which currently performs dummy tiling, or make TileRootAndFuseProducerConsumer take a TilingLevel instead of an int64_t and switch all the lowering configs to the CPU one at the same time. The latter is mostly lit test changes, so it should be fine. I'll try the latter approach.
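The index mismatch described above can be sketched in plain C++. This is only an illustration, not the IREE implementation: the enum mirrors the level names used in this PR, but only the value 3 for VectorCommonParallelTiles is stated in this thread. The other numeric values and the `toTilingLevel` helper are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical mirror of the tiling levels named in this PR. Only
// VectorCommonParallelTiles == 3 is confirmed by the discussion; the other
// values are assumed for illustration.
enum class TilingLevel : int64_t {
  DistributionTiles = 0,
  CacheParallelTiles = 1,
  CacheReductionTiles = 2,
  VectorCommonParallelTiles = 3,
  VectorReductionTiles = 4,
  VectorInnerParallelTiles = 5,
};

// Hypothetical helper: translates a dense index in [0, number_of_levels) into
// the enum value of the level stored at that position. `setLevels` lists, in
// order, the levels that actually carry tile sizes.
std::optional<TilingLevel> toTilingLevel(
    const std::vector<TilingLevel> &setLevels, int64_t denseIndex) {
  if (denseIndex < 0 ||
      static_cast<std::size_t>(denseIndex) >= setLevels.size())
    return std::nullopt;
  return setLevels[static_cast<std::size_t>(denseIndex)];
}
```

With a config that has no cache levels (distribution, vector-common-parallel, vector-reduction, vector-inner-parallel), dense index 1 resolves to VectorCommonParallelTiles rather than to the enum value 1, which is the distinction a pass taking TilingLevel directly would avoid having to make.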

banach-space · 2025-07-14T20:28:18Z

Sorry for not responding earlier. I will review this properly tomorrow. In the meantime, +1 to keeping the cache level tiling for now. I am hoping to properly play with it this quarter. Thanks!

hanhanW · 2025-07-14T20:30:02Z

I see, thanks! I'll ping you when this is ready. I need to work on a few more patches now.

hanhanW · 2025-07-15T22:20:32Z

yay, integration tests are all happy. I can start splitting the changes out. Note that the final PR may involve many components, since I'll need to restructure the pipeline passes a bit and a few of the passes need to learn about IREE::CPU::LoweringConfigAttr.

hanhanW · 2025-07-17T01:31:55Z

I will clean up this PR tomorrow. Most of the changes have been split out:

[CPU] Teach SplitReduction about IREE::CPU::LoweringConfigAttr. #21391
[CPU] Use IREE::CPU::TilingLevel in TileRootAndFuseProducerConsumer pass #21370
[CPU] Get rootOp based on lowering config in TileRootAndFuseProducerConsumer pass. #21394
[CPU] Teach TilingConfig::getVectorTileSizes about CPU lowering config. #21397
[mlir][linalg] Improve linalg.pack consumer fusion. llvm/llvm-project#148993
[CPU] Propagate cache tiling sizes in lowering config propagation. #21410
[CPU] Use lowering config attribute interface in LLVMCPUTileAndFuse. #21405
[mlir][linalg] Handle outer_dims_perm in linalg.pack consumer fusion. llvm/llvm-project#149426
[mlir][linalg] Allow pack consumer fusion if the tile size is greater than dimension size. llvm/llvm-project#149438
[mlir][linalg] Support pack consumer fusion with padding semantic for perfect tiling. llvm/llvm-project#149600
[mlir][linalg] Restrict linalg.pack to not have artificial padding. llvm/llvm-project#149624
Update regression tests to not have artificial padding. #21436
[CPU][NFC] Update pack ops to not carry artificial padding. #21440

hanhanW · 2025-07-18T19:02:20Z

There is a huge compile-time regression in toy_llama (23 sec vs. 3 sec); the root cause is in pack consumer fusion. I'm looking into the issue.

https://gist.github.com/hanhanW/1f125955fb4f2d23871b77a04615d0af

hanhanW · 2025-07-18T22:21:02Z

llvm/llvm-project#149600 fixes the issue. The compile-time is back to 3 seconds.

There are other new issues, but I think we can fix them later: #21420

hanhanW · 2025-07-22T01:36:14Z

This is ready for review, assuming that the upstream semantic change will be accepted. Please take a look at the last two commits:

- The switch for linalg.matmul dispatches: it mainly focuses on the lowering config changes.
- The switch for linalg.generic dispatches:
  - Changes for lowering configs.
  - Update the pipeline that enumerates all the tiling levels.
  - Update LLVMCPU2DScalableTo1DScalable pass to use IREE::CPU::LoweringConfigAttr.

banach-space

Thanks @hanhanW, great clean-up 🙏🏻

I've not touched this code for a while, so I've focused on the tests. The changes mostly make sense, though I do have one question.

banach-space · 2025-07-22T17:52:46Z

compiler/src/iree/compiler/Codegen/LLVMCPU/test/select_aarch64_lowering_strategy.mlir

  return
 }
-//   CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config<tile_sizes = {{\[}}[64, 64, 0], [64, 64, 0], [0, 0, 0], [8, 16, 0], [0, 0, 1], [0, 0, 0]]>
+//   CHECK-DAG: #[[CONFIG:.+]] =  #iree_cpu.lowering_config<cache_parallel = [64, 64, 0], cache_reduction = [0, 0, 0], distribution = [64, 64, 0], vector_common_parallel = [8, 16, 0], vector_inner_parallel = [0, 0, 0], vector_reduction = [0, 0, 1]>


BEFORE:

[0, 0, 1], [0, 0, 0]

AFTER:

vector_inner_parallel = [0, 0, 0], vector_reduction = [0, 0, 1]

Is this correct?

Yes, it is correct. The order in the old lowering config is [[distribution], [vector-common-parallel], [vector-reduction], [vector-inner-parallel]]. The printer sorts the keys in alphabetical order, so vector_inner_parallel is printed before vector_reduction. The values are correct in this case.

iree/compiler/src/iree/compiler/Codegen/Common/TileSizeSelection.h

Lines 23 to 31 in c682836

/// We currently support the following scenarios, if
/// IREE::Codegen::LoweringConfigAttr is used:
/// 1. [[distribution]]
/// 2. [[distribution], [vector-common-parallel]]
/// 3. [[distribution], [vector-common-parallel], [vector-reduction]]
/// 4. [[distribution], [vector-common-parallel], [vector-reduction],
/// [vector-inner-parallel]]
/// 5. [[distribution], [cache-parallel], [cache-reduction],
/// [vector-parallel], [vector-reduction]]
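As a rough illustration of why the printed order differs from the positional order, the sketch below converts a positional tile-size list into a name-keyed map. It is plain C++, not IREE code: `std::map` iterates keys in sorted order and stands in for the printer's alphabetical ordering. The six-entry positional order in `kLegacyOrder` is inferred from the CHECK lines in this thread, and `toNamedConfig` and `printedKeyOrder` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using TileSizes = std::vector<int64_t>;
using NamedConfig = std::map<std::string, TileSizes>;

// Hypothetical helper: attaches level names to a positional list of tile
// sizes. The positional order below is an assumption inferred from the
// before/after CHECK lines for the six-level case.
NamedConfig toNamedConfig(const std::vector<TileSizes> &legacy) {
  static const char *const kLegacyOrder[] = {
      "distribution",           "cache_parallel",   "cache_reduction",
      "vector_common_parallel", "vector_reduction", "vector_inner_parallel"};
  constexpr std::size_t kNumLevels =
      sizeof(kLegacyOrder) / sizeof(kLegacyOrder[0]);
  NamedConfig named;
  for (std::size_t i = 0; i < legacy.size() && i < kNumLevels; ++i)
    named[kLegacyOrder[i]] = legacy[i];
  return named;
}

// Returns the keys in iteration order; std::map keeps them sorted, which
// models a printer that emits keys alphabetically.
std::vector<std::string> printedKeyOrder(const NamedConfig &config) {
  std::vector<std::string> keys;
  for (const auto &entry : config)
    keys.push_back(entry.first);
  return keys;
}
```

Feeding in the six positional entries from the old CHECK line yields keys in the order cache_parallel, cache_reduction, distribution, vector_common_parallel, vector_inner_parallel, vector_reduction, with [0, 0, 1] still attached to vector_reduction.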

It also replaces uses of the `TileAndFuse` pass with the `TileRootAndFuseProducerConsumer` pass, which may impact other dispatches that use DoubleTilingExpert, e.g., generic op dispatches. Signed-off-by: hanhanW <[email protected]>

hanhanW

All the upstream changes are in IREE now, finally. Please take a look, thanks!

jtuyls

LGTM, just a few nits and a question.

compiler/src/iree/compiler/Codegen/LLVMCPU/KernelDispatch.cpp

compiler/src/iree/compiler/Codegen/LLVMCPU/LLVMCPU2DScalableTo1DScalable.cpp

jtuyls · 2025-07-29T04:17:20Z

compiler/src/iree/compiler/Codegen/LLVMCPU/LLVMCPUSelectLoweringStrategy.cpp

    IREE::Codegen::LoweringConfigAttrInterface loweringConfig =
        getLoweringConfig(op);
-    if (!loweringConfig)
+    if (!loweringConfig || !loweringConfig.hasWorkgroupTilingLevel())


Maybe useful here to add a comment on why !loweringConfig.hasWorkgroupTilingLevel() is needed.

Good question, and I will update the PR description. We used to require every lowering config to have four levels of tiling. With the new CPU lowering config, only the root op has the distribution tile sizes, and we use that information to determine which op is the root op.

The pipeline verification expects the lowering config to have four levels of tiling, which is a legacy requirement that is no longer necessary. To keep a decent level of verification, we now only check the lowering config on the root op. That is also sufficient, as most tiling starts from the root op.

I'm going to remove the four levels requirement and refresh the verifier in a follow-up.

I already have a TODO comment here, and I'll refresh the verification logic soon. So I'm not going to update the TODO.
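The root-op rule explained in this reply can be sketched in plain C++. The `OpConfig` struct and `findRootOp` are hypothetical stand-ins, and this `hasWorkgroupTilingLevel` is a simplified model of the interface method in the diff above, assuming that "has a workgroup tiling level" means the config carries distribution tile sizes.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for an op together with its lowering config, keyed by
// tiling-level name.
struct OpConfig {
  std::string opName;
  std::map<std::string, std::vector<int64_t>> levels;
};

// Simplified model (assumption): a config has a workgroup tiling level when
// it carries distribution tile sizes.
bool hasWorkgroupTilingLevel(const OpConfig &config) {
  return config.levels.count("distribution") != 0;
}

// Hypothetical helper: picks the first op whose config has distribution tile
// sizes. Ops with propagated configs (no distribution sizes) are skipped, so
// verification would only look at the returned op.
std::optional<std::string> findRootOp(const std::vector<OpConfig> &ops) {
  for (const OpConfig &op : ops) {
    if (hasWorkgroupTilingLevel(op))
      return op.opName;
  }
  return std::nullopt;
}
```

Under this model, an op whose config only has vector-level entries is never treated as the root, which matches the early exit added by the `!loweringConfig.hasWorkgroupTilingLevel()` check.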

jtuyls · 2025-07-29T04:22:43Z

compiler/src/iree/compiler/Codegen/LLVMCPU/test/select_aarch64_sve_lowering_strategy.mlir

-
-// CHECK-DAG:  #[[CONFIG1:.+]] = #iree_codegen.lowering_config<tile_sizes = {{\[}}[64, 64], [4, [16]], [0, 0], [0, 0]]>
-// CHECK-DAG:  #[[CONFIG2:.+]] = #iree_codegen.lowering_config<tile_sizes = {{\[}}[64, 64, 0], [4, [16], 0], [0, 0, 1], [0, 0, 0]]>
+// CHECK-DAG:  #[[CONFIG1:.+]] = #iree_cpu.lowering_config<vector_common_parallel = [4, [16]], vector_inner_parallel = [0, 0], vector_reduction = [0, 0]>


This one loses the [64, 64] tile sizes information compared with earlier?

Good question, and see the other comment! The distribution tile sizes are only set on the root op, so we don't expect them in the propagated lowering configs.

hanhanW marked this pull request as ready for review July 12, 2025 02:08

hanhanW requested review from MaheshRavishankar and pashu123 as code owners July 12, 2025 02:08

hanhanW requested a review from Max191 July 12, 2025 02:08

hanhanW marked this pull request as draft July 14, 2025 20:14

hanhanW force-pushed the switch-multi-tiling branch from 6d7e8c1 to 237939f Compare July 15, 2025 00:23

hanhanW changed the title ~~[CPU] Switch matmul dispatches to use IREE::CPU::LoweringConfigAttr.~~ [CPU] Switch CPUDoubleTilingExpert pipeline to use IREE::CPU::LoweringConfigAttr. Jul 15, 2025

hanhanW changed the base branch from users/hanhanW/diff-for-switch-multi-tiling to main July 15, 2025 00:26

hanhanW force-pushed the switch-multi-tiling branch 3 times, most recently from 35f78de to 1e9fb9a Compare July 15, 2025 17:22

hanhanW force-pushed the switch-multi-tiling branch from cc6b5d5 to 3f71744 Compare July 16, 2025 00:33

hanhanW force-pushed the switch-multi-tiling branch 4 times, most recently from 9663fb6 to 1c8163b Compare July 18, 2025 17:55

hanhanW force-pushed the switch-multi-tiling branch 2 times, most recently from 2c44177 to f31ac84 Compare July 18, 2025 22:12

hanhanW mentioned this pull request Jul 18, 2025

[CPU] Vector trailing unit dims are not dropped in root-based tiling pipeline (elem + pack). #21420

Open

hanhanW force-pushed the switch-multi-tiling branch 2 times, most recently from 3879a82 to 06356af Compare July 21, 2025 20:42

hanhanW force-pushed the switch-multi-tiling branch 3 times, most recently from 3cf8c44 to 48a527f Compare July 21, 2025 23:52

hanhanW requested a review from banach-space July 21, 2025 23:58

hanhanW marked this pull request as ready for review July 22, 2025 01:36

banach-space reviewed Jul 22, 2025

banach-space requested a review from egebeysel July 22, 2025 18:09

hanhanW mentioned this pull request Jul 23, 2025

[Codegen] compilation fails because of vector size verification error #21359

Closed

hanhanW force-pushed the switch-multi-tiling branch from 48a527f to 8206563 Compare July 23, 2025 18:01

This was referenced Jul 25, 2025

[CPU] adjust CPUPrepareUKernelsPass to accept iree_cpu.lowering #21493

Merged

[CPU] Tile reduction dimensions for non-root reduction ops. #21500

Merged

hanhanW added 2 commits July 28, 2025 14:12

[CPU] Switch CPUDoubleTilingExpert pipeline to use IREE::CPU::Lowerin…

14032b5

…gConfigAttr. Signed-off-by: hanhanW <[email protected]>

hanhanW force-pushed the switch-multi-tiling branch from 8206563 to 14032b5 Compare July 28, 2025 21:13

Increase binary size because of iree-org#21420

ad982b7

Signed-off-by: hanhanW <[email protected]>

hanhanW requested review from banach-space and jtuyls July 28, 2025 23:14

hanhanW commented Jul 28, 2025

jtuyls approved these changes Jul 29, 2025

address comments.

3bc3800

Signed-off-by: hanhanW <[email protected]>

hanhanW merged commit 382c4fa into iree-org:main Jul 29, 2025
44 checks passed

hanhanW deleted the switch-multi-tiling branch July 29, 2025 17:41

Abhishek-Varma mentioned this pull request Aug 8, 2025

[CPU] large vector sizes being formed after shifting to IREE::CPU::LoweringConfigAttr #21633

Open
