
Conversation

@IanWood1
Contributor

@IanWood1 commented Sep 16, 2025

This change allows producers to try to fuse with all consumers. Previously, fusing with multiple consumers was only allowed when the consumers were all truncate ops; that restriction has been removed.

This has some side effects that require a few other accompanying changes:

  1. This PR can produce dispatches with many ops and, consequently, many operands. To prevent forming dispatches with more operands than the runtime can handle, `wouldExceedOperandLimit` was added to cap the operand count at 16 (first sketch after this list).
  2. The golden times for datatiling llama decode were slightly increased. See #22841 ([GPU] VectorDistribute `iree_encoding.set_encoding` performance) for more details.
  3. `options.numIterations = 32` was added so that multi-use elementwise ops are fused more aggressively into the same dispatch, preventing codegen issues.
  4. Changed the check that blocks producer fusion from `IREE::LinalgExt::isBitExtendOp()` to `IREE::Flow::isClonableIntoDispatchOp()`, so that scatter's index producer is cloned into the dispatch rather than fused (second sketch below).
  5. Changed the error to a warning when `IREE::Flow::moveFollowingOpIntoDispatchRegion` fails. This can happen because `hasTransitiveDependencyOnFusionGroup` does not account for ops already moved into dispatch regions. For example, suppose A and B have no use-def relation, A has a transitive dependency on the fusion group, and B does not. Once A and B are placed in the same dispatch, asking whether B has a transitive dependency on the fusion group must also consider every op in that dispatch, which is currently unaccounted for (third sketch below). Side note: this change was supposed to be made in #22708 ([Dispatch Creation] Don't fuse uses from above), but I think I merged without actually making it.
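
To make item 1 concrete, here is a minimal sketch of what an operand-limit guard like `wouldExceedOperandLimit` can look like. The helper name (suffixed `Sketch`), the counting logic, and the exact cap handling are illustrative assumptions, not the actual IREE implementation:

```cpp
#include "llvm/ADT/SetVector.h"
#include "mlir/IR/Operation.h"

// Illustrative cap mirroring the limit of 16 described above.
constexpr unsigned kMaxOperandCount = 16;

// Returns true if fusing `producer` into `dispatchOp` would leave the
// merged dispatch capturing more external values (i.e. future dispatch
// operands) than the runtime can handle.
static bool wouldExceedOperandLimitSketch(mlir::Operation *dispatchOp,
                                          mlir::Operation *producer) {
  llvm::SetVector<mlir::Value> externalValues;
  auto collect = [&](mlir::Operation *root) {
    root->walk([&](mlir::Operation *nested) {
      for (mlir::Value operand : nested->getOperands()) {
        mlir::Operation *def = operand.getDefiningOp();
        // Block arguments and values defined outside both ops would
        // become operands of the merged dispatch.
        if (!def ||
            (!dispatchOp->isAncestor(def) && !producer->isAncestor(def)))
          externalValues.insert(operand);
      }
    });
  };
  collect(dispatchOp);
  collect(producer);
  return externalValues.size() > kMaxOperandCount;
}
```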
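Item 4 is conceptually a one-line predicate swap in the producer-fusion legality check. Only the two predicate names come from this change; the surrounding function is an assumed sketch and presumes the IREE DispatchCreation headers are available:

```cpp
// Assumed shape of the legality check; not the exact IREE code.
static bool canFuseWithConsumers(mlir::Operation *producer) {
  // Before: only bit-extend producers were kept out of fusion.
  //   if (IREE::LinalgExt::isBitExtendOp(producer)) return false;
  // After: any producer that should instead be cloned into the dispatch
  // (which covers scatter's index producer) is kept out of fusion.
  if (IREE::Flow::isClonableIntoDispatchOp(producer))
    return false;
  return true;
}
```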
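For item 5, the sketch below shows why a per-op dependency walk is not enough once ops have been grouped; `dependsOnFusionGroup` is an illustrative stand-in for `hasTransitiveDependencyOnFusionGroup`, not the real helper:

```cpp
#include "llvm/ADT/DenseSet.h"
#include "llvm/ADT/SmallVector.h"
#include "mlir/IR/Operation.h"

// Walks use-def chains from `op` alone and reports whether it reaches
// any op in the fusion group.
static bool dependsOnFusionGroup(
    mlir::Operation *op, const llvm::DenseSet<mlir::Operation *> &group) {
  llvm::SmallVector<mlir::Operation *> worklist = {op};
  llvm::DenseSet<mlir::Operation *> visited;
  while (!worklist.empty()) {
    mlir::Operation *current = worklist.pop_back_val();
    if (!visited.insert(current).second)
      continue;
    if (group.contains(current))
      return true;
    for (mlir::Value operand : current->getOperands())
      if (mlir::Operation *def = operand.getDefiningOp())
        worklist.push_back(def);
  }
  return false;
}

// The pitfall: dependsOnFusionGroup(B, group) can be false while
// dependsOnFusionGroup(A, group) is true. Once A and B share a dispatch
// region, the region as a whole depends on the fusion group, so acting
// on B's answer alone can produce an invalid move -- hence downgrading
// the failure from an error to a warning.
```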

Related: #22528
Closes: #22462

ci-extra: test_torch

@IanWood1 force-pushed the enable_fuse_all_multi_use branch 2 times, most recently from 978db84 to 474562e (September 18, 2025 19:42)
@IanWood1 force-pushed the enable_fuse_all_multi_use branch from 9cecc04 to ebfe3c7 (October 6, 2025 20:59)
@IanWood1 force-pushed the enable_fuse_all_multi_use branch 2 times, most recently from d5cf1a2 to 34db4ca (November 3, 2025 20:11)
@IanWood1 force-pushed the enable_fuse_all_multi_use branch from 34db4ca to cbd1c3b (November 5, 2025 22:41)
@IanWood1 force-pushed the enable_fuse_all_multi_use branch 2 times, most recently from aba7406 to ec408a4 (December 1, 2025 20:09)
@IanWood1 force-pushed the enable_fuse_all_multi_use branch from a827a1e to d69930d (December 4, 2025 18:41)
@IanWood1
Contributor Author

IanWood1 commented Dec 4, 2025

#22799 didn't seem to fix the issue; the reduction dispatch gets much slower when an additional result is added. Here are the before and after (slow).

Update: I created #22841 and I'm just going to increase the golden times until that is resolved.

@IanWood1
Contributor Author

IanWood1 commented Dec 8, 2025

Ignore the failure on PkgCI / Test Torch / torch_models tests :: amdgpu_mi325_gfx942 (pull_request); it's being fixed by #22855.

@IanWood1 marked this pull request as ready for review December 8, 2025 18:55
Collaborator

@MaheshRavishankar left a comment


Long time coming!

@IanWood1 merged commit fd4ff2b into iree-org:main Dec 8, 2025
49 of 52 checks passed
IanWood1 added a commit that referenced this pull request Dec 9, 2025
Lowers golden dispatch counts to reflect the expected numbers after #22011

ci-extra: test_torch

Signed-off-by: Ian Wood <[email protected]>
keshavvinayak01 pushed a commit that referenced this pull request Jan 27, 2026
keshavvinayak01 pushed a commit that referenced this pull request Jan 27, 2026