Tile perf enhancements - continued #6561
Merged
Description:
#6376 introduced an optimization to the Tile kernels to process inputs where the net tiling effect is just multiple copies of the input buffer.
For example:
input shape = [1, 1, 256 * 50]
repeats = [1, 200, 1]
output shape = [1, 200, 256 * 50]
This worked well when no batching was involved, but the optimization did not kick in once a batch dimension was introduced.
As a slight extension, handle batching in this optimization.
For example:
input shape = [5, 1, 256 * 50]
repeats = [1, 200, 1]
output shape = [5, 200, 256 * 50]
In this case, we would copy each of the 5 sub-tensors in the batch 200 times.
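To make the batched copy pattern concrete, here is a minimal sketch in plain Python. It is not the kernel code from this PR: the function name `tile_batched_copies` and its parameters are hypothetical, and it assumes the input is a flat, row-major buffer of shape [batch, 1, inner] tiled with repeats [1, repeat, 1].

```python
def tile_batched_copies(flat_input, batch, inner, repeat):
    """Tile a flat [batch, 1, inner] buffer into a flat [batch, repeat, inner] buffer.

    Hypothetical helper for illustration only; it mirrors the idea of
    copying each batch's sub-tensor `repeat` times as one contiguous block.
    """
    if len(flat_input) != batch * inner:
        raise ValueError("flat_input length must equal batch * inner")

    output = [None] * (batch * repeat * inner)
    out_pos = 0
    for b in range(batch):
        # Contiguous sub-tensor belonging to batch entry b.
        block = flat_input[b * inner:(b + 1) * inner]
        for _ in range(repeat):
            # One block copy per repeat (a memcpy in a native kernel).
            output[out_pos:out_pos + inner] = block
            out_pos += inner
    return output


# Small stand-in for input shape [2, 1, 2] with repeats [1, 3, 1].
tiled = tile_batched_copies([1, 2, 3, 4], batch=2, inner=2, repeat=3)
```

With batch = 1 this reduces to the non-batched case from #6376, where the whole input buffer is copied `repeat` times.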
Improves the perf of a 1PP model by ~30% (95th percentile) when the batch size is 5.
Motivation and Context
Performance