[Performance] Add the support of tensor of pointer in the prefetching and loop pipelining #3634
chengjunlu wants to merge 15 commits into main
Conversation
Force-pushed from ab40c73 to ddd2c17
Force-pushed from 2a5b53e to f6b90ca
Can you please show the performance impact in % instead of the new TFlops?
whitneywhtsang left a comment
We may want to spend some time on code refactoring; similar code patterns seem to be repeating.
Force-pushed from f6b90ca to 6334ed5
Note: this change is split from #3634. Signed-off-by: Whitney Tsang <whitney.tsang@intel.com>
This PR limits prefetching to dense memory only, to avoid polluting the cache. Benchmark CI: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/14647344471 No performance impact on key benchmarks. Note: this change comes partially from #3634. --------- Signed-off-by: Whitney Tsang <whitney.tsang@intel.com>
Force-pushed from 3b523ab to 1f05df2
This PR adds a new argument `mask` to the prefetch operation. It prepares for handling prefetching of tensors of pointers, as loads from tensors of pointers can be masked. Note: this change comes partially from #3634. --------- Signed-off-by: Whitney Tsang <whitney.tsang@intel.com>
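For context on why the prefetch operation needs a mask: in Triton, a load from a tensor of pointers takes an optional mask (e.g. `tl.load(ptrs, mask=m)`), so a prefetch of the same addresses must honor the same mask to avoid touching out-of-bounds memory. A minimal NumPy emulation of such a masked gather (illustrative only; the array names and shapes are invented here, this is not the Triton API):

```python
import numpy as np

# Emulate a "tensor of pointers": each element is an index (address)
# into a flat memory buffer.
memory = np.arange(100, dtype=np.float32)
ptrs = np.array([[0, 1, 200], [3, 4, 500]])  # some pointers out of bounds
mask = ptrs < memory.size                     # boundary-check mask

# Masked gather: only dereference pointers where the mask is true,
# mirroring how a masked prefetch must skip invalid addresses.
safe_ptrs = np.where(mask, ptrs, 0)
values = np.where(mask, memory[safe_ptrs], 0.0)
```

Masked-off lanes yield the fallback value instead of faulting, which is exactly the behavior the `mask` operand lets the prefetch lowering preserve.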
Force-pushed from 2a0f6b8 to 4bbb4f9
Force-pushed from 75eb507 to ad37157
Force-pushed from 24ea0fa to 08e88f3
… operation. Add a mask operand for boundary check.
Signed-off-by: Whitney Tsang <whitney.tsang@intel.com>
Signed-off-by: Tiotto, Ettore <ettore.tiotto@intel.com>
Force-pushed from 08e88f3 to e3d441a
Signed-off-by: Tiotto, Ettore <ettore.tiotto@intel.com>
…r of pointers (#4064) Loads with tensor-of-pointers operands that have been proven to reference memory in row-major order (and are contained in an scf.for loop) are now prefetched using 2D prefetch intrinsics. Note: this PR is derived from PR #3634 --------- Signed-off-by: Tiotto, Ettore <ettore.tiotto@intel.com>
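The 2D prefetch path only applies to access patterns proven to be row major. A toy sketch of what such a check amounts to for a 2D strided pattern (the helper name and the stride representation are hypothetical; the actual analysis operates on Triton IR, not on Python lists):

```python
def is_row_major(strides):
    """A 2D access pattern is row major when the innermost (last)
    dimension is contiguous (stride 1) and the outer stride spans
    at least one element. Hypothetical illustration only."""
    return len(strides) == 2 and strides[1] == 1 and strides[0] >= 1

# An [M, K] row-major tile has strides [K, 1].
assert is_row_major([128, 1])
assert not is_row_major([1, 128])  # column major: inner dim not contiguous
```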
// Swap the shape to make it row major and then get the tiling
// size based on the row major shape.
std::swap(tensorShape[0], tensorShape[1]);
tensorType = RankedTensorType::get(
This change will cause GEMM with A transpose to fail.
Yup, just for @chengjunlu to know which changes are not merged from the original change.
Where is the failing case? I'd like to check the bug.
Likely it is the one reported in https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/14777632147/job/41503097292.
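For readers following the thread above: the `std::swap` in the snippet re-describes the tensor shape in row-major order before tile sizes are computed, which is what breaks when the A operand is transposed. A toy Python illustration of the swap itself (plain Python, not the MLIR code; the shape values are invented):

```python
# Toy illustration of the shape swap performed before tiling: a tensor
# described as [dim0, dim1] in column-major order is re-described in
# row-major order by swapping its two dimensions.
tensor_shape = [64, 128]
tensor_shape[0], tensor_shape[1] = tensor_shape[1], tensor_shape[0]
# Tiling is then computed on the row-major shape [128, 64].
```

Unconditionally swapping is only valid when the original layout really is column major, which is consistent with the reviewer's observation that a transposed A operand fails.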
Close this PR. The remaining changes are in the new #4611
This is the first implementation of prefetching the memory referred to by a tensor of pointers.
There are no regressions, and the targeted workload (gemm-tensor-of-ptr) shows a performance improvement close to 100% (geomean from 17.1 TFlops to 33 TFlops).