[Performance] Add the support of tensor of pointer in the prefetching and loop pipelining #3634
chengjunlu wants to merge 15 commits into main
Conversation
Force-pushed from ab40c73 to ddd2c17
Force-pushed from 2a5b53e to f6b90ca
Can you please show the performance impact in % instead of the new TFlops?
whitneywhtsang left a comment
We may want to spend some time on code refactoring; similar code patterns seem to be repeating.
Force-pushed from f6b90ca to 6334ed5
Note: this change is split from #3634. Signed-off-by: Whitney Tsang <whitney.tsang@intel.com>
This PR limits prefetching to dense memory only, to avoid polluting the cache. Benchmark CI: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/14647344471 No performance impact on key benchmarks. Note: this change comes partially from #3634. --------- Signed-off-by: Whitney Tsang <whitney.tsang@intel.com>
Force-pushed from 3b523ab to 1f05df2
This PR adds a new argument `mask` to the prefetch operation. It prepares for handling prefetching of tensors of pointers, as loads from tensors of pointers can be masked. Note: this change comes partially from #3634. --------- Signed-off-by: Whitney Tsang <whitney.tsang@intel.com>
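For context on why the prefetch operation needs a mask: in Triton, a load from a tensor of pointers takes an optional mask (e.g. `tl.load(ptrs, mask=m)`), so a prefetch of the same addresses must honor the same mask to avoid touching out-of-bounds memory. A minimal NumPy emulation of such a masked gather (illustrative only; the array names and shapes are invented here, this is not the Triton API):

```python
import numpy as np

# Emulate a "tensor of pointers": each element is an index (address)
# into a flat memory buffer.
memory = np.arange(100, dtype=np.float32)
ptrs = np.array([[0, 1, 200], [3, 4, 500]])  # some pointers out of bounds
mask = ptrs < memory.size                     # boundary-check mask

# Masked gather: only dereference pointers where the mask is true,
# mirroring how a masked prefetch must skip invalid addresses.
safe_ptrs = np.where(mask, ptrs, 0)
values = np.where(mask, memory[safe_ptrs], 0.0)
```

Masked-off lanes yield the fallback value instead of faulting, which is exactly the behavior the `mask` operand lets the prefetch lowering preserve.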
Force-pushed from 2a0f6b8 to 4bbb4f9
Force-pushed from 75eb507 to ad37157
Force-pushed from 24ea0fa to 08e88f3
… operation. Add a mask operand for boundary check.
Signed-off-by: Whitney Tsang <whitney.tsang@intel.com>
Signed-off-by: Tiotto, Ettore <ettore.tiotto@intel.com>
Force-pushed from 08e88f3 to e3d441a
Signed-off-by: Tiotto, Ettore <ettore.tiotto@intel.com>
…r of pointers (#4064) Loads with tensor-of-pointers operands that have been proven to reference memory in row-major order (and are contained in an scf.for loop) are now prefetched using 2D prefetch intrinsics. Note: this PR is derived from PR #3634 --------- Signed-off-by: Tiotto, Ettore <ettore.tiotto@intel.com>
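The 2D prefetch path only applies to access patterns proven to be row major. A toy sketch of what such a check amounts to for a 2D strided pattern (the helper name and the stride representation are hypothetical; the actual analysis operates on Triton IR, not on Python lists):

```python
def is_row_major(strides):
    """A 2D access pattern is row major when the innermost (last)
    dimension is contiguous (stride 1) and the outer stride spans
    at least one element. Hypothetical illustration only."""
    return len(strides) == 2 and strides[1] == 1 and strides[0] >= 1

# An [M, K] row-major tile has strides [K, 1].
assert is_row_major([128, 1])
assert not is_row_major([1, 128])  # column major: inner dim not contiguous
```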
// Swap the shape to make it row major and then get the tiling
// size based on the row major shape.
std::swap(tensorShape[0], tensorShape[1]);
tensorType = RankedTensorType::get(
This change will cause GEMM with A transpose to fail.
Yup, just for @chengjunlu to know which changes are not merged from the original change.
Where is the failing case? I'd like to check the bug.
Likely it is the one reported in https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/14777632147/job/41503097292.
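For readers following the thread above: the `std::swap` in the snippet re-describes the tensor shape in row-major order before tile sizes are computed, which is what breaks when the A operand is transposed. A toy Python illustration of the swap itself (plain Python, not the MLIR code; the shape values are invented):

```python
# Toy illustration of the shape swap performed before tiling: a tensor
# described as [dim0, dim1] in column-major order is re-described in
# row-major order by swapping its two dimensions.
tensor_shape = [64, 128]
tensor_shape[0], tensor_shape[1] = tensor_shape[1], tensor_shape[0]
# Tiling is then computed on the row-major shape [128, 64].
```

Unconditionally swapping is only valid when the original layout really is column major, which is consistent with the reviewer's observation that a transposed A operand fails.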
Close this PR. The remaining changes are in the new #4611
This is the first implementation of prefetching the memory referred to by a tensor of pointers.
There are no regressions, and the targeted workload (gemm-tensor-of-ptr) shows a performance improvement close to 100% (geomean from 17.1 TFlops to 33 TFlops).