[AMD] Warp-pipeline: back-to-back loop optimization & flat (unrolled) pipeline support by jungpark-mlir · Pull Request #9929 · triton-lang/triton

jungpark-mlir · 2026-04-05T20:36:08Z

Summary

Flat (unrolled) pipeline support: Extend ConvertWarpPipeline to handle sequences of scf.execute_region ops outside scf.for (produced by WarpPipeliner::createFlatPipeline). Emit pre-barrier, phase shift, cluster barriers, and reconverge around them.
Eliminate redundant barriers between back-to-back pipelines: When two warp-pipelined regions are adjacent with no intervening operations, the post-loop reconverge + prelude barrier + phase shift cancel out. The phase from the first pipeline carries over naturally.
Cross-pipeline LDS dependency analysis: Before eliminating boundary barriers, verify that no uncovered LDS hazard exists at the merge point. Concatenates cluster infos from both pipelines and runs analyzePipelineDependencies on the merged sequence. Skips the optimization when a dependency is found.
Adjacent-stage dependency check: Add a distance-1 check to analyzePipelineDependencies. The existing loop only checked pairs at distance 2+, so consecutive clusters sharing an LDS allocation never got a LOCAL barrier — causing ModuleMembarAnalysis to insert a redundant ttg.barrier local inside the pipeline.
Refactors: Extract analyzePipelineDependencies, emitPipelinePrelude/Postlude, and emitClusterBarrier helpers.

Test plan

back_to_back_cross_dep_kept: shared-buffer RAW at boundary → barriers kept
back_to_back_no_dep_elimination: loop B has no LDS → barriers eliminated
back_to_back_dep_covered_elimination: 3-stage loop A with internal barrier covering the cross-pipeline dep → barriers eliminated
back_to_back_for_then_flat: pipelined loop + flat pipeline → barriers eliminated
adjacent_stage_lds_dep: 3-stage pipeline verifying LOCAL barrier between adjacent stages with RAW dependency
flat_pipeline_existing_barrier: pre-existing async_wait wrapped with sched_barrier
Existing 2-stage and 3-stage pipeline tests updated

…d loops

When two warp-pipelined loops execute consecutively, ConvertWarpPipeline previously emitted a full reconverge/re-phase-shift/pre-barrier sequence between them:

    scf.for { loop 1 }
    cond_barrier(warpLow)    ← post-loop reconverge
    ttg.barrier local        ← pre-barrier for loop 2
    cond_barrier(warpHigh)   ← pre-loop phase shift
    scf.for { loop 2 }

The post-loop reconverge and pre-loop phase shift are complementary predicates on the same counter-based S_BARRIER, so they cancel out. The intervening ttg.barrier local is redundant when loop 1's wrap-around cluster barrier already includes a local fence (i.e. the dependency analysis determined an LDS read/write hazard exists across the wrap-around point). In that case, all pending LDS writes are already resolved before loop 1 yields, and ModuleMembarAnalysis will not need to insert additional barriers between the loops.

This patch adds a post-processing pass (eliminateRedundantCondBarriers) that detects this pattern and erases the three redundant ops, reducing the barrier overhead to:

    scf.for { loop 1 }
    scf.for { loop 2 }
    cond_barrier(warpLow)    ← final reconverge only

The pass runs after all scf.for loops have been converted (patternFor) but before execute_regions are inlined (patternInline), preserving the scf.for / cond_barrier adjacency needed for pattern matching.

Also updates the f16_gemm_warp_pipeline_gfx1250.py example to use range() (producing scf.for) instead of static_range() (which unrolls at the Python level) for the epilogue loop, and wraps its stages in warp_pipeline_stage annotations so the back-to-back optimization can apply.
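The pattern match described above can be illustrated with a toy model. This is a minimal sketch, not the pass itself: ops are plain strings, the function name `eliminate_redundant_cond_barriers` and the `boundary_is_safe` flag are hypothetical stand-ins, and the real pass works on MLIR ops and decides safety from the dependency analysis.

```python
# Toy model (assumption): each op is a plain string; real IR ops carry
# operands, regions, and attributes.
REDUNDANT_TRIPLET = [
    "cond_barrier(warpLow)",   # post-loop reconverge
    "ttg.barrier local",       # pre-barrier for the next pipeline
    "cond_barrier(warpHigh)",  # pre-loop phase shift
]


def eliminate_redundant_cond_barriers(ops, boundary_is_safe=True):
    """Drop the triplet when it sits directly between two pipelined loops.

    boundary_is_safe is a hypothetical flag standing in for the LDS
    dependency check that the real pass performs at the boundary.
    """
    out = []
    i = 0
    while i < len(ops):
        between_loops = (
            bool(out)
            and out[-1].startswith("scf.for")
            and ops[i:i + 3] == REDUNDANT_TRIPLET
            and i + 3 < len(ops)
            and ops[i + 3].startswith("scf.for")
        )
        if between_loops and boundary_is_safe:
            i += 3  # skip all three ops; the phase carries over
            continue
        out.append(ops[i])
        i += 1
    return out


before = [
    "ttg.barrier local",
    "cond_barrier(warpHigh)",
    "scf.for { loop 1 }",
    "cond_barrier(warpLow)",
    "ttg.barrier local",
    "cond_barrier(warpHigh)",
    "scf.for { loop 2 }",
    "cond_barrier(warpLow)",
]
after = eliminate_redundant_cond_barriers(before)
```

In this model only the prelude of loop 1 and the final reconverge after loop 2 survive; passing `boundary_is_safe=False` leaves the sequence untouched.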

Extend the warp-pipeline infrastructure to handle loops unrolled at the Python level (e.g. via static_range/ttgl.static_range). Previously, warp-pipelining only worked with scf.for loops. Unrolled loops produce flat sequences of border markers in the IR that were silently ignored.

Three main changes:

1. WarpPipeliner: add createFlatPipeline()

   Scans each block for triton.warp_pipeline.border markers outside scf.for. Groups the operations between borders into clusters and wraps each in an scf.execute_region with triton.warp_pipeline.stage, triton.warp_pipeline.priority, and no_inline attributes, the same representation createPipeline() produces for loop bodies.

2. ConvertWarpPipeline: add processUnrolledPipelineRegions() + emitPipelinedFlat()

   After the existing patternFor converts scf.for loops, this new pass walks each function block for contiguous sequences of flat scf.execute_region ops (with triton.warp_pipeline.stage). For each sequence it emits the full barrier structure: pre-barrier, phase shift (cond_barrier warpHigh), linear dependency analysis for cluster barriers (no wrap-around since the sequence is finite), priority management (s_setprio), and post-sequence reconverge (cond_barrier warpLow). The execute_regions are then inlined by the existing InlineWarpPipelineExecuteRegionPattern.

   Also extends eliminateRedundantCondBarriers() to handle the case where a pipelined scf.for is immediately followed by a flat pipeline (instead of only scf.for → scf.for). When the first loop's wrap-around barrier includes a local fence, the intervening reconverge + pre-barrier + phase-shift are redundant and eliminated.

3. Gluon frontend: assert warp_pipeline_stage is inside a for loop

   Since the compiler now supports flat border markers, there is a risk that users place warp_pipeline_stage outside any loop, which has no meaningful pipelining semantics. A for_loop_depth counter is added to GluonSemantic and incremented/decremented in code_generator's visit_For (covering both range and static_range). warp_pipeline_stage asserts for_loop_depth > 0 at exit.

The f16 GEMM example kernel is updated to use ttgl.static_range for the epilogue loop, exercising the new flat pipeline path end-to-end.

Lit tests added for both WarpPipeliner (flat_pipeline_example) and ConvertWarpPipeline (flat_pipeline_backend, back_to_back_for_then_flat).
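The clustering step in createFlatPipeline() can be pictured with a small toy model. This is only a sketch under stated assumptions: ops are strings, `create_flat_pipeline` is a hypothetical name, each border is assumed to close the stage that precedes it, and ops after the last border are assumed to stay outside the pipeline. The "fewer than two stages" guard follows the behaviour described in the review discussion.

```python
BORDER = "triton.warp_pipeline.border"


def create_flat_pipeline(block_ops):
    """Group the ops between border markers into stages (toy model)."""
    stages, current = [], []
    for op in block_ops:
        if op == BORDER:
            if current:
                stages.append(current)  # the border closes the current stage
            current = []
        else:
            current.append(op)
    # Assumption: ops left in `current` (after the last border) are not
    # part of the pipeline.
    if len(stages) < 2:
        return None  # fewer than two stages: no pipeline is emitted
    return [
        {"triton.warp_pipeline.stage": i, "no_inline": True, "ops": ops}
        for i, ops in enumerate(stages)
    ]


pipeline = create_flat_pipeline(
    ["load_a", "load_b", BORDER, "dot", BORDER, "store"]
)
single = create_flat_pipeline(["dot", BORDER])
```

Each returned dictionary stands in for one scf.execute_region; the priority attribute is left out of the sketch for brevity.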

Factor out the duplicated pre-barrier + phase-shift setup and the post-pipeline reconverge logic from emitPipelinedFor and emitPipelinedFlat into shared helpers emitPipelinePrelude and emitPipelinePostlude. NFC.

Unify the duplicated pairwise dependency analysis from emitPipelinedFor (circular/wrap-around) and emitPipelinedFlat (linear) into a single analyzePipelineDependencies function parameterized by `bool circular`. NFC.

ThomasRaoux · 2026-04-06T05:00:39Z

+        # Warp-pipelining is a loop optimization: stages must be declared
+        # inside a for loop (range or static_range).  Allowing stages outside
+        # a loop would produce border markers with no well-defined iteration
+        # structure, breaking the phase-shift/reconvergence contract.
+        assert getattr(self._semantic, 'for_loop_depth', 0) > 0, ("warp_pipeline_stage must be used inside a for loop "
+                                                                  "(range or static_range)")


can we verify that on IR instead?

jungpark-mlir

The for_loop_depth check exists because static_range unrolls at the Python level - by the time the compiler sees the IR, there's no loop structure left, so we can't determine from the IR alone whether the code originated from a loop. For range (dynamic), we could simply check whether the parent op is scf.for, but static_range has no such anchor. Tracking depth for both loop types uniformly looked simpler than special-casing static_range alone.

For context: the compiler already correctly groups contiguous warp_pipeline_stage blocks and drops sequences with fewer than two stages, regardless of whether they came from a loop. However, warp-pipelining is only beneficial inside a loop - the phase-shift overhead amortizes over iterations. Outside a loop, it adds barrier overhead with no pipelining gain. This check is a conservative guard rail against accidental misuse, not a correctness requirement.

To summarize, the other options are: 1) separate checks for static and dynamic range, or 2) no check at all.

ThomasRaoux

it seems weird that something works with static loops but still depends on the loop.

jungpark-mlir

You're right - if the compiler handles flat sequences correctly regardless of origin, there's no reason to enforce loop context at the frontend. The natural guard (< 2 stages → no pipeline emitted) is sufficient. I'll remove the for_loop_depth assertion.

…rier exists

emitPipelinedFlat unconditionally inserted a new cluster barrier (s_barrier) at every stage boundary, ignoring pre-existing barrier ops (e.g., async_wait) between execute_regions. This produced two barriers at the same boundary.

Mirror the emitPipelinedFor logic: scan between consecutive stages for existing barrier ops and wrap them with sched_barriers instead of inserting a new one.
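A toy model of the fix, for illustration only. The function name `emit_flat_cluster_barriers`, the string encoding of ops, and the `EXISTING_BARRIER_OPS` set are assumptions of this sketch; the real pass also consults the dependency analysis when choosing the kind of cluster barrier.

```python
# Assumption: ops that already act as a barrier between two stages.
EXISTING_BARRIER_OPS = {"async_wait"}


def emit_flat_cluster_barriers(seq):
    """Insert one barrier per stage boundary, reusing existing ones.

    seq holds stage names (strings starting with "stage") interleaved
    with any pre-existing ops.
    """
    out = []
    seen_stage = False       # have we emitted at least one stage yet?
    boundary_has_barrier = False  # pre-existing barrier since last stage?
    for op in seq:
        if op in EXISTING_BARRIER_OPS:
            # Reuse the existing barrier, fenced by sched_barriers.
            out += ["sched_barrier", op, "sched_barrier"]
            boundary_has_barrier = True
        elif op.startswith("stage"):
            if seen_stage and not boundary_has_barrier:
                out.append("s_barrier")  # no barrier yet: insert one
            out.append(op)
            seen_stage = True
            boundary_has_barrier = False
        else:
            out.append(op)
    return out


result = emit_flat_cluster_barriers(
    ["stage0", "async_wait", "stage1", "stage2"]
)
```

The boundary between stage0 and stage1 keeps only the wrapped async_wait, while the boundary between stage1 and stage2 gets a fresh s_barrier.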

Two changes to analyzePipelineDependencies and eliminateRedundantCondBarriers:

1. Adjacent-stage check: the inner loop previously started at distance 2 (next = src + 2 + offset), so consecutive clusters sharing an LDS allocation never got a LOCAL barrier. Add a preliminary loop that checks clusterInfo[src] against clusterInfo[src+1] and sets bars[src+1] when they intersect. This prevents ModuleMembarAnalysis from inserting a redundant ttg.barrier local inside the pipeline.

2. Cross-pipeline analysis: when eliminating redundant cond_barriers between back-to-back pipelines, run analyzePipelineDependencies on the merged cluster sequence to verify no LDS hazard exists at the boundary. If the boundary needs a barrier (adjacent or distance-2+), the optimization is skipped.

Lit tests:
- back_to_back_cross_dep_kept: shared-buffer RAW at boundary → kept
- back_to_back_no_dep_elimination: loop B has no LDS → eliminated
- back_to_back_dep_covered_elimination: 3-stage loop A with internal barrier covering the cross-pipeline dep → eliminated
- adjacent_stage_lds_dep: 3-stage pipeline verifying LOCAL barrier between adjacent stages with RAW dependency

collectNextPipelineClusters stopped at the first intervening sched_barrier / cluster barrier for flat pipelines, so only b_0 was ever visible to isCrossPipelineSafe. A cross-pipeline LDS dep involving a later flat stage (b_1, b_2, ...) was missed and the boundary cond_barrier / prelude ttg.barrier local / phase-shift cond_barrier triplet could be wrongly eliminated. Split the collection into collectLoopClusters / collectFlatClusters and walk past intra-pipeline glue (sched_barrier, s_setprio, cluster barriers, pre-existing async waits) so every flat stage is collected. Also thread B's materialized barrier flags into isCrossPipelineSafe so the merged analysis sees B's actual internal LOCAL barriers instead of relying on re-discovery from all-false placeholders. Add a lit test (@cross_pipeline_dep_in_b1) that fails without the fix.
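The difference between stopping at the first glue op and walking past it can be shown with a toy walk. This is a sketch under assumptions: ops are strings, stages are tagged with a hypothetical `stage:` prefix, and `GLUE_OPS` is an illustrative list of the intra-pipeline ops named above.

```python
# Assumption: op names that count as intra-pipeline glue in this toy model.
GLUE_OPS = {"sched_barrier", "s_setprio", "s_barrier", "async_wait"}


def collect_flat_clusters(ops, start=0):
    """Collect every flat stage, walking past glue between stages."""
    clusters = []
    i = start
    while i < len(ops):
        op = ops[i]
        if op.startswith("stage:"):
            clusters.append(op)
        elif op not in GLUE_OPS:
            break  # first op that is neither stage nor glue ends the walk
        i += 1
    return clusters


ops = [
    "stage:b_0",
    "sched_barrier",
    "s_barrier",
    "sched_barrier",
    "stage:b_1",
    "s_setprio",
    "stage:b_2",
    "cond_barrier(warpLow)",
    "stage:c_0",
]
collected = collect_flat_clusters(ops)
```

A walk that stopped at the first `sched_barrier` would return only `stage:b_0`, which is the situation the commit fixes.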

Fold the adjacent (distance-1) and longer-distance phases of the warp-pipeline LDS dependency analysis into one loop that sweeps `dist` from 1 to maxDist. A single `wrap()` helper handles modular arithmetic, and a single `isCovered()` lambda replaces the two near-identical `isSynced` bodies for circular and linear modes. Also drop the redundant final iteration in circular mode (the old `offset == N - 1` step corresponds to `dist == 1` after wrap and only re-walked already-handled adjacent pairs). Behavior is preserved: same `(src, dst)` pairs are visited in the same order, `barrierLoc` resolves to the same slot ((dist == 1) ? dst : wrap(dst - 1)), and the coverage walk inspects the same `(src, barrierLoc]` range. No bar pattern changes. Add a thorough doc block describing the pipeline layout, the goal, the placement choice, the coverage check semantics, and the iteration order, since the conventions are easy to miss.
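The unified sweep can be sketched as follows. This is a simplified model, not the actual C++: each cluster is reduced to a set of LDS buffer names (the real analysis distinguishes reads from writes), the loop nesting order and the value of `max_dist` are assumptions, and only `wrap`, the coverage check over `(src, barrierLoc]`, and the `barrierLoc` rule are taken from the description above.

```python
def analyze_pipeline_dependencies(clusters, circular):
    """Toy distance sweep; returns bars, where bars[i] marks a LOCAL
    barrier at slot i."""
    n = len(clusters)
    bars = [False] * n
    # Assumption: circular pipelines look up to one full lap ahead,
    # linear ones cannot look past the last stage.
    max_dist = n if circular else n - 1

    def wrap(i):
        return i % n

    def is_covered(src, barrier_loc):
        # True if a barrier is already set anywhere in (src, barrier_loc].
        i = src
        while True:
            i = wrap(i + 1)
            if bars[i]:
                return True
            if i == barrier_loc:
                return False

    for dist in range(1, max_dist + 1):
        for src in range(n):
            dst = src + dist
            if not circular and dst >= n:
                continue  # linear mode has no wrap-around pairs
            dst = wrap(dst)
            if not (clusters[src] & clusters[dst]):
                continue  # no shared LDS buffer, so no hazard
            barrier_loc = dst if dist == 1 else wrap(dst - 1)
            if not is_covered(src, barrier_loc):
                bars[barrier_loc] = True
    return bars


adjacent = analyze_pipeline_dependencies([{"A"}, {"A"}, set()], circular=False)
distance2 = analyze_pipeline_dependencies([{"A"}, set(), {"A"}], circular=False)
looped = analyze_pipeline_dependencies([set(), {"A"}, {"A"}], circular=True)
```

In the first call the adjacent pair (0, 1) shares buffer "A", so slot 1 is marked; this is the case the distance-1 check was added for.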

- Drop the stale "distance-2+ check" reference in isCrossPipelineSafe; reword to match the unified single-distance sweep in analyzePipelineDependencies.
- Make emitPipelinedFor and emitPipelinedFlat use parallel section numbering (1..5) and parallel "Circular ..." / "Linear ..." headings for the dependency-analysis step.
- Remove a duplicate analysis comment inside emitPipelinedFlat.

antiagainst

Now looks good to me overall; just two final nits. @ThomasRaoux would you like to take another look?

ThomasRaoux

LGTM

jungpark-mlir added 6 commits April 2, 2026 14:00

Merge branch 'triton-lang:main' into 2wp

6f36913

[AMD] Refactor: extract emitPipelinePrelude/Postlude helpers

bdf6076

[AMD] Refactor: extract analyzePipelineDependencies helper

b5a9e4e

Format

401e130

ThomasRaoux reviewed Apr 6, 2026

jungpark-mlir added 2 commits April 6, 2026 21:20

Remove unnecessary for_loop_depth assertion from warp_pipeline_stage

482c64b

Merge branch 'triton-lang:main' into 2wp

149cee3

jungpark-mlir marked this pull request as ready for review April 7, 2026 16:54

jungpark-mlir requested review from antiagainst, ptillet and zhanglx13 as code owners April 7, 2026 16:54

Merge branch 'triton-lang:main' into 2wp

a96fbc7

jungpark-mlir marked this pull request as draft April 8, 2026 21:15

jungpark-mlir changed the title ~~[AMD] Warp-pipeline: back-to-back loop optimization & flat (unrolled) pipeline support~~ [WIP][AMD] Warp-pipeline: back-to-back loop optimization & flat (unrolled) pipeline support Apr 14, 2026

jungpark-mlir added 7 commits April 14, 2026 01:58

format

f487677

Merge branch 'triton-lang:main' into 2wp

35028b3

Merge branch 'triton-lang:main' into 2wp

ae5d98c

jungpark-mlir changed the title ~~[WIP][AMD] Warp-pipeline: back-to-back loop optimization & flat (unrolled) pipeline support~~ [AMD] Warp-pipeline: back-to-back loop optimization & flat (unrolled) pipeline support Apr 22, 2026

Merge branch 'main' into 2wp

296a6cf

jungpark-mlir marked this pull request as ready for review April 22, 2026 21:41

antiagainst requested changes Apr 25, 2026

jungpark-mlir added 2 commits April 26, 2026 21:36

WarpPipeliner: share helpers between createPipeline and createFlatPipeline

138295b

ConvertWarpPipeline: introduce isWarpPipelineIgnorableBarrier and getPipelineStage helpers

4b3c2de

jungpark-mlir added 8 commits April 26, 2026 21:36

WarpPipeliner: add step-numbered comments to createFlatPipeline

60c50fc

address review comments

611167a

Merge branch 'main' into 2wp

ce4ea7a

fix test

5aceca5

last few fixes

cdbcad7

Merge upstream main into 2wp

5ba68cb

Merge branch 'main' into 2wp

b36c134

Merge branch 'main' into 2wp

3a6eeee

antiagainst approved these changes Apr 30, 2026

Comment thread third_party/amd/lib/TritonAMDGPUTransforms/WarpPipeliner.cpp Outdated

Comment thread third_party/amd/lib/TritonAMDGPUTransforms/WarpPipeliner.cpp Outdated

jungpark-mlir added 3 commits April 30, 2026 22:42

address review

1346995

Merge branch 'main' into 2wp

bd5e919

Merge branch 'main' into 2wp

4a5cfe3

ThomasRaoux approved these changes May 12, 2026

antiagainst merged commit ad911ca into triton-lang:main May 12, 2026
9 checks passed

antiagainst merged 31 commits into triton-lang:main from jungpark-mlir:2wp
