[WarpSpec] add support for multiple channels sharing the same smem #9

Merged: 10 commits into ws on Jan 14, 2025

Conversation

@manman-ren (Contributor) commented on Dec 14, 2024

Summary: We already have channelsGroupedByProducers and channelsGroupedByConsumers. For one-producer-multi-consumer mode, a single buffer is used, and channelsGroupedByProducers drives that decision. channelsGroupedByConsumers is used to minimize the insertion of sync primitives: a single set of communication ops is inserted per consumer group.
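
To make the grouping concrete, here is a minimal sketch of what the two maps capture; the ChannelSketch type and its srcOp/dstOp fields are placeholders for illustration, not the pass's actual Channel definition.

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/SmallVector.h"
#include "mlir/IR/Operation.h"

// Hypothetical, simplified channel type; the real pass carries more state.
struct ChannelSketch {
  mlir::Operation *srcOp; // producer side of the channel
  mlir::Operation *dstOp; // consumer side of the channel
};

// One buffer is allocated per producer group; one set of communication/sync
// ops is emitted per consumer group.
void groupChannelsSketch(
    llvm::ArrayRef<ChannelSketch *> channels,
    llvm::DenseMap<mlir::Operation *, llvm::SmallVector<ChannelSketch *>>
        &channelsGroupedByProducers,
    llvm::DenseMap<mlir::Operation *, llvm::SmallVector<ChannelSketch *>>
        &channelsGroupedByConsumers) {
  for (ChannelSketch *c : channels) {
    channelsGroupedByProducers[c->srcOp].push_back(c);
    channelsGroupedByConsumers[c->dstOp].push_back(c);
  }
}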

For this patch, we want to share the same smem location for multiple channels that are live in different loop nests. We add allocation.shareGroup attributes to the local_allocs corresponding to channels that reuse the same smem location.

In order to reuse the same smem location, we update bufferIdx and phase through all the loop nests that share smem locations. We handle the following cases (a sketch of how bufferIdx and phase can be derived from the accumulated loop count follows the generated-code outline below):

for # persistent loop
  for # can be nested under if
  for # can be nested under if
Or
for # can be nested under if
for # can be nested under if
Or
for # persistent loop
  for # can be nested under if

The generated code will look like

for(accumLoopCount)
  t1 = IfOp
    forOp # loop A
    tmpIdx = accumLoopCount + numStepsA
    yield tmpIdx
    else yield accumLoopCount
  t2 = IfOp
    forOp # loop B
    tmpIdx = t1 + numStepsB
    yield tmpIdx
    else yield t1
  yield t2 for accumLoopCount
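
As referenced above, here is a minimal sketch of one common way bufferIdx and phase could be derived from the accumulated loop count, assuming an i64 accumLoopCount and a compile-time buffer count; the pass's actual arithmetic and builder helpers may differ.

#include <cstdint>
#include <utility>

#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/IR/Builders.h"

// bufferIdx cycles through the buffers; phase flips every full pass over them.
std::pair<mlir::Value, mlir::Value>
bufferIdxAndPhaseSketch(mlir::OpBuilder &b, mlir::Location loc,
                        mlir::Value accumLoopCount, int64_t numBuffers) {
  using namespace mlir;
  Value bufs = b.create<arith::ConstantIntOp>(loc, numBuffers, /*width=*/64);
  Value one = b.create<arith::ConstantIntOp>(loc, 1, /*width=*/64);
  Value bufferIdx = b.create<arith::RemUIOp>(loc, accumLoopCount, bufs);
  Value fullPasses = b.create<arith::DivUIOp>(loc, accumLoopCount, bufs);
  Value phase = b.create<arith::AndIOp>(loc, fullPasses, one);
  return {bufferIdx, phase};
}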

@facebook-github-bot added the CLA Signed label on Dec 14, 2024
@manman-ren (Contributor, Author) commented on Dec 14, 2024

The implementation changes appendBufferIdxArgs/createNewLoop to add an argument on the outer loop for accumLoopCount, or to add a constant as a placeholder when there is no outer loop. It also changes specializeIfOp to create a result on the IfOp that propagates accumLoopCount.
We then use a helper function, updateAccumLoopCount, to correctly link up the values.

Phase 1:

ForOp with accumLoopCount as an argument
  If
    use accumLoopCount to set initialBufferIdx
    ForOp
    generate numSteps and create an add op for accumLoopCount + numSteps
  Yield for ForOp with accumLoopCount (this will be updated later in updateAccumLoopCount)
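
A rough sketch of the linking step, under the assumption that each guarded IfOp yields either accum + numSteps (then) or the incoming accum (else), and that accumLoopCount sits at a known iter_arg/yield position; the function name, signature, and accumArgIdx parameter are illustrative, not the pass's actual helper.

#include "llvm/ADT/ArrayRef.h"
#include "mlir/Dialect/SCF/IR/SCF.h"

// Thread the accumulated loop count through a chain of guarded loop nests and
// wire the final value into the persistent loop's yield.
mlir::Value updateAccumLoopCountSketch(mlir::scf::ForOp persistentLoop,
                                       llvm::ArrayRef<mlir::scf::IfOp> guards,
                                       unsigned accumArgIdx) {
  using namespace mlir;
  // Start from the iter_arg that carries accumLoopCount into the loop body.
  Value accum = persistentLoop.getRegionIterArgs()[accumArgIdx];
  for (scf::IfOp ifOp : guards) {
    // Each specialized IfOp returns accum + numSteps in its then-region and
    // the unchanged accum in its else-region; chain through its last result.
    accum = ifOp->getResult(ifOp->getNumResults() - 1);
  }
  // The last chained value becomes the yielded accumLoopCount.
  auto yield =
      llvm::cast<scf::YieldOp>(persistentLoop.getBody()->getTerminator());
  yield->setOperand(accumArgIdx, accum);
  return accum;
}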

@htyu (Contributor) commented on Dec 18, 2024

This is great work, thanks!

BTW, can you include a lit test to help understand what this PR does exactly?

if (kv.second.size() <= 1)
  continue;
bufferMap[kv.first].getDefiningOp()->setAttr(
    "allocation.shareGroup",

Contributor:
A dumb question: why is this needed if the same buffer is already used in the IR?

@htyu (Contributor) left a comment:
Looks great with the refactoring! Just left some nit comments so far.

@@ -1673,6 +2185,8 @@ class TritonGPUWSCodePartitionPass
funcOp.dump();
});

// Assuming there are no changes to loops in loopWithBufferReuse.
DenseMap<AsyncTaskId, Value> mapForAccumLoopVar;

Contributor:
This doesn't seem to be used anywhere.

SmallVector<Operation *> loopWithBufferReuse;
reuseBuffers(asyncTaskTopOps, channels, mapToRepresenting,
             loopWithBufferReuse);
unsigned loopsWithAccumLoopCount = loopWithBufferReuse.size();

Contributor:
loopsWithAccumLoopCount no longer used?

             loopWithBufferReuse);
unsigned loopsWithAccumLoopCount = loopWithBufferReuse.size();
// Use and update loopWithBufferReuse.
Value tmpAccumLoopCount =

Contributor:
The return value is unused?

}

// For ForOps in taskTopOps, create new ForOp for each by adding phase,
// bufferIdx to the arguments.

Contributor:
Update comment to reflect the accumulatedLoopCount arg?

builder.setInsertionPoint(taskTopOps[0]);
tmpAccumLoopCount = builder.createWithAsyncTaskIds<arith::ConstantIntOp>(
    oneFor->getLoc(), 0, 64);
// populateLoopSteps(loopWithBufferReuse, accumLoopCountsAfterLoop,

Contributor:
Remove the comment?

// numSteps = ((upperBound - lowerBound) + forOpStep - 1) / forOpStep
Value numSteps = getNumSteps(forOp, builder);

// TODO: use a global flattened iteration space index for multi-dim loops.
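
For reference, a plausible shape for the quoted numSteps computation, assuming index-typed loop bounds; the real getNumSteps may differ, for example by producing an i64 and using the async-task-aware builder.

#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/SCF/IR/SCF.h"

// numSteps = ((upperBound - lowerBound) + step - 1) / step, i.e. a ceiling
// division of the trip distance by the step.
mlir::Value getNumStepsSketch(mlir::scf::ForOp forOp, mlir::OpBuilder &b) {
  using namespace mlir;
  Location loc = forOp.getLoc();
  Value lb = forOp.getLowerBound();
  Value ub = forOp.getUpperBound();
  Value step = forOp.getStep();
  Value diff = b.create<arith::SubIOp>(loc, ub, lb);
  Value one = b.create<arith::ConstantIndexOp>(loc, 1);
  Value biased = b.create<arith::AddIOp>(
      loc, diff, b.create<arith::SubIOp>(loc, step, one));
  return b.create<arith::DivUIOp>(loc, biased, step);
}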

Contributor:
If we add accumulatedLoopCount to non-sharing loop nests too, can we just unify this path with the hasParallelReuse path?

Contributor (Author):
Yes, that is right. We can use accumulatedLoopCount, which is more accurate for persistent kernels when the inner loop has varying numSteps. accumulatedLoopCount will be an argument of the outer persistent loop.

});
std::for_each(
    liveOperations.begin(), liveOperations.end(), [&](Operation *liveOp) {
      if (buffer->regionIds.size() > 1 || buffer->sharingGroup >= 0) {

Contributor:
Why do we need this extra check? Is that for buffer sharing in single-consumer mode?

Contributor (Author):
When we have buffer sharing, we want to be conservative and mark its live range as the full live range. Otherwise, we will need to combine the buffers in the same sharing group and analyze the union of regions, and the union of live ranges.
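
A minimal sketch of that conservative choice; BufferSketch and the live-range pair are placeholders, not the allocation analysis's actual types.

#include <cstddef>
#include <utility>

#include "llvm/ADT/SmallVector.h"

// Placeholder buffer descriptor for the sketch.
struct BufferSketch {
  llvm::SmallVector<int, 4> regionIds; // loop nests the buffer is used in
  int sharingGroup = -1;               // >= 0 when the smem slot is shared
};

// When a buffer participates in sharing (multiple regions or an explicit
// share group), conservatively report the full program order [0, numOps) as
// its live range instead of the locally computed one.
std::pair<size_t, size_t>
liveRangeSketch(const BufferSketch &buffer, size_t numOps,
                std::pair<size_t, size_t> localRange) {
  if (buffer.regionIds.size() > 1 || buffer.sharingGroup >= 0)
    return {0, numOps};
  return localRange;
}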

@manman-ren force-pushed the mren/ws-reuse-buffer branch from 0d062fe to 40a5741 on January 14, 2025.
@manman-ren merged commit c286564 into ws on Jan 14, 2025.
2 checks passed
@manman-ren deleted the mren/ws-reuse-buffer branch on January 16, 2025.