[BACKEND] Fix memory side effects of `tt.dot` by Jokeren · Pull Request #4033 · triton-lang/triton

Jokeren · 2024-05-29T19:37:51Z

1. Replaced `triton_nvidia_gpu.async_dot` with `triton_nvidia_gpu.group_dot`, which has an `isAsync` attribute. Maybe `warp_group_dot` is a better name?
2. Removed `memdesc` from `tt.dot` because `tt.dot` should be pure, without any side effects.
3. Removed hacks in Membar analysis.
4. Unified wgmma code generation in the backend.
5. Introduced the `DotLike` trait for `tt.dot` and `triton_nvidia_gpu.group_dot`.
6. Updated comments in matmul loop pipeline (maybe incomplete).
7. Removed the `ConvertDotConvert` pattern.
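
To illustrate why a side-effect-free `tt.dot` lets the barrier analysis drop special cases, here is a toy hazard tracker. It is a sketch under stated assumptions, not Triton's Membar implementation: `ToyOp`, `barrierPoints`, and the integer buffer ids are hypothetical, and the op names are only labels. The model assumes that an op declares the shared-memory buffers it reads and writes, and that a pure op declares none.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy model, not the MLIR/Triton API: each op lists the shared-memory
// buffers (hypothetical integer ids) it reads and writes.
// A pure op, such as a dot on register operands, lists none.
struct ToyOp {
  std::string name;
  std::vector<int> reads;
  std::vector<int> writes;
};

// Returns the indices of ops that need a barrier placed before them.
// An op needs one when it conflicts with an access that has not been
// synchronized yet: read-after-write, write-after-write, or write-after-read.
std::vector<std::size_t> barrierPoints(const std::vector<ToyOp> &ops) {
  std::set<int> pendingReads, pendingWrites;
  std::vector<std::size_t> points;
  for (std::size_t i = 0; i < ops.size(); ++i) {
    const ToyOp &op = ops[i];
    bool hazard = false;
    for (int b : op.reads)
      hazard = hazard || pendingWrites.count(b) > 0;
    for (int b : op.writes)
      hazard = hazard || pendingWrites.count(b) > 0 || pendingReads.count(b) > 0;
    if (hazard) {
      // A barrier synchronizes every earlier access.
      points.push_back(i);
      pendingReads.clear();
      pendingWrites.clear();
    }
    pendingReads.insert(op.reads.begin(), op.reads.end());
    pendingWrites.insert(op.writes.begin(), op.writes.end());
  }
  return points;
}
```

In this model a pure dot never triggers or blocks a barrier on its own, while an op that reads a shared-memory operand is handled by the same generic rule as any other reader, so no op-specific exception is needed.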

ThomasRaoux

Do we need to update `triton/lib/Dialect/TritonGPU/Transforms/F32DotTC.cpp`? I'm guessing it will stop working for wgmma, but I'm not sure if we test it.

Added a few nits; looks good to me overall.

ThomasRaoux · 2024-05-29T21:24:03Z

    ModuleOp mod = getOperation();
    mod.walk([&](Operation *op) {
-      if (!isa<tt::DotOp, ttng::DotAsyncOp>(op))
+      if (!op->hasTrait<OpTrait::DotLike>())


I believe the logic in this pass actually only applies to wgmma
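
The change in the diff above, and the narrower filter the comment hints at, can be sketched with toy types. Everything below is hypothetical (`ToyKind`, `ToyOp`, and the `matches*` helpers are not MLIR or Triton classes). It also assumes that the warp-group dot is the op that lowers to wgmma.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-ins for illustration only; these are not MLIR or Triton classes.
enum class ToyKind { Dot, GroupDot, Load };

struct ToyOp {
  ToyKind kind;
  bool dotLike; // models an op carrying a DotLike-style trait
};

// Pre-change style: list every concrete op kind the pass cares about.
bool matchesByKindList(const ToyOp &op) {
  return op.kind == ToyKind::Dot || op.kind == ToyKind::GroupDot;
}

// Post-change style: ask for the shared trait instead of naming op kinds.
bool matchesByTrait(const ToyOp &op) { return op.dotLike; }

// Narrower filter in the spirit of the review comment: only the op that is
// assumed here to be backed by wgmma.
bool matchesGroupDotOnly(const ToyOp &op) {
  return op.kind == ToyKind::GroupDot;
}

// Counts the ops a walk would process under a given filter.
template <typename Pred>
std::size_t countMatches(const std::vector<ToyOp> &ops, Pred pred) {
  std::size_t n = 0;
  for (const ToyOp &op : ops)
    if (pred(op))
      ++n;
  return n;
}
```

The kind list and the trait select the same ops in this toy, which is why the swap is behavior-preserving. The group-dot-only filter selects fewer, which is the trade-off the comment raises.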

ThomasRaoux

Thanks!

* Add blocked to dot shortcut
* pack tensors in vectors instead of structures
* fix
* add moe bypass option
* initial commit
* fix
* fix
* add missing configurations and add more checks in passes
* adjust global load layout for vllm swizzling format
* Remove debug print
* make load width dependent on data type
* fix int 8 logic
* generalize load analysis: return last load in dependent load chain instead of 2
* Add message for assert failure
  So that people know what the problem is when this compiler error shows up
* add k=512/1024 cases
* [BACKEND] Add memory space to memdesc type. (triton-lang#4027)
  Currently only shared memory is supported but this will allow supporting different kinds of local memory (like private) or others.
* [BACKEND] Fix memory side effects of `tt.dot` (triton-lang#4033)
  1. Replaced `triton_nvidia_gpu.async_dot` with `triton_nvidia_gpu.group_dot` which has a `isAsync` attribute. Maybe `warp_group_dot` is a better name?
  2. Removed `memdesc` from `tt.dot` because `tt.dot` should be pure, without any side effects
  3. Removed hacks in Membar analysis.
  4. Unified wgmma code generation in the backend.
  5. Introduced the `DotLike` trait for `tt.dot` and `triton_nvidia_gpu.group_dot`.
  6. Updated comments in matmul loop pipeline (maybe incomplete).
  7. Removed the `ConvertDotConvert` pattern
* remove streamPipelinev2
* [TEST] NFC: Drop irrelevant NVIDIA specific attributes (triton-lang#4384)
  Software pipelining should not be using them. This makes it cleaner and prepares reusing the same test inputs for AMD side.
* [Pipeliner] NFC: Expose Pipeliner infrastructure for use by other target backends (triton-lang#4155)
  Non-functional changes to expose `lib/Dialect/TritonGPU/Transforms/Pipeliner` infrastructure for use by other target backends.
* [BACKEND] Fix regression in pipeliner pre-checks. (triton-lang#4196)
  During some previous refactoring we changed the logic and started pipelining cases that had incompatible shared encoding. This was missed because one of the lit tests had not been updated :(
* [Backend][AMD] Introduce stream pipeliner v2 (triton-lang#4148)
  This PR first promotes common infrastructure in `lib/Dialect/TritonGPU/Transforms/Pipeliner` to enable inclusion by other target backends. No other changes have been made to the lib/include directories.
  Second, the `tritonamdgpu-stream-pipeline` pass has been completely revamped based on code from `lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp` using similar scheduling passes to compute multi-stage pipelines. Some of this code could be consolidated further in the CoarseSchedule class (or perhaps a derived LoopScheduler class). This modulo scheduler collects `tt.load` ops and generates local_storage and management ops for the ramp-up stage (stage-0), then collecting all uses of the loads for stage-1. Multi-buffering is introduced when num_stages exceeds the max distance between load and uses. Buffering may be in Shared memory for `tt.dot` uses or Registers for all other uses. The current implementation does not support peeling the last iteration if the loop is dynamic.
  Lastly, the `tritonamdgpu-reorder-instructions` pass has been enhanced to move `tt.load` ops as early as possible in its region. This includes loop bodies as well as func entry blocks for the case of ramp-up. This pass will also move `triton_gpu.local_store` ops as early as possible if their source is not directly from a `tt.load`. In this way, a multi-buffered pipeline will overlap in this order:
  1. tt.load buffer+2
  2. tg.local_store buffer+1
  3. tt.dot buffer+0
  ---------
  Co-authored-by: Lei Zhang <antiagainst@gmail.com>
* [AMD] Prefetch loads and independent local_stores (triton-lang#4429)
  This pass is enhanced to move tt.loads as early as possible. This enables buffering in registers for global loads while computing previous tiles (stream-pipelining), but may increase register pressure. If ttg.local_stores are independent of loads in the loop (i.e. double buffering in shared memory), then this pass will also move those early to overlap with global loads and compute.
* [Pipeliner] Implement dynamic loop peeling
  - enabled for tritonamdgpu-stream-pipeline
* * disabled for num_stages > 2
* updated tests
* * guard each stage of ramp-down in epilogue
* enable peeling for any num_stages
* * pipeline reg buffers
* [AMD] Fixed bug with tritonamdgpu-reorder-instructions
  - blindly moving local_loads can violate memory access order
  - also fixed case when moving instructions to top of loop
* * only move ops early
* Fix in streamPipelinerV2
* Fix lit tests
* [Backend][AMD] Add temporary environment variable for pipeliner v2 (triton-lang#4430)
  This commit adds a new environment variable to enable pipeliner v2. It is expected to be temporary while we enable the new pipeliner and get all cases covered.
  Co-authored-by: SJW <swaters@amd.com>

---------

Co-authored-by: Ognjen Plavsic <plognjen@amd.com>
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
Co-authored-by: Ognjen Plavsic <ognjen.plavsic@luxoft.com>
Co-authored-by: Vinayak Gokhale <Vinayak.Gokhale@amd.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Lei Zhang <antiagainst@gmail.com>
Co-authored-by: SJW <48454132+sjw36@users.noreply.github.com>
Co-authored-by: SJW <swaters@amd.com>
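
The overlap order quoted in the stream pipeliner v2 commit message above (load at buffer+2, local store at buffer+1, dot at buffer+0) can be pictured with a small steady-state model. This is only a sketch: `StageTiles` and `tilesForIteration` are hypothetical names, tiles are plain integers, and ramp-up and ramp-down are represented only as stages that have nothing to do.

```cpp
#include <cassert>
#include <optional>

// Toy model (hypothetical names): the tile each stage handles in one loop
// iteration when the global load runs two tiles ahead of the dot and the
// local store runs one tile ahead. An empty optional means the stage is idle.
struct StageTiles {
  std::optional<int> load;       // tile fetched by the global load
  std::optional<int> localStore; // tile written to the local buffer
  std::optional<int> dot;        // tile consumed by the dot
};

// Returns the tile index if it exists, otherwise an empty optional.
std::optional<int> tileOrIdle(int tile, int numTiles) {
  if (tile >= 0 && tile < numTiles)
    return tile;
  return std::nullopt;
}

// Maps a loop iteration to the tiles touched by the three stages.
StageTiles tilesForIteration(int iter, int numTiles) {
  StageTiles t;
  t.load = tileOrIdle(iter + 2, numTiles);
  t.localStore = tileOrIdle(iter + 1, numTiles);
  t.dot = tileOrIdle(iter, numTiles);
  return t;
}
```

With four tiles, iteration 0 loads tile 2, stores tile 1, and multiplies tile 0. In the last iteration only the dot still has work, which is the part a peeled epilogue would have to guard.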

Jokeren added 14 commits May 27, 2024 09:03

c1d7ab7 Update
96b3c8c Update
98345d6 Update
a6bb818 Update
cdc96f6 Update
7702eaf Update
e9a2fef Update
bc84629 Merge branch 'main' into keren/cleanup-dot
1ec506a Update
f662661 Update
837e98c Update comments
29736bf Update
9df4590 Update
c1f1026 Update

ThomasRaoux reviewed May 29, 2024

7b6059f Change name

ThomasRaoux approved these changes May 30, 2024

Jokeren marked this pull request as ready for review May 30, 2024 01:29

Jokeren requested a review from ptillet as a code owner May 30, 2024 01:29

Jokeren merged commit d86ae7b into main May 30, 2024

Jokeren deleted the keren/cleanup-dot branch May 30, 2024 02:01

Jokeren merged 15 commits into main from keren/cleanup-dot
