Triton bypass LDS changes #695
Merged

jataylo merged 32 commits into release/internal/3.1.x on Jan 9, 2025
Conversation
So that people know what the problem is when this compiler error shows up
Currently only shared memory is supported, but this will allow supporting other kinds of local memory (such as private memory).
1. Replaced `triton_nvidia_gpu.async_dot` with `triton_nvidia_gpu.group_dot`, which has an `isAsync` attribute. Maybe `warp_group_dot` is a better name?
2. Removed `memdesc` from `tt.dot`, because `tt.dot` should be pure, without any side effects.
3. Removed hacks in the Membar analysis.
4. Unified wgmma code generation in the backend.
5. Introduced the `DotLike` trait for `tt.dot` and `triton_nvidia_gpu.group_dot`.
6. Updated comments in the matmul loop pipeliner (maybe incomplete).
7. Removed the `ConvertDotConvert` pattern.
…get backends (triton-lang#4155) Non-functional changes to expose `lib/Dialect/TritonGPU/Transforms/Pipeliner` infrastructure for use by other target backends.
During some previous refactoring we changed the logic and started pipelining cases that had incompatible shared encoding. This was missed because one of the lit tests had not been updated :(
This PR first promotes common infrastructure in `lib/Dialect/TritonGPU/Transforms/Pipeliner` so it can be included by other target backends. No other changes have been made to the lib/include directories.

Second, the `tritonamdgpu-stream-pipeline` pass has been completely revamped based on code from `lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp`, using similar scheduling passes to compute multi-stage pipelines. Some of this code could be consolidated further into the CoarseSchedule class (or perhaps a derived LoopScheduler class). This modulo scheduler collects `tt.load` ops and generates local_storage and management ops for the ramp-up stage (stage-0), then collects all uses of the loads for stage-1. Multi-buffering is introduced when num_stages exceeds the max distance between a load and its uses. Buffering may be in shared memory for `tt.dot` uses, or in registers for all other uses. The current implementation does not support peeling the last iteration if the loop is dynamic.

Lastly, the `tritonamdgpu-reorder-instructions` pass has been enhanced to move `tt.load` ops as early as possible in their region. This includes loop bodies as well as func entry blocks for the ramp-up case. This pass will also move `triton_gpu.local_store` ops as early as possible if their source is not directly from a `tt.load`. In this way, a multi-buffered pipeline will overlap in this order:

1. `tt.load` buffer+2
2. `triton_gpu.local_store` buffer+1
3. `tt.dot` buffer+0

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>
This pass is enhanced to move `tt.load` ops as early as possible. This enables buffering in registers for global loads while computing previous tiles (stream pipelining), but may increase register pressure. If `ttg.local_store` ops are independent of loads in the loop (i.e. double buffering in shared memory), then this pass will also move those early to overlap with global loads and compute.
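The three-buffer overlap described above can be sketched in plain Python. Everything here is an illustrative stand-in, not Triton API: `tt.load`, `ttg.local_store`, and `tt.dot` are modeled as dictionary writes/reads, and `NUM_BUFFERS`, `run_pipeline`, and the tile strings are invented for the sketch.

```python
NUM_BUFFERS = 3  # number of rotated shared-memory buffers (num_stages)

def run_pipeline(num_tiles):
    """Simulate the modulo-scheduled loop: at iteration i the pipeline
    issues the global load for tile i+2 (tt.load), stores tile i+1 into
    shared memory (ttg.local_store), and computes on tile i (tt.dot)."""
    assert num_tiles >= 2
    regs = {}    # register buffers: landing zone for global loads
    shared = {}  # shared-memory buffers, indexed modulo NUM_BUFFERS
    results = []

    # Ramp-up (stage 0): issue the first loads/stores before the loop.
    regs[0] = f"tile{0}"
    regs[1] = f"tile{1}"
    shared[0 % NUM_BUFFERS] = regs[0]

    # Steady state: each iteration touches three different buffers.
    for i in range(num_tiles):
        if i + 2 < num_tiles:                    # tt.load  buffer+2
            regs[i + 2] = f"tile{i + 2}"
        if i + 1 < num_tiles:                    # local_store buffer+1
            shared[(i + 1) % NUM_BUFFERS] = regs[i + 1]
        results.append(shared[i % NUM_BUFFERS])  # tt.dot   buffer+0
    return results
```

The point of the rotation is that the value consumed by the "dot" at iteration i was loaded two iterations earlier and staged one iteration earlier, so the global load, the local store, and the compute of three consecutive tiles are all in flight at once.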
- enabled for tritonamdgpu-stream-pipeline
- updated tests
- enable peeling for any num_stages
- blindly moving local_loads can violate memory access order
- also fixed a case when moving instructions to the top of the loop
…riton-lang#4430) This commit adds a new environment variable to enable pipeliner v2. It is expected to be temporary while we enable the new pipeliner and get all cases covered. Co-authored-by: SJW <swaters@amd.com>
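The PR text does not name the environment variable, so the sketch below uses a hypothetical placeholder (`TRITON_NEW_PIPELINER`) just to illustrate the opt-in gating pattern it describes:

```python
import os

def use_pipeliner_v2() -> bool:
    """Temporary opt-in switch while the new pipeliner is stabilized.
    NOTE: "TRITON_NEW_PIPELINER" is a placeholder name; the actual
    variable introduced by triton-lang#4430 is not given in this text."""
    return os.environ.get("TRITON_NEW_PIPELINER", "0") == "1"
```

Such flags are typically removed once the new code path covers all cases and becomes the default.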
jataylo approved these changes on Jan 7, 2025.

jataylo left a comment:
Sniff test worked for me; let's get a build and run full-suite testing.
Intended for updating Triton in the pytorch release/2.5 branch.