[Backend][AMD] Add temporary environment variable for pipeliner v2 by antiagainst · Pull Request #4430 · triton-lang/triton

antiagainst · 2024-07-30T16:23:09Z

This commit adds a new environment variable to enable pipeliner v2. It is expected to be temporary while we enable the new pipeliner and get all cases covered.
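A minimal sketch of how an opt-in environment variable like this is commonly wired into a compiler backend. This page does not give the variable's name or the code that reads it, so `EXAMPLE_ENABLE_PIPELINER_V2`, the returned labels, and both helper functions are hypothetical placeholders rather than Triton's actual API.

```python
import os

# Hypothetical variable name, for illustration only. The real name is
# defined in the PR diff, which is not reproduced on this page.
PIPELINER_V2_ENV_VAR = "EXAMPLE_ENABLE_PIPELINER_V2"


def use_pipeliner_v2(environ=None):
    """Return True only when the opt-in variable is set to exactly "1"."""
    env = os.environ if environ is None else environ
    return env.get(PIPELINER_V2_ENV_VAR, "0") == "1"


def select_stream_pipeliner(environ=None):
    """Pick which pipeliner to run; the labels are placeholders, not pass names."""
    return "pipeliner-v2" if use_pipeliner_v2(environ) else "pipeliner-v1"
```

Defaulting to the existing pipeliner unless the variable is explicitly set keeps current behavior unchanged, which fits a switch that is expected to be temporary.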

…riton-lang#4430) This commit adds a new environment variable to enable pipeliner v2. It is expected to be temporary while we enable the new pipeliner and get all cases covered. Co-authored-by: SJW <swaters@amd.com>

* Add blocked to dot shortcut
* pack tensors in vectors instead of structures
* fix
* add moe bypass option
* initial commit
* fix
* fix
* add missing configurations and add more checks in passes
* adjust global load layout for vllm swizzling format
* Remove debug print
* make load width dependent on data type
* fix int 8 logic
* generalize load analysis: return last load in dependent load chain instead of 2
* Add message for assert failure

  So that people know what the problem is when this compiler error shows up
* add k=512/1024 cases
* [BACKEND] Add memory space to memdesc type. (triton-lang#4027)

  Currently only shared memory is supported but this will allow supporting different kinds of local memory (like private) or others.
* [BACKEND] Fix memory side effects of `tt.dot` (triton-lang#4033)

  1. Replaced `triton_nvidia_gpu.async_dot` with `triton_nvidia_gpu.group_dot` which has a `isAsync` attribute. Maybe `warp_group_dot` is a better name?
  2. Removed `memdesc` from `tt.dot` because `tt.dot` should be pure, without any side effects
  3. Removed hacks in Membar analysis.
  4. Unified wgmma code generation in the backend.
  5. Introduced the `DotLike` trait for `tt.dot` and `triton_nvidia_gpu.group_dot`.
  6. Updated comments in matmul loop pipeline (maybe incomplete).
  7. Removed the `ConvertDotConvert` pattern
* remove streamPipelinev2
* [TEST] NFC: Drop irrelevant NVIDIA specific attributes (triton-lang#4384)

  Software pipelining should not be using them. This makes it cleaner and prepares reusing the same test inputs for AMD side.
* [Pipeliner] NFC: Expose Pipeliner infrastructure for use by other target backends (triton-lang#4155)

  Non-functional changes to expose `lib/Dialect/TritonGPU/Transforms/Pipeliner` infrastructure for use by other target backends.
* [BACKEND] Fix regression in pipeliner pre-checks. (triton-lang#4196)

  During some previous refactoring we changed the logic and started pipelining cases that had incompatible shared encoding. This was missed because one of the lit tests had not been updated :(
* [Backend][AMD] Introduce stream pipeliner v2 (triton-lang#4148)

  This PR first promotes common infrastructure in `lib/Dialect/TritonGPU/Transforms/Pipeliner` to enable inclusion by other target backends. No other changes have been made to the lib/include directories.

  Second, the `tritonamdgpu-stream-pipeline` pass has been completely revamped based on code from `lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp` using similar scheduling passes to compute multi-stage pipelines. Some of this code could be consolidated further in the CoarseSchedule class (or perhaps a derived LoopScheduler class).

  This modulo scheduler collects `tt.load` ops and generates local_storage and management ops for the ramp-up stage (stage-0), then collects all uses of the loads for stage-1. Multi-buffering is introduced when num_stages exceeds the max distance between load and uses. Buffering may be in shared memory for `tt.dot` uses or registers for all other uses. The current implementation does not support peeling the last iteration if the loop is dynamic.

  Lastly, the `tritonamdgpu-reorder-instructions` pass has been enhanced to move `tt.load` ops as early as possible in its region. This includes loop bodies as well as func entry blocks for the case of ramp-up. This pass will also move `triton_gpu.local_store` ops as early as possible if their source is not directly from a `tt.load`. In this way, a multi-buffered pipeline will overlap in this order:

  1. tt.load buffer+2
  2. tg.local_store buffer+1
  3. tt.dot buffer+0

  ---------

  Co-authored-by: Lei Zhang <antiagainst@gmail.com>
* [AMD] Prefetch loads and independent local_stores (triton-lang#4429)

  This pass is enhanced to move tt.loads as early as possible. This enables buffering in registers for global loads while computing previous tiles (stream-pipelining), but may increase register pressure. If ttg.local_stores are independent of loads in the loop (i.e. double buffering in shared memory), then this pass will also move those early to overlap with global loads and compute.
* [Pipeliner] Implement dynamic loop peeling
  - enabled for tritonamdgpu-stream-pipeline
* * disabled for num_stages > 2
  * updated tests
* * guard each stage of ramp-down in epilogue
  * enable peeling for any num_stages
* * pipeline reg buffers
* [AMD] Fixed bug with tritonamdgpu-reorder-instructions
  - blindly moving local_loads can violate memory access order
  - also fixed case when moving instructions to top of loop
* * only move ops early
* Fix in streamPipelinerV2
* Fix lit tests
* [Backend][AMD] Add temporary environment variable for pipeliner v2 (triton-lang#4430)

  This commit adds a new environment variable to enable pipeliner v2. It is expected to be temporary while we enable the new pipeliner and get all cases covered.

  Co-authored-by: SJW <swaters@amd.com>

---------

Co-authored-by: Ognjen Plavsic <plognjen@amd.com>
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
Co-authored-by: Ognjen Plavsic <ognjen.plavsic@luxoft.com>
Co-authored-by: Vinayak Gokhale <Vinayak.Gokhale@amd.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Lei Zhang <antiagainst@gmail.com>
Co-authored-by: SJW <48454132+sjw36@users.noreply.github.com>
Co-authored-by: SJW <swaters@amd.com>
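The overlap order quoted in the stream pipeliner v2 commit message above (load for buffer+2, local store for buffer+1, dot for buffer+0) can be illustrated with a toy schedule. This is only a sketch of the idea in plain Python, not the pass's implementation; the function name `overlap_schedule`, the dictionary keys, and the one-step-per-stage timing model are assumptions made for illustration.

```python
def overlap_schedule(num_tiles):
    """Return, for each step, the tile index each op works on (None if idle).

    Toy model: the global load runs two tiles ahead of the dot and the
    local store runs one tile ahead, mirroring the quoted overlap order.
    """
    def tile_or_none(index):
        # An op is idle when its tile index falls outside the loop range.
        return index if 0 <= index < num_tiles else None

    steps = []
    # The first two steps model the ramp-up (the dot is still idle); the
    # last two model the ramp-down (the load has run out of tiles).
    for step in range(num_tiles + 2):
        steps.append({
            "load": tile_or_none(step),             # buffer+2 relative to dot
            "local_store": tile_or_none(step - 1),  # buffer+1 relative to dot
            "dot": tile_or_none(step - 2),          # buffer+0
        })
    return steps


if __name__ == "__main__":
    for step, ops in enumerate(overlap_schedule(4)):
        print(step, ops)
```

In the steady-state steps all three ops are busy on different tiles, which is the overlap the commit message describes; the idle slots at either end correspond to the ramp-up and ramp-down stages it mentions.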

antiagainst force-pushed the amd-enable-new-pipeliner branch from d182413 to c79bf88 Compare July 30, 2024 16:27

antiagainst marked this pull request as ready for review July 30, 2024 16:30

antiagainst requested a review from zhanglx13 as a code owner July 30, 2024 16:30

zhanglx13 approved these changes Jul 30, 2024

[Backend][AMD] Add temporary environment variable for pipeliner v2

1ec7dde

antiagainst force-pushed the amd-enable-new-pipeliner branch from c79bf88 to 1ec7dde Compare July 30, 2024 16:33

antiagainst merged commit 86f0822 into triton-lang:main Jul 30, 2024

antiagainst deleted the amd-enable-new-pipeliner branch July 30, 2024 17:25

jlebar mentioned this pull request Sep 3, 2024

Build LLVMAarch64CodeGen if CMAKE_OSX_ARCHITECTURES is arm64. #4637

Merged

antiagainst merged 1 commit into triton-lang:main from antiagainst:amd-enable-new-pipeliner
