[Pipeliner] NFC: Expose Pipeliner infrastructure for use by other target backends#4155

Merged
antiagainst merged 2 commits into triton-lang:main from ROCm:sjw-nfc-pipeliner-hdrs
Jun 19, 2024

Conversation

@sjw36
Contributor

@sjw36 sjw36 commented Jun 18, 2024

Non-functional changes to expose lib/Dialect/TritonGPU/Transforms/Pipeliner infrastructure for use by other target backends.

See use here #4148.

…er target backends

- Moved PipelineExpander.h, PipeliningUtility.h, and Schedule.h (consolidated)
@sjw36 sjw36 requested a review from ptillet as a code owner June 18, 2024 14:14
@sjw36 sjw36 changed the title from "[Pipeliner-NFC] Moved common infrastructure to include dir for use by other target backends" to "[Pipeliner-NFC] Expose Pipeliner infrastructure for use by other target backends" Jun 18, 2024
@sjw36 sjw36 marked this pull request as draft June 18, 2024 17:09
@antiagainst antiagainst marked this pull request as ready for review June 18, 2024 18:50
@antiagainst antiagainst changed the title from "[Pipeliner-NFC] Expose Pipeliner infrastructure for use by other target backends" to "[Pipeliner] NFC: Expose Pipeliner infrastructure for use by other target backends" Jun 19, 2024
@antiagainst antiagainst merged commit 6f6d032 into triton-lang:main Jun 19, 2024
@ThomasRaoux
Collaborator

Sorry for the delay, @pawelszczerbuk could you take a look and give post-commit comments if any?

bertmaher pushed a commit to bertmaher/triton that referenced this pull request Dec 10, 2024
…get backends (triton-lang#4155)

Non-functional changes to expose
`lib/Dialect/TritonGPU/Transforms/Pipeliner` infrastructure for use by
other target backends.
plognjen pushed a commit to ROCm/triton that referenced this pull request Jan 5, 2025
…get backends (triton-lang#4155)

Non-functional changes to expose
`lib/Dialect/TritonGPU/Transforms/Pipeliner` infrastructure for use by
other target backends.
jataylo pushed a commit to ROCm/triton that referenced this pull request Jan 9, 2025
* Add blocked to dot shortcut

* pack tensors in vectors instead of structures

* fix

* add moe bypass option

* initial commit

* fix

* fix

* add missing configurations and add more checks in passes

* adjust global load layout for vllm swizzling format

* Remove debug print

* make load width dependent on data type

* fix int8 logic

* generalize load analysis: return last load in dependent load chain instead of 2

* Add message for assert failure

So that people know what the problem is when this compiler error shows up

* add k=512/1024 cases

* [BACKEND] Add memory space to memdesc type. (triton-lang#4027)

Currently only shared memory is supported, but this will allow supporting
different kinds of local memory (like private) or others.
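The idea can be pictured with a small sketch (hypothetical Python names for illustration only, not the actual MLIR API): the memory-descriptor type carries an explicit memory-space tag, so later passes can branch on the space instead of assuming everything lives in shared memory.

```python
from dataclasses import dataclass
from enum import Enum, auto

class MemorySpace(Enum):
    SHARED = auto()   # currently the only supported space
    PRIVATE = auto()  # a possible future kind of local memory

@dataclass(frozen=True)
class MemDescType:
    """Toy stand-in for a memdesc type that now carries its memory space."""
    shape: tuple
    element_type: str
    memory_space: MemorySpace

def is_shared(ty: MemDescType) -> bool:
    # Passes can query the space explicitly rather than assuming shared.
    return ty.memory_space is MemorySpace.SHARED

buf = MemDescType((128, 64), "f16", MemorySpace.SHARED)
```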

* [BACKEND] Fix memory side effects of `tt.dot` (triton-lang#4033)

1. Replaced `triton_nvidia_gpu.async_dot` with
`triton_nvidia_gpu.group_dot` which has a `isAsync` attribute. Maybe
`warp_group_dot` is a better name?
2. Removed `memdesc` from `tt.dot` because `tt.dot` should be pure,
without any side effects
3. Removed hacks in Membar analysis.
4. Unified wgmma code generation in the backend.
5. Introduced the `DotLike` trait for `tt.dot` and
`triton_nvidia_gpu.group_dot`.
6. Updated comments in matmul loop pipeline (maybe incomplete).
7. Removed the `ConvertDotConvert` pattern

* remove streamPipelinev2

* [TEST] NFC: Drop irrelevant NVIDIA specific attributes (triton-lang#4384)

Software pipelining should not be using them. This makes the tests
cleaner and prepares for reusing the same test inputs on the AMD side.

* [Pipeliner] NFC: Expose Pipeliner infrastructure for use by other target backends (triton-lang#4155)

Non-functional changes to expose
`lib/Dialect/TritonGPU/Transforms/Pipeliner` infrastructure for use by
other target backends.

* [BACKEND] Fix regression in pipeliner pre-checks. (triton-lang#4196)

During some previous refactoring we changed the logic and started
pipelining cases that had incompatible shared encodings. This was missed
because one of the lit tests had not been updated :(

* [Backend][AMD] Introduce stream pipeliner v2 (triton-lang#4148)

This PR first promotes common infrastructure in
`lib/Dialect/TritonGPU/Transforms/Pipeliner` to enable inclusion by
other target backends. No other changes have been made to the
lib/include directories.

Second, the `tritonamdgpu-stream-pipeline` pass has been completely
revamped based on code from
`lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp`
using similar scheduling passes to compute multi-stage pipelines. Some
of this code could be consolidated further in the CoarseSchedule class
(or perhaps a derived LoopScheduler class). This modulo scheduler
collects `tt.load` ops and generates local_storage and management ops
for the ramp-up stage (stage-0), then collects all uses of the loads
for stage-1. Multi-buffering is introduced when num_stages exceeds the
max distance between load and uses. Buffering may be in shared memory
for `tt.dot` uses or in registers for all other uses. The current
implementation does not support peeling the last iteration if the loop
is dynamic.
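The scheduling decision described above can be sketched as a toy Python model (hypothetical helper names, not the actual pass):

```python
def plan_pipeline(num_stages: int, load_use_distance: int) -> dict:
    """Toy model of the modulo scheduler's decision, per the PR
    description: loads and their local-storage management ops go in the
    ramp-up stage (stage 0), and all uses of the loads go in stage 1."""
    schedule = {"load_stage": 0, "use_stage": 1}
    # Multi-buffering is introduced when num_stages exceeds the max
    # distance between a load and its uses.
    schedule["multi_buffered"] = num_stages > load_use_distance
    return schedule
```

For example, with `num_stages=3` and a load-to-use distance of 1, the sketch reports that multi-buffering is needed, while a single-stage loop would not trigger it.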

Lastly, the `tritonamdgpu-reorder-instructions` pass has been enhanced
to move `tt.load` ops as early as possible in its region. This includes
loop bodies as well as func entry blocks for the case of ramp-up. This
pass will also move `triton_gpu.local_store` ops as early as possible if
their source is not directly from a `tt.load`. In this way, a
multi-buffered pipeline will overlap in this order:
1. tt.load buffer+2
2. ttg.local_store buffer+1
3. tt.dot buffer+0
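The buffer rotation implied by this ordering can be sketched as follows (illustrative Python only, assuming three buffers for a buffer+2 load distance):

```python
NUM_BUFFERS = 3  # assumption for illustration: load runs 2 iterations ahead

def iteration_slots(i: int) -> dict:
    """For loop iteration i, which buffer slot each overlapped op
    touches: the load writes the furthest-ahead slot, the local_store
    fills the next one, and the dot consumes the current one."""
    return {
        "tt.load":         (i + 2) % NUM_BUFFERS,
        "ttg.local_store": (i + 1) % NUM_BUFFERS,
        "tt.dot":          i % NUM_BUFFERS,
    }
```

Each op stays one slot ahead of its consumer, so no iteration reads a buffer that is still being written.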

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>

* [AMD] Prefetch loads and independent local_stores (triton-lang#4429)

This pass is enhanced to move tt.loads as early as possible. This
enables buffering in registers for global loads while computing previous
tiles (stream-pipelining), but may increase register pressure.

If ttg.local_stores are independent of loads in the loop (i.e. double
buffering in shared memory), then this pass will also move those early
to overlap with global loads and compute.
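As a toy illustration of the reordering (hypothetical Python, not the actual pass, and ignoring the SSA and memory-dependence checks the real pass must perform), hoisting the loads simply moves them ahead of the compute in a straight-line block:

```python
def hoist_loads(ops: list) -> list:
    """Toy reordering: move load ops as early as possible within a
    block, preserving the relative order of all other ops. The real
    pass also verifies that no dependence is violated before moving."""
    loads = [op for op in ops if op.startswith("load")]
    rest = [op for op in ops if not op.startswith("load")]
    return loads + rest
```

This mirrors the effect described above: global loads issue early so their latency overlaps with the compute of previous tiles, at the cost of holding their results in registers longer.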

* [Pipeliner] Implement dynamic loop peeling
 - enabled for tritonamdgpu-stream-pipeline

* * disabled for num_stages > 2
* updated tests

* * guard each stage of ramp-down in epilogue
* enable peeling for any num_stages

* * pipeline reg buffers

* [AMD] Fixed bug with tritonamdgpu-reorder-instructions
    - blindly moving local_loads can violate memory access order
    - also fixed a case when moving instructions to the top of a loop

* * only move ops early

* Fix in streamPipelinerV2

* Fix lit tests

* [Backend][AMD] Add temporary environment variable for pipeliner v2 (triton-lang#4430)

This commit adds a new environment variable to enable pipeliner v2. It
is expected to be temporary while we enable the new pipeliner and get
all cases covered.

Co-authored-by: SJW <swaters@amd.com>

---------

Co-authored-by: Ognjen Plavsic <plognjen@amd.com>
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
Co-authored-by: Ognjen Plavsic <ognjen.plavsic@luxoft.com>
Co-authored-by: Vinayak Gokhale <Vinayak.Gokhale@amd.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Lei Zhang <antiagainst@gmail.com>
Co-authored-by: SJW <48454132+sjw36@users.noreply.github.com>
Co-authored-by: SJW <swaters@amd.com>