[Pipeliner] NFC: Expose Pipeliner infrastructure for use by other target backends#4155

Merged
antiagainst merged 2 commits into triton-lang:main from ROCm:sjw-nfc-pipeliner-hdrs
Jun 19, 2024

Conversation

@sjw36
Contributor

@sjw36 sjw36 commented Jun 18, 2024

Non-functional changes to expose lib/Dialect/TritonGPU/Transforms/Pipeliner infrastructure for use by other target backends.

See use here #4148.

…er target backends

- Moved PipelineExpander.h, PipeliningUtility.h, and Schedule.h (consolidated)
@sjw36 sjw36 requested a review from ptillet as a code owner June 18, 2024 14:14
@sjw36 sjw36 changed the title from "[Pipeliner-NFC] Moved common infrastructure to include dir for use by other target backends" to "[Pipeliner-NFC] Expose Pipeliner infrastructure for use by other target backends" Jun 18, 2024
@sjw36 sjw36 marked this pull request as draft June 18, 2024 17:09
@antiagainst antiagainst marked this pull request as ready for review June 18, 2024 18:50
@antiagainst antiagainst changed the title from "[Pipeliner-NFC] Expose Pipeliner infrastructure for use by other target backends" to "[Pipeliner] NFC: Expose Pipeliner infrastructure for use by other target backends" Jun 19, 2024
@antiagainst antiagainst merged commit 6f6d032 into triton-lang:main Jun 19, 2024
@ThomasRaoux
Collaborator

Sorry for the delay, @pawelszczerbuk could you take a look and give post-commit comments if any?

bertmaher pushed a commit to bertmaher/triton that referenced this pull request Dec 10, 2024
…get backends (triton-lang#4155)

Non-functional changes to expose
`lib/Dialect/TritonGPU/Transforms/Pipeliner` infrastructure for use by
other target backends.
plognjen pushed a commit to ROCm/triton that referenced this pull request Jan 5, 2025
…get backends (triton-lang#4155)

Non-functional changes to expose
`lib/Dialect/TritonGPU/Transforms/Pipeliner` infrastructure for use by
other target backends.
jataylo pushed a commit to ROCm/triton that referenced this pull request Jan 9, 2025
* Add blocked to dot shortcut

* pack tensors in vectors instead of structures

* fix

* add moe bypass option

* initial commit

* fix

* fix

* add missing configurations and add more checks in passes

* adjust global load layout for vllm swizzling format

* Remove debug print

* make load width dependent on data type

* fix int8 logic

* generalize load analysis: return last load in dependent load chain instead of 2

* Add message for assert failure

So that people know what the problem is when this compiler error shows up

* add k=512/1024 cases

* [BACKEND] Add memory space to memdesc type. (triton-lang#4027)

Currently only shared memory is supported, but this will allow supporting
different kinds of local memory (like private) or others.
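The idea can be pictured with a small sketch (hypothetical Python names for illustration only, not the actual MLIR API): the memory-descriptor type carries an explicit memory-space tag, so later passes can branch on the space instead of assuming everything lives in shared memory.

```python
from dataclasses import dataclass
from enum import Enum, auto

class MemorySpace(Enum):
    SHARED = auto()   # currently the only supported space
    PRIVATE = auto()  # a possible future kind of local memory

@dataclass(frozen=True)
class MemDescType:
    """Toy stand-in for a memdesc type that now carries its memory space."""
    shape: tuple
    element_type: str
    memory_space: MemorySpace

def is_shared(ty: MemDescType) -> bool:
    # Passes can query the space explicitly rather than assuming shared.
    return ty.memory_space is MemorySpace.SHARED

buf = MemDescType((128, 64), "f16", MemorySpace.SHARED)
```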

* [BACKEND] Fix memory side effects of `tt.dot` (triton-lang#4033)

1. Replaced `triton_nvidia_gpu.async_dot` with
`triton_nvidia_gpu.group_dot` which has a `isAsync` attribute. Maybe
`warp_group_dot` is a better name?
2. Removed `memdesc` from `tt.dot` because `tt.dot` should be pure,
without any side effects
3. Removed hacks in Membar analysis.
4. Unified wgmma code generation in the backend.
5. Introduced the `DotLike` trait for `tt.dot` and
`triton_nvidia_gpu.group_dot`.
6. Updated comments in matmul loop pipeline (maybe incomplete).
7. Removed the `ConvertDotConvert` pattern

* remove streamPipelinev2

* [TEST] NFC: Drop irrelevant NVIDIA specific attributes (triton-lang#4384)

Software pipelining should not be using them. This makes the tests
cleaner and prepares for reusing the same test inputs on the AMD side.

* [Pipeliner] NFC: Expose Pipeliner infrastructure for use by other target backends (triton-lang#4155)

Non-functional changes to expose
`lib/Dialect/TritonGPU/Transforms/Pipeliner` infrastructure for use by
other target backends.

* [BACKEND] Fix regression in pipeliner pre-checks. (triton-lang#4196)

During some previous refactoring we changed the logic and started
pipelining cases that had incompatible shared encodings. This was missed
because one of the lit tests had not been updated :(

* [Backend][AMD] Introduce stream pipeliner v2 (triton-lang#4148)

This PR first promotes common infrastructure in
`lib/Dialect/TritonGPU/Transforms/Pipeliner` to enable inclusion by
other target backends. No other changes have been made to the
lib/include directories.

Second, the `tritonamdgpu-stream-pipeline` pass has been completely
revamped based on code from
`lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp`
using similar scheduling passes to compute multi-stage pipelines. Some
of this code could be consolidated further in the CoarseSchedule class
(or perhaps a derived LoopScheduler class). This modulo scheduler
collects `tt.load` ops and generates local_storage and management ops
for the ramp-up stage (stage-0), then collects all uses of the loads
for stage-1. Multi-buffering is introduced when num_stages exceeds the
max distance between load and uses. Buffering may be in shared memory
for `tt.dot` uses or in registers for all other uses. The current
implementation does not support peeling the last iteration if the loop
is dynamic.
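The scheduling decision described above can be sketched as a toy Python model (hypothetical helper names, not the actual pass):

```python
def plan_pipeline(num_stages: int, load_use_distance: int) -> dict:
    """Toy model of the modulo scheduler's decision, per the PR
    description: loads and their local-storage management ops go in the
    ramp-up stage (stage 0), and all uses of the loads go in stage 1."""
    schedule = {"load_stage": 0, "use_stage": 1}
    # Multi-buffering is introduced when num_stages exceeds the max
    # distance between a load and its uses.
    schedule["multi_buffered"] = num_stages > load_use_distance
    return schedule
```

For example, with `num_stages=3` and a load-to-use distance of 1, the sketch reports that multi-buffering is needed, while a single-stage loop would not trigger it.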

Lastly, the `tritonamdgpu-reorder-instructions` pass has been enhanced
to move `tt.load` ops as early as possible in its region. This includes
loop bodies as well as func entry blocks for the case of ramp-up. This
pass will also move `triton_gpu.local_store` ops as early as possible if
their source is not directly from a `tt.load`. In this way, a
multi-buffered pipeline will overlap in this order:
1. tt.load buffer+2
2. ttg.local_store buffer+1
3. tt.dot buffer+0
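The buffer rotation implied by this ordering can be sketched as follows (illustrative Python only, assuming three buffers for a buffer+2 load distance):

```python
NUM_BUFFERS = 3  # assumption for illustration: load runs 2 iterations ahead

def iteration_slots(i: int) -> dict:
    """For loop iteration i, which buffer slot each overlapped op
    touches: the load writes the furthest-ahead slot, the local_store
    fills the next one, and the dot consumes the current one."""
    return {
        "tt.load":         (i + 2) % NUM_BUFFERS,
        "ttg.local_store": (i + 1) % NUM_BUFFERS,
        "tt.dot":          i % NUM_BUFFERS,
    }
```

Each op stays one slot ahead of its consumer, so no iteration reads a buffer that is still being written.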

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>

* [AMD] Prefetch loads and independent local_stores (triton-lang#4429)

This pass is enhanced to move tt.loads as early as possible. This
enables buffering in registers for global loads while computing previous
tiles (stream-pipelining), but may increase register pressure.

If ttg.local_stores are independent of loads in the loop (i.e. double
buffering in shared memory), then this pass will also move those early
to overlap with global loads and compute.
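As a toy illustration of the reordering (hypothetical Python, not the actual pass, and ignoring the SSA and memory-dependence checks the real pass must perform), hoisting the loads simply moves them ahead of the compute in a straight-line block:

```python
def hoist_loads(ops: list) -> list:
    """Toy reordering: move load ops as early as possible within a
    block, preserving the relative order of all other ops. The real
    pass also verifies that no dependence is violated before moving."""
    loads = [op for op in ops if op.startswith("load")]
    rest = [op for op in ops if not op.startswith("load")]
    return loads + rest
```

This mirrors the effect described above: global loads issue early so their latency overlaps with the compute of previous tiles, at the cost of holding their results in registers longer.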

* [Pipeliner] Implement dynamic loop peeling
 - enabled for tritonamdgpu-stream-pipeline

* * disabled for num_stages > 2
* updated tests

* * guard each stage of ramp-down in epilogue
* enable peeling for any num_stages

* * pipeline reg buffers

* [AMD] Fixed bug with tritonamdgpu-reorder-instructions
    - blindly moving local_loads can violate memory access order
    - also fixed a case when moving instructions to the top of a loop

* * only move ops early

* Fix in streamPipelinerV2

* Fix lit tests

* [Backend][AMD] Add temporary environment variable for pipeliner v2 (triton-lang#4430)

This commit adds a new environment variable to enable pipeliner v2. It
is expected to be temporary while we enable the new pipeliner and get
all cases covered.

Co-authored-by: SJW <swaters@amd.com>

---------

Co-authored-by: Ognjen Plavsic <plognjen@amd.com>
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
Co-authored-by: Ognjen Plavsic <ognjen.plavsic@luxoft.com>
Co-authored-by: Vinayak Gokhale <Vinayak.Gokhale@amd.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Lei Zhang <antiagainst@gmail.com>
Co-authored-by: SJW <48454132+sjw36@users.noreply.github.com>
Co-authored-by: SJW <swaters@amd.com>