Conversation

@ravil-mobile ravil-mobile commented Sep 27, 2024

[AMD] Advanced software pipelining may require fine-grained adjustments to instruction scheduling in the main `tt.dot` loop to achieve higher performance. Such adjustments require detailed information about the number of issued `v_mfma`, `ds_read`, `ds_write`, and `global_load` instructions. This PR extends the Triton AMDGPU backend by adding instruction counting during `TritonAMDGPUToLLVM` pass execution.

An example of instruction counting and instruction scheduling is demonstrated in the createCKV3Schedule method which implements the CK's V3 software pipelining.


@antiagainst antiagainst left a comment


I think overall this looks fine, but there are quite a few places we can simplify. It also needs documentation and testing.

@ravil-mobile ravil-mobile force-pushed the ravil/sched-barriers-stat branch 7 times, most recently from aece96b to 06210a4 Compare October 1, 2024 15:32
@ravil-mobile ravil-mobile force-pushed the ravil/sched-barriers-stat branch from ad5a4e1 to d32f444 Compare October 2, 2024 12:41

op->getBlock()->walk([&](amdgpu::InstructionSchedHint schedHint) {
schedHint.setNumMMAsAttr(counterAttr);
});
Collaborator


I'm wondering if this works when there are multiple tt.dot in the loop?

Contributor Author


Hi @zhanglx13,

No, it is not going to work. The multiple tt.dot support would require further investigation and extensions.

Collaborator


Do you plan to generalize the design to support multiple tt.dot?
I'm asking because the pipelineV3 or CKV3 pipeline will prefetch the whole LDS buffer, whereas the prefetchLDS pass can prefetch a partial LDS buffer. However, the prefetchLDS pass will lead to multiple tt.dot ops in the loop, each of which corresponds to one prefetched LDS sub-buffer.
The prefetchLDS pass will also need some sched_group_barrier tweaking to "move things around".

Collaborator


Yeah, I feel we may need more targeted instruction counting. The hint op is basically carrying side-channel information for the tt.dot; we can have one hint op immediately before/after a tt.dot for that tt.dot. It's a bit fragile, but fine if we insert it at the proper time. Then we may need to build different schedules for different tt.dot ops (e.g., in the main loop vs. in the epilogue), and the instruction counting needs to be more clever to figure out the different "segments".


@antiagainst antiagainst left a comment


Cool! The implementation looks better now. The major missing pieces are still documentation and testing.

let arguments = (ins
I32Attr:$numDsReadsTileA,
I32Attr:$numDsReadsTileB,
I32Attr:$numDsWritesTileA,
Collaborator


I see, thanks! You might want to put the link directly in the comment so it's easy to associate? (Right now what you have there is not a permalink.)

https://github.com/ROCm/composable_kernel/blob/de3e3b642402eac5b4a466f6a2fa5e9f022ba680/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_xdlops_v3.hpp#L160-L263



@antiagainst antiagainst changed the title [AMD] instruction counting during TritonAMDGPUToLLVM pass [AMD] Count llvm instruction during conversion for scheduling hints Oct 3, 2024
@ravil-mobile ravil-mobile force-pushed the ravil/sched-barriers-stat branch 9 times, most recently from d861c01 to cf97e35 Compare October 4, 2024 14:18
@ravil-mobile ravil-mobile force-pushed the ravil/sched-barriers-stat branch from cf97e35 to ea01f4b Compare October 4, 2024 14:24
@ravil-mobile ravil-mobile marked this pull request as ready for review October 4, 2024 16:46
@ravil-mobile ravil-mobile requested a review from ptillet as a code owner October 4, 2024 16:46

@antiagainst antiagainst left a comment


Implementation looks good now. Just need to add tests next:

  • Op tests for the new hint op
  • Conversion tests for the pass
  • etc.

@antiagainst

BTW, @ravil-mobile, when you address comments, please use separate commits; don't squash everything into one commit, otherwise reviewers are required to reread everything. Separate commits allow us to read only the delta easily. Also prefer `git merge origin/main` over force pushing; it also helps speed up code reviews. Thanks! :)

@ravil-mobile ravil-mobile force-pushed the ravil/sched-barriers-stat branch 4 times, most recently from a06f1a7 to cbbc694 Compare October 10, 2024 15:09
@ravil-mobile ravil-mobile force-pushed the ravil/sched-barriers-stat branch from cbbc694 to 23b5820 Compare October 10, 2024 15:27
@antiagainst antiagainst merged commit e87f877 into triton-lang:main Oct 13, 2024
ptillet added a commit that referenced this pull request Oct 16, 2024
ptillet added a commit that referenced this pull request Oct 16, 2024
antiagainst pushed a commit that referenced this pull request Oct 31, 2024
This commit relands #4819
with the following fixes:

* Changed to a better way to mark opIdx for loads
* Replaced the template-based `rewindUnaryOps` with regular
  for-loops. The new way is more robust and can handle other
  unary ops automatically.
* Replaced `instr.sched.barriers` with the ones from the
  `rocdl` dialect in upstream MLIR
* Extended lit tests
Luosuu pushed a commit to Luosuu/triton that referenced this pull request Nov 13, 2024
…riton-lang#4819)

Advanced software pipelining may require fine-grained adjustments
regarding instruction scheduling in the main `tt.dot` loop to achieve
higher performance. Such adjustments require detailed information
regarding the number of issued `v_mfma`, `ds_read`, `ds_write` and
`global_load`, instructions. This PR extends the Triton AMDGPU backend
by adding instruction counting during `TritonAMDGPUToLLVM` pass
execution.

An example of instruction counting and instruction scheduling is
demonstrated in the `createCKV3Schedule` method which implements the
[CK's V3 software
pipelining](https://github.com/ROCm/composable_kernel/blob/de3e3b642402eac5b4a466f6a2fa5e9f022ba680/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_xdlops_v3.hpp#L160-L263).

This change is experimental, for better GEMM performance. The design
is not final and may be subject to change in the future.
Luosuu pushed a commit to Luosuu/triton that referenced this pull request Nov 13, 2024
guacamoleo pushed a commit to guacamoleo/triton that referenced this pull request Nov 14, 2024
guacamoleo pushed a commit to guacamoleo/triton that referenced this pull request Nov 14, 2024
bertmaher pushed a commit to bertmaher/triton that referenced this pull request Dec 10, 2024
jataylo pushed a commit to jataylo/triton that referenced this pull request Dec 12, 2024
jataylo pushed a commit to jataylo/triton that referenced this pull request Dec 13, 2024
jataylo added a commit to ROCm/triton that referenced this pull request Dec 13, 2024
* [BACKEND][NVIDIA] Add Lowering for Shared-to-MMAv3-DotOp Copy (triton-lang#5009)

Allows for upcasting in DotOp encoding in RF.
This lowering path is not currently in use; pending
triton-lang#5003

(cherry picked from commit cfddb09)

* [AMD] Add initial support for scaled_dot(mxfp8, fp8) (triton-lang#4994)

This commit adds initial support for scaled_dot with
mxfp8 LHS and fp8 RHS. It supports both mfma32
and mfma16 intrinsic variants.

Right now we are missing software emulation for
`Float8E4M3FN` type, so this only enables for
`Float8E5M2`.

(cherry picked from commit 3549db8)

* [Frontend][Backend] Implement support for scale_dot(-, bf16) (triton-lang#4996)

In passing, we also improve a few other things:
- Now `scaled_dot` accepts both uint8/uint16 fp8/bf16 as inputs (before
you had to cast it to uint8, which was weird when extending it to bf16).
- Add `scaled_dot` to the docs and improve the docs overall (have not
rendered them; might need a few further tweaks)

(cherry picked from commit 23c9ec1)

* [BACKEND] Improve detection of register to register conversion (triton-lang#4991)

Specifically, it fixes problems when `srcLayout` and `dstLayout` have
different number of registers but the same number of not free registers.
We solved the problem by padding free registers to either `srcLayout` or
`dstLayout`, but this can be improved by fixing the `invertAndCompose`
function.

(cherry picked from commit 15c5e55)

* [BACKEND] Replace `isMmaToDotShortcut` with linear layout based logic (triton-lang#4951)

This PR removes the legacy `isMmaToDotShortcut` and its associated shortcut conversion.

(cherry picked from commit 1d5fdfe)

* [BACKEND]Fix DotOperand(Ampere) LinearLayoutConversion (triton-lang#5038)

We also clean a bit `TritonGPU/IR/Dialect.cpp` using some auxiliary
functions to make the intentions a bit clearer.

We add a few asserts in the `LinearLayoutConversion` to make sure it's
clear why we do certain things here and there.

We also kill `getCvtOrder`, as it was not used anywhere

(cherry picked from commit 56584c4)

* [BACKEND] Fix uses of getOrder(DotOperand(Nvidia) and MMA(Nvidia)) (triton-lang#5055)

We use `getOrder` very liberally throughout the codebase, when we really
meant to use `getThreadOrder`. This is an issue when the input layout is a
`DotOperand(mma(opIdx=1))`, where the thread order and the matrix order
are opposite.

Found this to be an issue when a PR changed the `getOrder` of
`DotOperand(Hopper)` to an incorrect one and CI still passed! The issue
here is that the LLVM lowering for wgmma and the LinearLayout do not
use `getOrder`, but many other subsystems do, and many
heuristics would get an incorrect order and potentially be
disabled.

This is particularly problematic for `DotOperand(opIdx=1)` in nvidia
hardware, as `getThreadOrder` and `getOrder` are different!

While doing so we:
- Audit most (all?) the calls to `getOrder(dotOperand)`. It turns out
that most of them really meant `getThreadOrder`
- Fix the ordering methods of `SliceEncodingAttr` to be consistent
- Move the implementation of `getWarpOrder` to the Attr classes, because
of OOP

The test strategy was to add `llvm::report_fatal_error("Testing");`
within `getOrder(nvidiaMma)` and `getOrder(DotOperand(nvidiaMma))` and
triaging all errors that were raised in CI.

(cherry picked from commit 38a11b8)

* [AMD] Reland instruction scheduling hint changes (triton-lang#4940)

This commit relands triton-lang#4819
with the following fixes:

* Changed to a better way to mark opIdx for loads
* Replaced the template-based `rewindUnaryOps` with regular
  for-loops. The new way is more robust and can handle other
  unary ops automatically.
* Replaced `instr.sched.barriers` with the ones from the
  `rocdl` dialect in upstream MLIR
* Extended lit tests

(cherry picked from commit ee5876c)

* [AMD] Enable scaled_dot(-, bf16) (triton-lang#5029)

(cherry picked from commit f062540)

* [AMD] Add support for scaled_dot(mxfp4, -) (triton-lang#5034)

This commit adds support for mxfp4 typed A tensor
for scaled dot in the AMD backend.

We moved the `convertMxfp4x2ToBf16x2` impl
from NVIDIA side to a common path to reuse.

(cherry picked from commit edc5c5c)

* [BACKEND] Minor Bugfixes for SharedToDotOperand MMAv3 (triton-lang#5030)

Two bugfixes following triton-lang#5009.

- When `BLOCK_M=64` and `num_warps > 4`, the order of warps for
DotOpEncoded tensor should be M-major instead of N-major, since WGMMA
expects the 4 warps in each warp group to be stacked along the M
dimension.
- Should use `mmaBitwidth` instead of `bitwidth` when calculating
`numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad
rebase.

@lezcano I encountered these bugs when attempting to locally test the
[DotOp hoisting PR](triton-lang#5003)
after rebasing (they normally would be caught by `test_core.py` but that
path was not yet enabled in the last PR). With these fixes added, I was
able to successfully validate against pytorch.

(cherry picked from commit e82dfd9)
(cherry picked from commit 5287a68)

* [BACKEND] Get rid of unpack/pack I32  (triton-lang#5044)

- Removed functions related to unpacking and packing I32 values.
- Updated utilities to handle conversion of mxfp4 values without
packing/unpacking I32.
- Move the register value ordering logic from the element-wise operation
lowering to the dot operation lowering.
- Use linear layout to handle conversions between almost all distributed
layouts.
- Clean up data loading and mma computation involving `repN`, `repK`,
and `repM`.

(cherry picked from commit 1cf7b1b)
(cherry picked from commit 376fe7e)

* Consolidate `getOrder` as "element order" and implement `getRepOrder` for general and NVIDIA layouts (triton-lang#5089)

This partially reverts commit 38a11b8.
Supersedes triton-lang#5085

It also documents that we are implicitly choosing a way to tile a
full tensor depending on the layout. See
triton-lang#5085 (comment)

(cherry picked from commit 57643b3)
(cherry picked from commit ffb2032)

---------

Co-authored-by: Gary Geng <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Mario Lezcano Casado <[email protected]>
Co-authored-by: Keren Zhou <[email protected]>
Co-authored-by: ravil-mobile <[email protected]>
jataylo pushed a commit to ROCm/triton that referenced this pull request Jan 28, 2025
This commit relands triton-lang#4819
with the following fixes:

* Changed to a better way to mark opIdx for loads
* Replaced temlate-based `rewindUnaryOps` to use regular
  for-loops. The new way is more robust and can handle other
  unary ops automatically.
* Replaced `instr.sched.barriers` using the ones from
  `rocdl` dialect from the MLIR upstream
* Extended lit tests

(cherry picked from commit ee5876c)
(cherry picked from commit 8dd9226)
jataylo pushed a commit to jataylo/triton that referenced this pull request Feb 10, 2025
This commit relands triton-lang#4819
with the following fixes:

* Changed to a better way to mark opIdx for loads
* Replaced temlate-based `rewindUnaryOps` to use regular
  for-loops. The new way is more robust and can handle other
  unary ops automatically.
* Replaced `instr.sched.barriers` using the ones from
  `rocdl` dialect from the MLIR upstream
* Extended lit tests

(cherry picked from commit ee5876c)
(cherry picked from commit 8dd9226)
jataylo pushed a commit to jataylo/triton that referenced this pull request Mar 27, 2025
This commit relands triton-lang#4819
with the following fixes:

* Changed to a better way to mark opIdx for loads
* Replaced temlate-based `rewindUnaryOps` to use regular
  for-loops. The new way is more robust and can handle other
  unary ops automatically.
* Replaced `instr.sched.barriers` using the ones from
  `rocdl` dialect from the MLIR upstream
* Extended lit tests

(cherry picked from commit ee5876c)
(cherry picked from commit 8dd9226)
(cherry picked from commit aed3efc)
jataylo added a commit to ROCm/triton that referenced this pull request Mar 27, 2025
* [AMD] Emit vectorized 16-bit float LLVM atomic ops (triton-lang#4925)

In the case of 16-bit float operands for `tt::AtomicRMWOp`, construct
only one `LLVM::AtomicRMWOp` but use a vector of elements.
This approach allows generating packed intrinsics and processing 2
elements at once.
Added a lit test for the f16 vectorized case.

(cherry picked from commit 78c8054)
(cherry picked from commit 4d70942)
(cherry picked from commit 2f8aacc)

* [AMD] Restructure ReorderInstructions pass (triton-lang#4998)

(cherry picked from commit 86a2ac7)
(cherry picked from commit 4c7d56e)
(cherry picked from commit 0529343)

* [AMD] Support warp-level reduction with DPP (triton-lang#5019)

This commit adds support for warp-level reduction
with DPP instructions, which can improve performance.

See https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/

(cherry picked from commit 21119e3)
(cherry picked from commit d0142d3)
(cherry picked from commit 9f2b69b)

* [AMD] Add missing dependency to TritonAMDGPUIR (triton-lang#5053)

TritonAMDGPUTransforms now depends on it.

(cherry picked from commit 0b443ce)
(cherry picked from commit 37cec47)
(cherry picked from commit 1ab334d)


* [AMD] Reland sinking the 2nd tt.load after local_load's (triton-lang#4935)

This PR adds more restrictions on when we should apply
the sched-load optimizations and un-reverts
triton-lang#4823.

We will only apply the optimization when all of the following is
satisfied:
1. pureMatmulProblem, i.e. 1 `tt.dot` in the main loop
2. two `tt.load`s in the main loop
3. 2nd `tt.load` is ahead of the `tt.dot`
4. 1st user of 2nd `tt.load` is after the `tt.dot`
5. tile size is large enough, i.e. nonKDim >= 128 and kDim >= 64

(cherry picked from commit 4f6f768)
(cherry picked from commit f6053a3)
(cherry picked from commit 72d1575)

---------

Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>
Co-authored-by: Lixun Zhang <[email protected]>
Co-authored-by: Gary Geng <[email protected]>
Co-authored-by: Mario Lezcano Casado <[email protected]>
Co-authored-by: Keren Zhou <[email protected]>
Co-authored-by: ravil-mobile <[email protected]>
