Support ada float8 mma based on 2.2 #1

Closed
pingzhuu wants to merge 1512 commits into release/2.2.x from triton-2.2-fp8

Conversation

pingzhuu commented Apr 4, 2024

No description provided.

ptillet and others added 30 commits September 11, 2023 20:54
…on-lang#2275)

This fixes a few bugs related to scalar tensors:
- `tl.full([], fill_value, dtype)` fails with `TypeError('0d block_type
is forbidden')`
- `scalar[None]` fails with `TypeError("'constexpr' object is not
iterable")`
- `scalar[None, None]` fails with `AttributeError("'dtype' object has no
attribute 'shape'")`
- `scalar.shape` returns `[1]` instead of 0-dim `[]`
- Also related, `tl.zeros_like(scalar)` returns a 1d tensor instead of
another scalar
…e… (triton-lang#2283)

…rf on problems that need few blocks.

Constrain the number of launched blocks to exactly what a persistent
warp-specialized kernel needs. This is useful when a problem needs very
few blocks,
e.g. MxNxK=800x800x60000, f16_f16_f32, block size=128x128x64,
non-split-k. Experiments show it can achieve a ~16% speedup.
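
As a rough illustration of the idea (not the code from this commit), the launch size of a persistent kernel can be capped at the number of output tiles. The helper names and the SM count of 132 are assumptions made for this sketch.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division."""
    return (a + b - 1) // b


def persistent_grid_size(m: int, n: int, block_m: int, block_n: int, num_sms: int) -> int:
    """Hypothetical helper: launch at most one persistent block per output tile."""
    num_tiles = cdiv(m, block_m) * cdiv(n, block_n)
    # Launching more persistent blocks than tiles would leave blocks with no work.
    return min(num_sms, num_tiles)


# Problem from the commit message: M=N=800 with 128x128 tiles gives 7 * 7 = 49 tiles.
# num_sms=132 is an assumed device size, not a value taken from the commit.
print(persistent_grid_size(800, 800, 128, 128, num_sms=132))  # 49
```

With a larger problem the cap falls back to the SM count, so the usual persistent-kernel launch is unchanged.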
…lang#2285)

Add infrastructure to be able to add and test custom LLVM passes in the
backend. This will allow us to apply some low-level optimizations and
cleanup on LLVM IR.
Add a first pass that breaks up phis of structs created by lowering to
LLVM. Those can often pessimise the optimizer, as they block
optimizations going through phi nodes.
…T->TTGPU pass (triton-lang#2284)

This is needed for forward-compatibility with MLIR that now has
"inherent" and "discardable" attributes
(https://mlir.llvm.org/OpenMeetings/2023-02-09-Properties.pdf) and the
ExternElementwiseOp attrs do not propagate with the current
`addNamedAttrs` implementation.
Try to move broadcast ops after arithmetic and convert ops in order to
reduce the amount of work needed.
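
A toy sketch of why this reordering saves work, using plain Python lists rather than Triton IR. All function names are made up for the illustration: an elementwise operation applied after a broadcast touches every expanded element, while applying it before the broadcast touches only the original ones and gives the same result.

```python
def broadcast(column, width):
    """Expand a column (list of scalars) into rows holding `width` copies each."""
    return [[v] * width for v in column]


def broadcast_then_scale(column, width, factor):
    """Broadcast first: the multiply runs once per expanded element."""
    wide = broadcast(column, width)
    result = [[v * factor for v in row] for row in wide]
    return result, len(column) * width  # (result, number of multiplies)


def scale_then_broadcast(column, width, factor):
    """Multiply first: the multiply runs once per original element."""
    narrow = [v * factor for v in column]
    return broadcast(narrow, width), len(column)


before, work_before = broadcast_then_scale([1, 2, 3, 4], 8, 10)
after, work_after = scale_then_broadcast([1, 2, 3, 4], 8, 10)
print(before == after, work_before, work_after)  # True 32 4
```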
A low-tech but very useful way to override kernels on the fly. This can be
used for debugging functionality or performance problems: it lets the user
dump, modify, and feed IR back into the JIT compiler.
… correct swizzling code (triton-lang#2180)

fix bug triton-lang#1937

Co-authored-by: Philippe Tillet <phil@openai.com>
- Support memory space for pointers (e.g., `!tt.ptr<f32, 1>`).
- Support parsing function attribute, though not used yet.
…iton-lang#2307)

This fixes a few problems that were preventing me from using the lld linker.
Triton conf registration closed.
…lang#2300)

Change the dot to allow taking an initial accumulator and add a flag
that will allow the compiler to accumulate in a lower precision than the
output type.
On Hopper this flag is on by default, which allows accumulating with
lower precision.
This only affects Hopper fp8 dot.
…g#2312)

Move the optimization to remove phi of struct later in the optimization
pipeline to avoid interfering with CFG optimization.
Otherwise, these files show up in `git status` under
python/triton/third_party/cuda/bin/.
On my machine, when I try to `pip install cmake` outside a virtualenv,
it gets mad at me and tells me to use apt.  Which doesn't quite work for
some reason.  Anyway maybe this is simple to Python people, but perhaps
worth mentioning.  Especially because we have `.venv` in gitignore
already.
reverts triton-lang#2310 as recent changes to Triton-IR have broken third-party backends
Previously on matmul, if inputs are int8, output was also int8.
This commit fixes the overflow problem with int32 output.
triton-lang#2296
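
To see why an int8 output is a problem, here is a small pure-Python sketch of a dot product whose accumulator is narrowed to 8 bits versus one kept at full width. It assumes two's-complement wraparound for the narrow case; what the old Triton lowering actually produced on overflow is not stated in the commit, so treat this purely as an illustration.

```python
def wrap_int8(x: int) -> int:
    """Reduce an integer to the signed 8-bit range [-128, 127] (wraparound assumed)."""
    return (x + 128) % 256 - 128


def dot_int8_out(a, b):
    """Dot product whose running sum is narrowed to int8 after every step."""
    acc = 0
    for x, y in zip(a, b):
        acc = wrap_int8(acc + x * y)
    return acc


def dot_int32_out(a, b):
    """Dot product with a wide accumulator (Python ints stand in for int32 here)."""
    return sum(x * y for x, y in zip(a, b))


a, b = [100, 100], [2, 2]
print(dot_int8_out(a, b))   # -112
print(dot_int32_out(a, b))  # 400
```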
This is a new interpreter mode that shares semantic analysis with the
JIT'ed codepath and that the Triton core team is committed to maintaining.
manman-ren and others added 28 commits November 21, 2023 11:12
… a Select Op (triton-lang#2678)

Summary: Fix triton-lang#2655. When there is
a Select Op after a Load Op, the types of the operands of the Select Op
will differ, so we can't use the same newCvtTy for all the operands
when creating ConvertLayoutOp.

---------

Co-authored-by: Manman Ren <mren@meta.com>
Co-authored-by: Manman Ren <mren@fb.com>
…#2664)

`AllocMBarrierOp` and `InsertSliceAsyncV2Op` were ignored in membar, which may cause data races or extra barriers.
Generalize the view op into a reshape op with an attribute deciding
whether re-ordering elements is allowed.
When re-ordering elements is not allowed, we currently only handle trivial
block layouts and make sure none of the passes generates a different
layout.
…riton-lang#2698)

In this PR we are deduplicating the llvm-hash value (currently it's in
the llvm-hash.txt file but also under python/setup.py). The new location
is cmake/llvm-hash.txt.

Both changes were suggested by jlebar in
triton-lang#2570.

Test: Pushed the same change to the llvm-head branch and triggered an llvm-build
(https://github.com/openai/triton/actions/runs/6970868398); it passed.
Co-authored-by: dongdongl <dongdongl@nvidia.com>
Update name from create_view to create_reshape
…c()` calls (triton-lang#2703)

This PR replaces the `py::exec()` calls with native Python C API. This
is much safer and eliminates the side effects on the Python side (e.g.,
assignment for function `warnings.showwarning`).

Ref:

- triton-lang#2155
We enable TMA by checking `max_divisibility` rather than `divisibility`.
16 is now supported for `max_divisibility`.
1. Fix a segmentation fault when the condition is `i1`.
2. Fix the divisibility calculation when contiguity has been changed
(different from cond, lhs, or rhs).
Suppose we have a loop which yields an operand which lives outside the
loop, i.e. is loop-invariant.  Moreover, suppose the same operand in
iter_args is never used.

    %init = ...
    %x = ...
    %y = for iter_args(%unused = %init) {
      yield %x
    }
    return %y

Previously, we would declare that operand 0 of the yield is "dead" and
replace it with `yield %init`.  This is obviously not correct.

The way this happened was:

 - ForOpDeadArgElimination iterates over the forOp's results, in this
   case just [%y].
 - We notice that %y is used, so it's not dead.  We call markLive(%x),
   where %x is the yielded value that corresponds to %y.
 - %x is not defined in the loop body, so markLive skips it.
 - Therefore at the end of the function, operand 0 of the yield is
   considered dead.  We change it to `yield %init`, so it can be removed
   from the loop entirely in a later pass.

In this patch, we add a special case that says that a `yield` which
returns a value from outside the loop isn't dead.

Fixes triton-lang#2672
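
The failure mode described above can be reproduced in a few lines of Python. This is a deliberately simplified model of the liveness walk, not the actual C++ pass: the `ForLoop` class, its fields, and the `keep_outside_values` switch are all invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class ForLoop:
    iter_arg_names: list  # block arguments of the loop body, one per result
    yields: list          # yielded value names, one per result
    body_defs: dict       # value defined in the body -> names of its operands
    used_results: set     # indices of loop results that are used afterwards


def dead_yield_operands(loop: ForLoop, keep_outside_values: bool) -> list:
    """Return the indices of yield operands this toy analysis considers dead."""
    live, worklist = set(), []

    def mark_live(value):
        # Like the markLive described above: values defined outside the loop are skipped.
        inside = value in loop.body_defs or value in loop.iter_arg_names
        if inside and value not in live:
            live.add(value)
            worklist.append(value)

    for i in loop.used_results:
        mark_live(loop.yields[i])
    while worklist:
        value = worklist.pop()
        if value in loop.iter_arg_names:
            # A live iter_arg keeps its matching yield operand alive.
            mark_live(loop.yields[loop.iter_arg_names.index(value)])
        else:
            for operand in loop.body_defs[value]:
                mark_live(operand)

    dead = []
    for i, value in enumerate(loop.yields):
        outside = value not in loop.body_defs and value not in loop.iter_arg_names
        if keep_outside_values and outside:
            continue  # the special case: a yield of a loop-invariant value is not dead
        if value not in live:
            dead.append(i)
    return dead


# The loop from the example: iter_args(%unused = %init), body is just `yield %x`.
loop = ForLoop(iter_arg_names=["%unused"], yields=["%x"], body_defs={}, used_results={0})
print(dead_yield_operands(loop, keep_outside_values=False))  # [0]  (the bug)
print(dead_yield_operands(loop, keep_outside_values=True))   # []   (with the special case)
```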
This is integrating @apgoucher's work which implements a bitonic sort
function using reductions to slice tensors and selects to merge.

This version currently only supports sorting the most inner dimension of
a tensor. This will be extended in future changes.

Co-authored-by: apgoucher <apgoucher@openai.com>
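
For readers unfamiliar with the algorithm, the following is a plain-Python bitonic sorting network over a list. It is only a reference sketch of the compare-and-swap pattern; it does not use the reductions and selects that the Triton implementation is described as using, and the power-of-two length restriction is a property of this sketch.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two using a bitonic network."""
    n = len(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:      # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # Blocks alternate between ascending and descending order.
                    ascending = (i & k) == 0
                    if (out[i] > out[partner]) == ascending:
                        out[i], out[partner] = out[partner], out[i]
            j //= 2
        k *= 2
    return out


print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Every comparison at a given `(k, j)` step is independent of the others, which is what makes the network a natural fit for data-parallel hardware.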
…ent kernels to improve performance of warp specialized kernels with NUM_CTAS>1 (triton-lang#2638)

- Improve the logic for determining splitM and splitN
- Add necessary support for persistent kernels with NUM_CTAS>1
- Add a canonical warp id query operation to help CSE and LICM before
the nvgpu2llvm pass.
- Add a warning for the fallback of warp specialized kernels
…e TritonGPU layout attributes. (triton-lang#2682)

Refactor the TritonGPU dialect utils to call the APIs of the TritonGPU
attributes and its inheritance.
Include fixes for several LLVM updates, including:
- Add elem type to SharedMemoryObject.
- Change all pointers to opaque.
- Fixes for llvm update 3cd2a0bc

---------

Co-authored-by: Ashay Rane <ashay@users.noreply.github.com>
Co-authored-by: Goran Flegar <gflegar@google.com>
Co-authored-by: khasanovaa <khasanovaaliya19@gmail.com>
Co-authored-by: Tori Baker <vwbaker@google.com>
Co-authored-by: Aliia Khasanova <40315403+khasanovaa@users.noreply.github.com>
It's not an iterator.  :)

Also minor fix to grammar in docs.
There is a corner case that may cause some uses of the extf source to be
incorrectly replaced when hoisting the extf op.
Instead, we should only remap the use of extf that we are replacing with the new
convert layout op.
This affects how the docs are rendered into HTML.
triton-lang#2727)

Backward rematerialization could generate wrong IR when rematerializing
ops with multiple results as not all the results may have had their
encoding updated.
…onHandle for CUDA version compatibility (triton-lang#2771)" (triton-lang#2789)

This is needed for CUDA 11 support, which we'd like to have in the
PyTorch 2.2 release.

Original commit message:

In case CUDA 11 drivers are still used on some systems, we shouldn't
call TMA and block cluster related functions directly. Instead, we can
dynamically look up the handles to avoid compatibility issues.

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
…ps (triton-lang#2870)

By reading the current clock, our analytical calculations can vary while
we're evaluating different configs. It turns out the choice of config is
very sensitive to the clock, such that a slight throttling can make us
reject very good configs, in favor of very bad ones.

A reproducer can be found here:
https://gist.github.com/bertmaher/8ff5e9631666846fff55d81326cacb4d

```
$ python thermal_throttle.py
chosen config BLOCK_M: 128, BLOCK_N: 256, BLOCK_K: 32, SPLIT_K: 1, num_warps: 8, num_ctas: 1, num_stages: 3, enable_warp_specialization: False, enable_persistent: False
tflops/s: 107.92460196062149

$ python thermal_throttle.py --preheat
chosen config BLOCK_M: 32, BLOCK_N: 32, BLOCK_K: 32, SPLIT_K: 1, num_warps: 2, num_ctas: 1, num_stages: 6, enable_warp_specialization: False, enable_persistent: False
tflops/s: 39.29629633970286
```

Cherry-pick of triton-lang#2801 into release/2.2.x
cherry-pick into release/2.2.x branch.

If `libcuda.so` cannot be found using `ldconfig -p`, check whether the
`LD_LIBRARY_PATH` environment variable is defined and search for it
there.
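
A minimal sketch of the fallback step only, assuming the `ldconfig` lookup has already come back empty. The function name and its parameters are hypothetical and are not taken from the Triton sources.

```python
import os


def libcuda_dirs_from_ld_library_path(env=None, libname="libcuda.so"):
    """Hypothetical helper: list LD_LIBRARY_PATH directories that contain `libname`."""
    env = os.environ if env is None else env
    ld_library_path = env.get("LD_LIBRARY_PATH", "")
    dirs = []
    for directory in ld_library_path.split(":"):
        # Skip empty entries such as those produced by a leading or trailing colon.
        if directory and os.path.exists(os.path.join(directory, libname)):
            dirs.append(directory)
    return dirs


# The result depends on the machine this runs on.
print(libcuda_dirs_from_ld_library_path())
```

Passing `env` explicitly keeps the helper easy to test without touching the real process environment.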

Test plan:

https://colab.research.google.com/drive/16Kd88j-nFS4iMI-UfJS5rukH42L6UuyP?usp=sharing

Fixes triton-lang#2507

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
…riton-lang#2934) (triton-lang#3379)

Adds a redis-based one as an initial implementation, but it should be
straightforward to extend with more impls.

Co-authored-by: andrewjcg <andrewjcg@gmail.com>
…nagers (triton-lang#2934) (triton-lang#3379)"

This reverts commit be445fb.

The cherry-pick should be applied to 2.3.x instead.