Support ada float8 mma based on 2.2 by pingzhuu · Pull Request #1 · siliconflow/triton

pingzhuu · 2024-04-04T10:38:38Z

No description provided.

…on-lang#2275) This fixes a few bugs related to scalar tensors:

- `tl.full([], fill_value, dtype)` fails with `TypeError('0d block_type is forbidden')`
- `scalar[None]` fails with `TypeError("'constexpr' object is not iterable")`
- `scalar[None, None]` fails with `AttributeError("'dtype' object has no attribute 'shape'")`
- `scalar.shape` returns `[1]` instead of 0-dim `[]`
- Also related, `tl.zeros_like(scalar)` returns a 1d tensor instead of another scalar
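The shape behaviour these fixes aim for can be sketched in plain Python. This is only a model of NumPy-style indexing rules, assuming a scalar has shape `[]` and each `None` adds an axis of size 1; `index_shape` is a hypothetical helper, not part of Triton.

```python
def index_shape(shape, index):
    """Shape after indexing with a tuple made of None (new axis) and ':'."""
    if not isinstance(index, tuple):
        index = (index,)
    out, dims = [], iter(shape)
    for item in index:
        if item is None:
            out.append(1)           # None inserts a new axis of size 1
        elif item == slice(None):
            out.append(next(dims))  # ':' keeps the next existing axis
        else:
            raise TypeError("only None and ':' are modelled here")
    out.extend(dims)                # untouched trailing axes are kept
    return out

scalar_shape = []                                # a 0-dim scalar
print(index_shape(scalar_shape, None))           # [1]
print(index_shape(scalar_shape, (None, None)))   # [1, 1]
```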

…e… (triton-lang#2283) …rf on problems that need few blocks. Constrain the number of launched blocks to exactly what a persistent warp-specialized kernel needs. This is useful when a problem needs very few blocks, e.g. MxNxK=800x800x60000, f16_f16_f32, block size=128x128x64, non-split-k. Experiments show it can achieve a ~16% speedup.
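A rough sketch of the arithmetic, assuming a persistent kernel would otherwise launch one block per SM. The function name and the SM count of 132 are illustrative assumptions, not taken from the commit.

```python
import math

def persistent_grid_size(m, n, block_m, block_n, num_sms):
    """Blocks to launch: never more than there are output tiles."""
    tiles = math.ceil(m / block_m) * math.ceil(n / block_n)
    return min(num_sms, tiles)

# Problem from the commit message: 800x800 output with 128x128 tiles.
# num_sms=132 is an assumed value used only for illustration.
print(persistent_grid_size(800, 800, 128, 128, num_sms=132))  # 49
```

With 7 tiles along each dimension there are only 49 tiles of work, so launching more than 49 blocks would leave the extra blocks idle.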

…lang#2285) Add infrastructure to add and test custom LLVM passes in the backend. This will allow us to apply some low-level optimizations and cleanups on LLVM IR. Add a first pass that breaks up phis of structs created by the lowering to LLVM. Those can often pessimise the optimizer, as they block optimizations going through phi nodes.

…T->TTGPU pass (triton-lang#2284) This is needed for forward-compatibility with MLIR that now has "inherent" and "discardable" attributes (https://mlir.llvm.org/OpenMeetings/2023-02-09-Properties.pdf) and the ExternElementwiseOp attrs do not propagate with the current `addNamedAttrs` implementation.

…lang#2300) Change the dot to allow taking an initial accumulator, and add a flag that allows the compiler to accumulate in a lower precision than the output type. On Hopper this flag is on by default, which allows accumulating with lower precision. This only affects the Hopper fp8 dot.

On my machine, when I try to `pip install cmake` outside a virtualenv, it gets mad at me and tells me to use apt. Which doesn't quite work for some reason. Anyway maybe this is simple to Python people, but perhaps worth mentioning. Especially because we have `.venv` in gitignore already.

… a Select Op (triton-lang#2678)

Summary: Fix triton-lang#2655

When there is a Select Op after a Load Op, the operands of the Select Op have different types, so we can't use the same newCvtTy for all the operands when creating ConvertLayoutOp.

---------

Co-authored-by: Manman Ren <mren@meta.com>
Co-authored-by: Manman Ren <mren@fb.com>

Generalize the view op into a reshape op with an attribute deciding whether re-ordering elements is allowed. When re-ordering elements is not allowed, we currently only handle trivial block layout and make sure none of the passes generate a different layout.

…riton-lang#2698) In this PR we are deduplicating the llvm-hash value (currently it's in the llvm-hash.txt file but also under python/setup.py). The new location is cmake/llvm-hash.txt. Both changes were suggested by jlebar in triton-lang#2570.

Test: pushed the same change to the llvm-head branch and triggered an llvm-build (https://github.com/openai/triton/actions/runs/6970868398); it passed.

…c()` calls (triton-lang#2703) This PR replaces the `py::exec()` calls with native Python C API. This is much safer and eliminates the side effects on the Python side (e.g., assignment for function `warnings.showwarning`). Ref: - triton-lang#2155

Suppose we have a loop which yields an operand which lives outside the loop, i.e. is loop-invariant. Moreover, suppose the same operand in iter_args is never used.

    %init = ...
    %x = ...
    %y = for iter_args(%unused = %init) {
      yield %x
    }
    return %y

Previously, we would declare that operand 0 of the yield is "dead" and replace it with `yield %init`. This is obviously not correct. The way this happened was:

- ForOpDeadArgElimination iterates over the forOp's results, in this case just [%y].
- We notice that %y is used, so it's not dead. We call markLive(%x), where %x is the yielded value that corresponds to %y.
- %x is not defined in the loop body, so markLive skips it.
- Therefore at the end of the function, operand 0 of the yield is considered dead. We change it to `yield %init`, so it can be removed from the loop entirely in a later pass.

In this patch, we add a special case that says that a `yield` which returns a value from outside the loop isn't dead.

Fixes triton-lang#2672
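The failure mode and the special case can be modelled in a few lines of Python. This is a simplified sketch of the liveness reasoning described above, not the actual C++ pass; the function and parameter names are hypothetical.

```python
def dead_yield_operands(yield_operands, used_results, defined_in_loop,
                        keep_outside_values):
    """Return indices of yield operands that would be considered dead.

    yield_operands: names yielded by the loop body, one per loop result.
    used_results: indices of loop results that have users after the loop.
    defined_in_loop: names of values produced inside the loop body.
    keep_outside_values: apply the special case described in the patch.
    """
    live = set()
    for i in used_results:
        value = yield_operands[i]
        if value in defined_in_loop:
            live.add(value)   # the marking step only tracks ops in the body
    dead = []
    for i, value in enumerate(yield_operands):
        if value in live:
            continue
        if keep_outside_values and value not in defined_in_loop:
            continue          # a value from outside the loop is not dead
        dead.append(i)
    return dead

# The loop from the example yields %x, which is defined outside the body.
print(dead_yield_operands(["%x"], {0}, set(), keep_outside_values=False))  # [0]
print(dead_yield_operands(["%x"], {0}, set(), keep_outside_values=True))   # []
```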

@apgoucher

This integrates @apgoucher's work, which implements a bitonic sort function using reductions to slice tensors and selects to merge. This version currently only supports sorting the innermost dimension of a tensor. This will be extended in future changes.

Co-authored-by: apgoucher <apgoucher@openai.com>
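For readers unfamiliar with bitonic sort, here is the classic compare-exchange network in plain Python. It is a sketch of the general algorithm only, assuming a power-of-two length; it does not reproduce the reduction-and-select formulation used in the Triton implementation.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network."""
    data = list(values)
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:      # compare-exchange distance within a merge step
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2
    return data

print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Every compare-exchange within one `j` step is independent of the others, which is what makes the network attractive for data-parallel hardware.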

…ent kernels to improve performance of warp specialized kernels with NUM_CTAS>1 (triton-lang#2638)

- Improve the logic for determining splitM&splitN
- Add necessary support for persistent kernels with NUM_CTAS>1
- Add a canonical warp id query operation to help CSE and LICM before the nvgpu2llvm pass
- Add warning info for fallback of warp specialized kernels

Include fixes for different LLVM updates, including:

- Add elem type to SharedMemoryObject.
- Change all pointers to opaque.
- Fixes for llvm update 3cd2a0bc

---------

Co-authored-by: Ashay Rane <ashay@users.noreply.github.com>
Co-authored-by: Goran Flegar <gflegar@google.com>
Co-authored-by: khasanovaa <khasanovaaliya19@gmail.com>
Co-authored-by: Tori Baker <vwbaker@google.com>
Co-authored-by: Aliia Khasanova <40315403+khasanovaa@users.noreply.github.com>

…onHandle for CUDA version compatibility (triton-lang#2771)" (triton-lang#2789) This is needed for CUDA 11 support, which we'd like to have in the PyTorch 2.2 release. Original commit message: In case cuda 11 drivers are still used on some systems, we shouldn't call TMA and block cluster related functions directly. Instead, we can dynamically lookup the handles to avoid compatibility issues. Co-authored-by: Keren Zhou <kerenzhou@openai.com>

…ps (triton-lang#2870) By reading the current clock, our analytical calculations can vary while we're evaluating different configs. It turns out the choice of config is very sensitive to the clock, such that a slight throttling can make us reject very good configs in favor of very bad ones.

A reproducer can be found here: https://gist.github.com/bertmaher/8ff5e9631666846fff55d81326cacb4d

```
$ python thermal_throttle.py
chosen config BLOCK_M: 128, BLOCK_N: 256, BLOCK_K: 32, SPLIT_K: 1, num_warps: 8, num_ctas: 1, num_stages: 3, enable_warp_specialization: False, enable_persistent: False
tflops/s: 107.92460196062149
$ python thermal_throttle.py --preheat
chosen config BLOCK_M: 32, BLOCK_N: 32, BLOCK_K: 32, SPLIT_K: 1, num_warps: 2, num_ctas: 1, num_stages: 6, enable_warp_specialization: False, enable_persistent: False
tflops/s: 39.29629633970286
```

Cherry-pick of triton-lang#2801 into release/2.2.x
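To see why a clock-dependent estimate can flip the chosen config, consider a toy roofline-style model. Everything here is an illustrative assumption: the config names, the millisecond figures, and the model itself are made up and are not Triton's actual performance model.

```python
def estimated_ms(config, clock_ratio):
    """Toy estimate: compute time scales with the clock, memory time does not."""
    return max(config["compute_ms"] / clock_ratio, config["memory_ms"])

def pick(configs, clock_ratio):
    """Choose the config with the lowest estimated time at this clock."""
    return min(configs, key=lambda name: estimated_ms(configs[name], clock_ratio))

# Hypothetical numbers: one compute-bound config, one memory-bound config.
configs = {
    "large_tiles": {"compute_ms": 1.0, "memory_ms": 0.8},
    "small_tiles": {"compute_ms": 0.5, "memory_ms": 1.2},
}
print(pick(configs, 1.0))  # large_tiles
print(pick(configs, 0.7))  # small_tiles
```

In this toy model a 30% drop in the sampled clock is enough to change the ranking, which mirrors the sensitivity the commit message describes.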

Cherry-pick into release/2.2.x branch.

If `libcuda.so` cannot be found using `ldconfig -p`, check whether the `LD_LIBRARY_PATH` environment variable is defined and search for it there.

Test plan: https://colab.research.google.com/drive/16Kd88j-nFS4iMI-UfJS5rukH42L6UuyP?usp=sharing

Fixes triton-lang#2507

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
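The fallback described in this commit can be sketched as follows. This is a minimal illustration, assuming the ldconfig lookup has already produced a (possibly empty) list of directories; the function names are hypothetical and do not mirror Triton's source.

```python
import os

def find_in_ld_library_path(name="libcuda.so", env=None):
    """Return directories from LD_LIBRARY_PATH that contain `name`."""
    env = os.environ if env is None else env
    dirs = [d for d in env.get("LD_LIBRARY_PATH", "").split(":") if d]
    return [d for d in dirs if os.path.exists(os.path.join(d, name))]

def locate_libcuda(ldconfig_dirs, env=None):
    """Prefer the ldconfig results; fall back to LD_LIBRARY_PATH when empty."""
    return list(ldconfig_dirs) or find_in_ld_library_path(env=env)
```

Passing `env` explicitly keeps the lookup easy to exercise without touching the real process environment.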

ptillet and others added 30 commits September 11, 2023 20:54

[FRONTEND] allow mixed precision FP8 matmul on pre-H100 hardware (triton-lang#2281)

17692b5

[BACKEND] Unify slow/fast reduce codegen (triton-lang#2220)

2234379

[FRONTEND] Add PyTorch fp8 dtypes to Triton (triton-lang#2279)

69f9f1a

Add PyTorch fp8 dtypes (https://github.com/pytorch/pytorch/blob/8025b193a966a6d8e3afc9c03a54e577bc04eb3d/torchgen/api/types/types.py#L50-L51) to Triton.

[BACKEND] Remove dependency between NVGPU and TritonNvidiaGPU (triton-lang#2282)

21a53aa

[BACKEND] Convert layout illegal mem access fix (triton-lang#2287)

b1efa36

[BACKEND] Optimization to sink broadcast ops (triton-lang#2274)

a21883c

Try to move broadcast ops after arithmetic and convert ops in order to reduce the amount of work needed.
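A small plain-Python illustration of why sinking a broadcast saves work. It only counts element-wise evaluations on nested lists; the helpers are hypothetical and unrelated to the compiler pass itself.

```python
def broadcast(row, n_rows):
    """Repeat a 1-D row n_rows times (a stand-in for a broadcast op)."""
    return [list(row) for _ in range(n_rows)]

def count_calls(fn):
    """Wrap fn so the number of element-wise evaluations can be counted."""
    def wrapped(x):
        wrapped.calls += 1
        return fn(x)
    wrapped.calls = 0
    return wrapped

row = [1.0, 2.0, 3.0, 4.0]

# Broadcast first, then apply the arithmetic to every element.
f = count_calls(lambda x: x * 2.0 + 1.0)
before = [[f(x) for x in r] for r in broadcast(row, 8)]

# Arithmetic first on the small tensor, broadcast afterwards.
g = count_calls(lambda x: x * 2.0 + 1.0)
after = broadcast([g(x) for x in row], 8)

print(before == after, f.calls, g.calls)  # True 32 4
```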

[FRONTEND] Override prototype (triton-lang#2214)

9f1ea0a

A low-tech but very useful way to override kernels on the fly. This can be used for debugging functionality or performance problems: it lets the user dump, modify, and feed IR back into the JIT compiler.

[DOCS] add missing docs (triton-lang#2154)

78f0cb6

[OPTIMIZER] Fix Shared layout in OptimizeDotOperands pass to generate correct swizzling code (triton-lang#2180)

139b724

fix bug triton-lang#1937

Co-authored-by: Philippe Tillet <phil@openai.com>

[BACKEND] Fixing assert in shared encoding swizzling addresses calculation (triton-lang#2292)

76209f4

[FRONTEND] Added SASS to asm dict (triton-lang#2280)

f2b6155

[FRONTEND] Accommodate new triton IR format (triton-lang#2294)

b4af86d

- Support memory space for pointers (e.g., `!tt.ptr<f32, 1>`). - Support parsing function attribute, though not used yet.

[BUILD] Fix few dependencies and layering issues to make lld work (triton-lang#2307)

84a833f

This fixes a few problems that were preventing me from using the lld linker.

[DOCS] update README.md (triton-lang#2311)

9560903

Triton conf registration closed.

[FRONTEND] Add sass to asm dict with lazy evaluation (triton-lang#2309)

8ed9a23

[CI] update integration-tests.yml (triton-lang#2310)

3806a75

[BACKEND] Move struct optimization down the LLVM pipeline (triton-lang#2312)

bc66eb9

Move the optimization to remove phi of struct later in the optimization pipeline to avoid interfering with CFG optimization.

Add cuobjdump and nvsisasm to gitignore. (triton-lang#2319)

5c77029

Otherwise, these files show up in `git status` under python/triton/third_party/cuda/bin/.

Revert "Update integration-tests.yml" (triton-lang#2323)

6b641f1

reverts triton-lang#2310 as recent changes to Triton-IR have broken third-party backends

[BUILD] use ninja (triton-lang#2318)

8db5c9c

[FRONTEND] Explicitly forbid dot(.., out_dtype=bfloat16) (triton-lang#2308)

735b8e8

Fixes: triton-lang#2302

[FRONTEND] fix xpu stages logic (triton-lang#2305)

fb82852

[FRONTEND] fix matmul int8 overflow issue (triton-lang#2297)

9be4a79

Previously on matmul, if inputs are int8, output was also int8. This commit fixes the overflow problem with int32 output. triton-lang#2296
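The overflow is easy to reproduce with integer arithmetic in plain Python. This sketch simulates an 8-bit result by wrapping the running sum; it is an illustration of the numeric issue, not of how Triton's matmul is implemented.

```python
def wrap_int8(x):
    """Wrap an integer to the signed 8-bit range [-128, 127]."""
    return (x + 128) % 256 - 128

def dot(a, b, wrap=None):
    """Dot product with an optional wrapping step after each accumulation."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
        if wrap is not None:
            acc = wrap(acc)
    return acc

a = [100, 100, 100]
b = [2, 2, 2]
print(dot(a, b))             # 600, which fits comfortably in int32
print(dot(a, b, wrap_int8))  # 88, the value after wrapping to int8
```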

[FRONTEND] interpreter rewrite (triton-lang#2321)

14665a7

This is a new interpreter mode that shares semantic analysis with the JIT'ed codepath and that the Triton core team is committed to maintaining.

manman-ren and others added 28 commits November 21, 2023 11:12

[ANALYSIS] Fix membar issues on hopper and clean up code (triton-lang#2664)

e856a4b

`AllocMBarrierOp` and `InsertSliceAsyncV2Op` were ignored in membar, which may cause data race or extra barrier issues.

[FRONTEND] Fix bug when stride is constant (triton-lang#2697)

d9e12c2

Co-authored-by: dongdongl <dongdongl@nvidia.com>

[FRONTEND] Fix interpreter after reshape changes (triton-lang#2705)

08616ce

Update name from create_view to create_reshape

[BACKEND] Fix a bug for TMA (triton-lang#2674)

b6c51e1

We enable TMA by checking `max_divisibility` rather than `divisibility`. 16 is now supported for `max_divisibility`.

Increase documentation workflow timeout. (triton-lang#2706)

3684191

5m does not seem sufficient; jobs sometimes are timing out, e.g. https://github.com/openai/triton/actions/runs/6998821726/job/19037291030. Increase to 20m.

[ANALYSIS] Fix AxisInfoAnalysis for arith.select (triton-lang#2711)

885db75

1. Fix a segmentation fault when the condition is `i1`.
2. Fix the divisibility calculation when contiguity has been changed (different from cond, lhs, or rhs).

[BACKEND] Refactor the TritonGPU dialect utils to call the APIs of the TritonGPU layout attributes. (triton-lang#2682)

f6fa7d3

Refactor the TritonGPU dialect utils to call the APIs of the TritonGPU attributes and its inheritance.

[DOCS] Don't include multiple_of under Iterators. (triton-lang#2721)

3d370fc

It's not an iterator. :) Also minor fix to grammar in docs.

[BACKEND] Fix hoisting of convert on top of extf (triton-lang#2722)

690a1fb

There is a corner case that may cause some use of extf source to be incorrectly replaced when hoisting extf op. Instead we should only map the use of extf we are replacing with the new convert layout op.

Fix whitespace in rst docs. (triton-lang#2723)

de5da10

This affects how the docs are rendered into HTML.

[BACKEND] Fix backward rematerialization for ops with multiple results (triton-lang#2727)

de8d34c

Backward rematerialization could generate wrong IR when rematerializing ops with multiple results as not all the results may have had their encoding updated.

Version: 2.2.0

bfbaef6

[CI] update wheels workflow

94d5505

[CherryPick] [FRONTEND] Add support for using remote cache managers (triton-lang#2934) (triton-lang#3379)

be445fb

Adds a redis-based one as an initial implementation, but it should be straightforward to extend with more impls.

Co-authored-by: andrewjcg <andrewjcg@gmail.com>

Revert "[CherryPick] [FRONTEND] Add support for using remote cache managers (triton-lang#2934) (triton-lang#3379)"

bcc3a83

This reverts commit be445fb. The cherry-pick should be applied to 2.3.x instead.

triton 2.2 support fp8

a269b1c

refine

5995d80

pingzhuu closed this Apr 4, 2024

pingzhuu force-pushed the release/2.2.x branch from bcc3a83 to 0e7b97b, April 4, 2024 11:03

pingzhuu wants to merge 1512 commits into release/2.2.x from triton-2.2-fp8