Support Ada float8 MMA based on 2.2#1
Closed
pingzhuu wants to merge 1512 commits into release/2.2.x from
Conversation
…on-lang#2275) This fixes a few bugs related to scalar tensors:
- `tl.full([], fill_value, dtype)` fails with `TypeError('0d block_type is forbidden')`
- `scalar[None]` fails with `TypeError("'constexpr' object is not iterable")`
- `scalar[None, None]` fails with `AttributeError("'dtype' object has no attribute 'shape'")`
- `scalar.shape` returns `[1]` instead of 0-dim `[]`
- Also related, `tl.zeros_like(scalar)` returns a 1d tensor instead of another scalar
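A minimal sketch of the operations these fixes unblock inside a kernel (the kernel and pointer names are illustrative, not from the patch):

```python
import triton
import triton.language as tl

@triton.jit
def scalar_demo_kernel(out_ptr):
    s = tl.full([], 3.0, tl.float32)  # 0-d scalar tensor; previously raised TypeError
    z = tl.zeros_like(s)              # now also a scalar rather than a 1-d tensor
    v = (s + z)[None]                 # scalar[None] lifts it to a 1-element tensor
    tl.store(out_ptr + tl.arange(0, 1), v)
```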
…e… (triton-lang#2283) …rf on problems that need few blocks. Constrain the number of launched blocks to exactly what the persistent warp-specialized kernel needs. This is useful when problems need very few blocks, e.g. MxNxK=800x800x60000, f16_f16_f32, block size=128x128x64, non-split-k. Experiments show it can achieve a ~16% speedup.
…lang#2285) Add infrastructure to be able to add and test custom LLVM passes in the backend. This will allow us to apply some low-level optimizations and cleanups on LLVM IR. Add a first pass that breaks up phi-of-struct values created by lowering to LLVM. These can pessimize the optimizer, since they block optimizations from looking through phi nodes.
…T->TTGPU pass (triton-lang#2284) This is needed for forward-compatibility with MLIR that now has "inherent" and "discardable" attributes (https://mlir.llvm.org/OpenMeetings/2023-02-09-Properties.pdf) and the ExternElementwiseOp attrs do not propagate with the current `addNamedAttrs` implementation.
Try to move broadcast ops after arithmetic and convert ops in order to reduce the amount of work needed.
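A rough numpy illustration of why the rewrite helps (shapes and values are made up): doing the arithmetic on the un-broadcast operands and broadcasting the result gives the same answer with far fewer operations.

```python
import numpy as np

a = np.arange(4, dtype=np.float32)     # shape (4,)
b = np.full(4, 2.0, dtype=np.float32)  # shape (4,)

# Before the rewrite: broadcast first, then add 4x128 elements.
before = np.broadcast_to(a[:, None], (4, 128)) + np.broadcast_to(b[:, None], (4, 128))

# After the rewrite: add 4 elements, then broadcast the result.
after = np.broadcast_to((a + b)[:, None], (4, 128))

assert np.array_equal(before, after)
```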
Low-tech but very useful way to override kernels on the fly. This can be used for debugging functionality or performance problems: it lets users dump IR, modify it, and feed it back into the JIT compiler.
… correct swizzling code (triton-lang#2180) Fixes triton-lang#1937. Co-authored-by: Philippe Tillet <phil@openai.com>
- Support memory space for pointers (e.g., `!tt.ptr<f32, 1>`).
- Support parsing function attributes, though they are not used yet.
…iton-lang#2307) This fixes a few problems that were preventing me from using the lld linker.
Triton conf registration closed.
…lang#2300) Change the dot to allow taking an initial accumulator, and add a flag that allows the compiler to accumulate in a lower precision than the output type. On Hopper this flag is on by default, which allows accumulating with lower precision. This only affects the Hopper fp8 dot.
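A sketch of how the new accumulator argument might be used in a matmul loop (the tile indexing and the assumption that BLOCK divides K are illustrative; the lower-precision accumulation flag mentioned above is not shown, and its exact spelling is not given here):

```python
import triton
import triton.language as tl

@triton.jit
def dot_acc_kernel(a_ptr, b_ptr, c_ptr, K: tl.constexpr, BLOCK: tl.constexpr):
    # A is (BLOCK, K), B is (K, BLOCK), C is (BLOCK, BLOCK), all row-major.
    offs = tl.arange(0, BLOCK)
    acc = tl.zeros((BLOCK, BLOCK), dtype=tl.float32)
    for k in range(0, K, BLOCK):
        a = tl.load(a_ptr + offs[:, None] * K + (k + offs[None, :]))
        b = tl.load(b_ptr + (k + offs[:, None]) * BLOCK + offs[None, :])
        acc = tl.dot(a, b, acc)  # pass the running accumulator into the dot
    tl.store(c_ptr + offs[:, None] * BLOCK + offs[None, :], acc)
```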
…g#2312) Move the optimization to remove phi of struct later in the optimization pipeline to avoid interfering with CFG optimization.
Otherwise, these files show up in `git status` under python/triton/third_party/cuda/bin/.
On my machine, when I try to `pip install cmake` outside a virtualenv, it gets mad at me and tells me to use apt, which doesn't quite work for some reason. Maybe this is obvious to Python people, but it seems worth mentioning, especially because we already have `.venv` in gitignore.
Reverts triton-lang#2310, as recent changes to Triton-IR have broken third-party backends.
Previously in matmul, if the inputs were int8, the output was also int8. This commit fixes the overflow problem by producing int32 output. Fixes triton-lang#2296
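The overflow is easy to see in plain numpy (values made up for illustration):

```python
import numpy as np

a = np.full((1, 64), 8, dtype=np.int8)
b = np.full((64, 1), 8, dtype=np.int8)

# Accumulating in int8 wraps around: 64 * 8 * 8 = 4096 does not fit in int8.
print((a @ b).item())                                     # 0 after wraparound
print((a.astype(np.int32) @ b.astype(np.int32)).item())   # 4096, the correct result
```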
This is a new interpreter mode that shares semantic analysis with the JIT'ed code path and that the Triton core team is committed to maintaining.
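In practice the interpreter is enabled with the `TRITON_INTERPRET=1` environment variable; a minimal sketch (the kernel is illustrative):

```python
# Set before importing triton so kernels run on the emulated path.
import os
os.environ["TRITON_INTERPRET"] = "1"

import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)
```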
… a Select Op (triton-lang#2678) Summary: Fix triton-lang#2655. When there is a Select Op after a Load Op, the types of the Select Op's operands will differ, so we can't use the same newCvtTy for all the operands when creating the ConvertLayoutOp. --------- Co-authored-by: Manman Ren <mren@meta.com> Co-authored-by: Manman Ren <mren@fb.com>
…#2664) `AllocMBarrierOp` and `InsertSliceAsyncV2Op` were ignored in membar, which may cause data races or extra-barrier issues.
Generalize the view op into a reshape op with an attribute deciding whether re-ordering elements is allowed. When re-ordering elements is not allowed, we currently only handle the trivial block layout and make sure none of the passes generates a different layout.
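A hedged sketch of the user-facing distinction, assuming the attribute surfaces in the Python API as a `can_reorder` keyword:

```python
import triton
import triton.language as tl

@triton.jit
def reshape_demo(x_ptr, out_ptr):
    x = tl.load(x_ptr + tl.arange(0, 64))
    a = tl.reshape(x, (8, 8))                    # element order must be preserved
    b = tl.reshape(x, (8, 8), can_reorder=True)  # old "view": compiler may reorder
    tl.store(out_ptr + tl.arange(0, 64), tl.reshape(a, (64,)))
```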
…riton-lang#2698) In this PR we deduplicate the llvm-hash value (currently it lives in the llvm-hash.txt file but also under python/setup.py). The new location is cmake/llvm-hash.txt. Both changes were suggested by jlebar in triton-lang#2570. Test: pushed the same change to the llvm-head branch and triggered an llvm-build (https://github.com/openai/triton/actions/runs/6970868398), which passed.
Co-authored-by: dongdongl <dongdongl@nvidia.com>
Update name from create_view to create_reshape
…c()` calls (triton-lang#2703) This PR replaces the `py::exec()` calls with the native Python C API. This is much safer and eliminates side effects on the Python side (e.g., the assignment to `warnings.showwarning`). Ref: - triton-lang#2155
We enable TMA by checking `max_divisibility` rather than `divisibility`. 16 is now supported for `max_divisibility`.
5m does not seem sufficient; jobs are sometimes timing out, e.g. https://github.com/openai/triton/actions/runs/6998821726/job/19037291030. Increase to 20m.
1. Fix a segmentation fault when the condition is `i1`. 2. Fix the divisibility calculation when the contiguity has changed (i.e., differs from that of cond, lhs, or rhs).
Suppose we have a loop which yields an operand which lives outside the
loop, i.e. is loop-invariant. Moreover, suppose the same operand in
iter_args is never used.
```
%init = ...
%x = ...
%y = for iter_args(%unused = %init) {
  yield %x
}
return %y
```
Previously, we would declare that operand 0 of the yield is "dead" and
replace it with `yield %init`. This is obviously not correct.
The way this happened was:
- ForOpDeadArgElimination iterates over the forOp's results, in this
case just [%y].
- We notice that %y is used, so it's not dead. We call markLive(%x),
where %x is the yielded value that corresponds to %y.
- %x is not defined in the loop body, so markLive skips it.
- Therefore at the end of the function, operand 0 of the yield is
considered dead. We change it to `yield %init`, so it can be removed
from the loop entirely in a later pass.
In this patch, we add a special case that says that a `yield` which
returns a value from outside the loop isn't dead.
Fixes triton-lang#2672
This is integrating @apgoucher's work, which implements a bitonic sort function using reductions to slice tensors and selects to merge them. This version currently only supports sorting the innermost dimension of a tensor. This will be extended in future changes. Co-authored-by: apgoucher <apgoucher@openai.com>
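Assuming the function is exposed as `tl.sort`, usage might look like this sketch (kernel name and shapes are illustrative):

```python
import triton
import triton.language as tl

@triton.jit
def sort_kernel(x_ptr, out_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    x = tl.load(x_ptr + offs)
    y = tl.sort(x)  # bitonic sort along the innermost dimension
    tl.store(out_ptr + offs, y)
```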
…ent kernels to improve performance of warp-specialized kernels with NUM_CTAS>1 (triton-lang#2638)
- Improve the logic for determining splitM & splitN
- Add necessary support for persistent kernels with NUM_CTAS>1
- Add a canonical warp-id query operation to help CSE and LICM before the nvgpu2llvm pass
- Add warning info for the fallback of warp-specialized kernels
…e TritonGPU layout attributes. (triton-lang#2682) Refactor the TritonGPU dialect utils to call the APIs of the TritonGPU attributes and their inheritance hierarchy.
Include fixes for different LLVM updates:
- Add elem type to SharedMemoryObject.
- Change all pointers to opaque.
- Fixes for llvm update 3cd2a0bc
--------- Co-authored-by: Ashay Rane <ashay@users.noreply.github.com> Co-authored-by: Goran Flegar <gflegar@google.com> Co-authored-by: khasanovaa <khasanovaaliya19@gmail.com> Co-authored-by: Tori Baker <vwbaker@google.com> Co-authored-by: Aliia Khasanova <40315403+khasanovaa@users.noreply.github.com>
It's not an iterator. :) Also minor fix to grammar in docs.
There is a corner case that may cause some uses of the extf source to be incorrectly replaced when hoisting the extf op. Instead, we should only map the use of extf that we are replacing with the new convert-layout op.
This affects how the docs are rendered into HTML.
triton-lang#2727) Backward rematerialization could generate wrong IR when rematerializing ops with multiple results, as not all of the results may have had their encoding updated.
…onHandle for CUDA version compatibility (triton-lang#2771)" (triton-lang#2789) This is needed for CUDA 11 support, which we'd like to have in the PyTorch 2.2 release. Original commit message: In case CUDA 11 drivers are still used on some systems, we shouldn't call TMA and block-cluster related functions directly. Instead, we can dynamically look up the handles to avoid compatibility issues. Co-authored-by: Keren Zhou <kerenzhou@openai.com>
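As a rough illustration of the idea (in Python/ctypes rather than the actual driver C code): look the symbol up at runtime instead of referencing it directly, so loading does not fail on CUDA 11 drivers that lack the TMA entry points.

```python
import ctypes

# Loading libcuda itself works on both driver generations.
libcuda = ctypes.CDLL("libcuda.so.1")

try:
    # cuTensorMapEncodeTiled only exists in CUDA 12 drivers.
    encode_tiled = getattr(libcuda, "cuTensorMapEncodeTiled")
except AttributeError:
    encode_tiled = None  # old driver: take the non-TMA code path

if encode_tiled is None:
    print("TMA unavailable; using fallback path")
```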
…ps (triton-lang#2870) By reading the current clock, our analytical calculations can vary while we're evaluating different configs. It turns out the choice of config is very sensitive to the clock, such that a slight throttling can make us reject very good configs, in favor of very bad ones. A reproducer can be found here: https://gist.github.com/bertmaher/8ff5e9631666846fff55d81326cacb4d

```
$ python thermal_throttle.py
chosen config BLOCK_M: 128, BLOCK_N: 256, BLOCK_K: 32, SPLIT_K: 1, num_warps: 8, num_ctas: 1, num_stages: 3, enable_warp_specialization: False, enable_persistent: False
tflops/s: 107.92460196062149

$ python thermal_throttle.py --preheat
chosen config BLOCK_M: 32, BLOCK_N: 32, BLOCK_K: 32, SPLIT_K: 1, num_warps: 2, num_ctas: 1, num_stages: 6, enable_warp_specialization: False, enable_persistent: False
tflops/s: 39.29629633970286
```

Cherry-pick of triton-lang#2801 into release/2.2.x
Cherry-pick into the release/2.2.x branch. If `libcuda.so` cannot be found using `ldconfig -p`, check whether the `LD_LIBRARY_PATH` environment variable is defined and search for it there. Test plan: https://colab.research.google.com/drive/16Kd88j-nFS4iMI-UfJS5rukH42L6UuyP?usp=sharing Fixes triton-lang#2507 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
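A sketch of the described lookup order, with an illustrative helper name: try `ldconfig -p` first, then scan `LD_LIBRARY_PATH`.

```python
import os
import subprocess

def find_libcuda():
    # First try the linker cache; ldconfig may live in /sbin on some systems.
    try:
        out = subprocess.check_output(["ldconfig", "-p"], text=True)
        for line in out.splitlines():
            if "libcuda.so" in line:
                return line.split("=>")[-1].strip()
    except (OSError, subprocess.CalledProcessError):
        pass
    # Fall back to LD_LIBRARY_PATH, as the fix describes.
    for d in os.environ.get("LD_LIBRARY_PATH", "").split(":"):
        candidate = os.path.join(d, "libcuda.so")
        if d and os.path.exists(candidate):
            return candidate
    return None
```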
…riton-lang#2934) (triton-lang#3379) Adds a redis-based one as an initial implementation, but it should be straightforward to extend with more implementations. Co-authored-by: andrewjcg <andrewjcg@gmail.com>
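A purely hypothetical sketch of what a redis-backed cache might look like; the class name and method set are assumptions, not the actual Triton interface.

```python
import redis  # redis-py client

class RedisCacheBackend:
    """Hypothetical remote cache storing compiled-kernel artifacts in redis."""

    def __init__(self, host="localhost", port=6379, prefix="triton:"):
        self._db = redis.Redis(host=host, port=port)
        self._prefix = prefix

    def get(self, key: str):
        return self._db.get(self._prefix + key)  # None on a cache miss

    def put(self, key: str, data: bytes):
        self._db.set(self._prefix + key, data)
```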
…nagers (triton-lang#2934) (triton-lang#3379)" This reverts commit be445fb. The cherry-pick should be applied to 2.3.x instead.