3436 commits
a1f9a14
[AMD][GFX1250] Split unsupported async store widths (#9994)
yiqian1 Apr 14, 2026
f43dff6
Revert "[NVIDIA] Support swizzle 0 TMA + MMA for Hopper and Blackwell…
ThomasRaoux Apr 14, 2026
8324fad
Fix constraints in isExpensiveCat (#9995)
neildhar Apr 15, 2026
035bcb5
[TESTS] Run all tests in CI (#10027)
lezcano Apr 15, 2026
f0b7641
[BACKEND] Consider TMA variants in Triton passes (#10014)
lezcano Apr 15, 2026
de7ebd9
[BACKEND] Make TMAReduce more robust (#10013)
lezcano Apr 15, 2026
e5e1621
[PROTON][TEST] Ensure nodes without any metrics are not dumped to the…
Jokeren Apr 15, 2026
e700f46
Add a verifier for CatOp (#9996)
neildhar Apr 15, 2026
6cbdfee
[triton-ext] Include `Version.h` during installation (#10023)
abrown Apr 15, 2026
ab1f012
[Gluon] Expose TMA atomic ops (#10040)
lezcano Apr 15, 2026
441faac
[RELAND] Infer src/dst of allowReorder reshape (#9997)
neildhar Apr 15, 2026
12138f4
[CONSAN] Handle memdesc selects in buffer region analysis (#10031)
pawelszczerbuk Apr 15, 2026
7430fe9
[nvidia] Always insert bar sync before all mbarrier arrives (#10035)
Mogball Apr 15, 2026
87c7072
[FPSAN] Add missing barrier to the recently introduced test (#10043)
pawelszczerbuk Apr 15, 2026
ba1ed62
[RELAND] Verify allowReorder reshapes (#9998)
neildhar Apr 15, 2026
58816a1
[Tools][Translator] Add AMD backend support for Triton-to-Gluon trans…
jammm Apr 15, 2026
6ea516a
[Backend] Bump to llvm/llvm-project@87717bf9f81f (#9992)
antiagainst Apr 16, 2026
746064c
[FPSAN] Fix crash on incorrect layout for tmem copy (#10046)
pawelszczerbuk Apr 16, 2026
97b099b
[FPSan] explicit lowerings for fmin/fmax (#10039)
apgoucher Apr 16, 2026
5dc1b24
Fix LLVMDILocalVariable pass crash on external function declarations …
dev-tomek Apr 16, 2026
6ee5472
[GLUON] Infer CLC multicast from multicta (#10051)
lezcano Apr 16, 2026
d326357
[AMD][CI] Switch to snapshot repos to workaround mirror issue (#10054)
antiagainst Apr 16, 2026
eb5efe2
[BACKEND] Implement multiCTA support for TMA gather/scatter (#9977)
lezcano Apr 16, 2026
fa4db31
Fix bench_mlp.py (#9919)
CliveUnger Apr 16, 2026
f62f95b
[triton] Python typing improvements (#10048)
Mogball Apr 16, 2026
e9352e2
[BACKEND] Model async TMA variants in ConSan (#10015)
lezcano Apr 16, 2026
3be1a23
Limit warp specialization partition attrs to WS pass (#10058)
ThomasRaoux Apr 16, 2026
23f4e52
[gluon][examples] MoE bmm1 in Gluon (#10047)
Mogball Apr 17, 2026
fe0c38b
Use TargetInfo for shared memory accesses (#10038)
neildhar Apr 17, 2026
07ef288
[GSan] Model TMAReduceOp as atomic (#10053)
peterbell10 Apr 17, 2026
3123400
[matmul kernel] [nvfp4] Use flex ctx out scale - to support tensor sc…
tristan-oai Apr 18, 2026
043ba3e
[gluon][translator] Upstream some fixes (#10069)
Mogball Apr 18, 2026
2796cea
[AMD] Clean up shuffleXor implementation (#10065)
FrederickVu Apr 18, 2026
0ee2ec2
[AMD][gfx1250] Improve gluon f16 gemm kernel pipeline (#10057)
guacamoleo Apr 18, 2026
a303a03
[Consan] Support CLC (#10052)
lezcano Apr 18, 2026
ca6bd0c
[TritonGPU] Coalesce integer atomics (#10059)
jack-white1 Apr 19, 2026
cdac714
[AMD][CI] Disable ASAN tests due to non-deterministic failures (#10071)
antiagainst Apr 19, 2026
6684293
Fix maskSpanAffineOffset bitmask in ldmatrix/stmatrix subslice check …
ianbarber Apr 20, 2026
9e59ac0
[BACKEND] Extend support for small MMAv2 FP64: single 8x8x4 instructi…
mwichro Apr 20, 2026
cc8ebcf
[AMD][gfx1250] Revert col-major support for TDM (#10078)
AlexAUT Apr 20, 2026
ec35f3d
[Gluon] Rename async_copy_... -> async_{load,store} (#10083)
peterbell10 Apr 20, 2026
6533931
Revert "[AMD][CI] Switch to snapshot repos to workaround mirror issue…
antiagainst Apr 20, 2026
88e8e52
Improve cache modifiers support for loads (#9936)
alefimov-amd Apr 20, 2026
ee5bc26
[AMD] Enable ds_read_tr* lowering for PartitionedSharedEncodingAttr (…
plognjen Apr 20, 2026
147a60d
[FPSAN] Insert barriers in between scratch loads and stores (#10055)
pawelszczerbuk Apr 20, 2026
2c7ce49
[AMD][gfx1250] Add Scaled WMMA 32x16 Shape for FP4 (#10082)
sriakrish Apr 20, 2026
64a06a4
Fix `tl.fdiv` to respect `ieee_rounding` flag (#10074)
JueonPark Apr 21, 2026
a378eb9
[AMD][gfx1250] Add fused SwiGLU mode to gluon f16 GEMM examples (#10076)
kelesvol Apr 21, 2026
8a72e80
Fix docs build with Sphinx 9 (#10091)
ThomasRaoux Apr 21, 2026
6d63b16
[NVIDIA] Add missing NVWS tablegen dependencies to NVHopperTransforms…
glassmanK Apr 21, 2026
523a044
[Nvidia][Gluon] Refactor convolution fprop kernel, add wgrad and dgra…
bingyizh233 Apr 21, 2026
029b260
[AMD][CI] Continue on error if on gfx942 (#10092)
antiagainst Apr 21, 2026
dea2e9d
[AMD] Add atomic vectorization cap (#10093)
FrederickVu Apr 21, 2026
e91c088
[Frontend] Promote `_aggregate` to public (#10095)
peterbell10 Apr 21, 2026
b103a65
[AMD][gfx1250] Clamp padInterval based on TDM limits (#10097)
AlexAUT Apr 21, 2026
75e9d8b
[FPSAN] Clean up fpsan pattern failures (#10090)
pawelszczerbuk Apr 21, 2026
37c9a4b
[NFC] Kill a couple helpers and make internal functions static (#9988)
lezcano Apr 21, 2026
430fdd5
[Gluon][Translator] Hopper support (#10089)
Mogball Apr 22, 2026
e6ecb6d
Fix nested `def` in `@triton.jit` to raise intended `UnsupportedLangu…
swjng Apr 22, 2026
06179e4
[Docs] Add gluon to rendered docs (#10101)
peterbell10 Apr 22, 2026
81fb1b3
[AMD][gfx1250] Fix asynccnt computation for split async_stores (#10098)
AlexAUT Apr 22, 2026
4b3185e
[Gluon] Fix auto layout for split op (#9865)
ThomasRaoux Apr 22, 2026
e8495b9
[Gluon][examples] Some autotuning of the attention example (#10110)
Mogball Apr 23, 2026
af26750
[AMD][GLUON] Expose AMD scaled_upcast ops in Gluon (#10111)
FrederickVu Apr 23, 2026
d2f88ef
[AMD][gfx1250] Vector-typed internal TDM descriptor representation (#…
zhanglx13 Apr 23, 2026
63e38fe
[Gluon][Examples] Add 2 CTA to 05-moe-bmm1-fused-gather (#10114)
Mogball Apr 23, 2026
5ab55e0
Introduce GenericLinearEncodingAttr (#9765)
plognjen Apr 23, 2026
b56f3b4
[FPSAN] Add support for wgmma in fpsan (#10112)
pawelszczerbuk Apr 23, 2026
fb3de49
Add dense matmul benchmark for triton_kernels (#9850)
ThomasRaoux Apr 23, 2026
feade2c
[AMD][GFX12] Cache modifiers for buffer and async ops (#10109)
alefimov-amd Apr 23, 2026
9f34338
[DOCS] Clarify tl.load semantics when other is None (#10119)
javierdejesusda Apr 23, 2026
2760252
[Sanitizers] Run RemoveLayoutConversions when FPSan or ConSan are on …
Mogball Apr 23, 2026
a9ced83
[AMD][gfx1250] Fix TDM gather/scatter intrinsic waitcnt computation (…
AlexAUT Apr 24, 2026
f258f68
[LinearLayout] fix defect in invertAndCompose (#10116)
yangshuxin Apr 24, 2026
75c8951
Enable local load/store lowering with generic linear encoding (#10122)
plognjen Apr 24, 2026
fdfded1
Minimize exported symbols of libtriton (#9922)
neildhar Apr 24, 2026
4097dd7
[AMD][gluon] Derive gfx1250 write slot from read phase (#10088)
jungpark-mlir Apr 24, 2026
27c4028
Include `python/src` headers in installation artifacts (#9847)
abrown Apr 24, 2026
393f634
[AMD][gfx1250] Add fused SwiGLU mode to gluon mxfp GEMM example (#10108)
kelesvol Apr 24, 2026
243c16a
Fix AxisInfo correctness: signed constants, unknown shift, and shift …
he-yufeng Apr 25, 2026
4da2e26
[FPSAN] Add support for batched matmul, make fpsan errors fatal (#10118)
pawelszczerbuk Apr 25, 2026
aacfc48
[EZ][BACKEND] Reject memdesc reinterpret changing CTA layout size (#1…
lezcano Apr 27, 2026
a22f290
[AMD] NFC: Unify AMD GFX architecture pass option (#10138)
antiagainst Apr 27, 2026
9d9a2d7
[Gluon][Examples] Update MoE BMM1 selector configs (#10124)
Mogball Apr 27, 2026
72eb2f4
[AMD][CI] Switch gfx942 to use normal pytorch docker (#10146)
antiagainst Apr 27, 2026
1564875
[AMD] Add rocprofiler SDK HIP headers for proton use (#10144)
ZelboK Apr 28, 2026
7af3133
[Tutorials] Fix unbound local variable in Tutorial 10 (#10151)
CliveUnger Apr 28, 2026
39fe2e8
[BENCH][PROTON] Implement `do_bench_proton` and `do_bench_cudagraph_p…
Jokeren Apr 28, 2026
e570a63
[NFC] Update ConSan docs and move aux-data comments into Utility.h (#…
pawelszczerbuk Apr 28, 2026
5b3965b
[Triton] Fix LLVMDILocalVariable pass crash with LLVM_EXTRACT_DI_LOCA…
byoshimi-gmail Apr 28, 2026
2ad055b
[AMD] Enable loop unrolling for Gluon warp-pipelined kernels (#9666)
Hardcode84 Apr 28, 2026
07a41a4
[AMD][CI] Allow continue-on-error for all targets (#10161)
antiagainst Apr 28, 2026
9a9db53
[PROTON] Fix flakey test failure due to GPU memory pressure. (#10147)
byoshimi-gmail Apr 28, 2026
ce5391b
[CI] Disable flaky consan multicast case (#10162)
ThomasRaoux Apr 29, 2026
2e82bd6
[AMD][gfx1250] Support Triton-level descriptor gather/scatter (#10157)
jerryyin Apr 29, 2026
0f5f46e
[BACKEND][MMAv2] Move unsupported MMA instruction checks to an early …
Jokeren Apr 29, 2026
e42acde
[triton_kernels] Optimize matmul metadata metrics (#10150)
qnie-oai Apr 29, 2026
82ff564
Do not send f64 dots through tcgen05 (#10126)
vwbaker Apr 29, 2026
54581af
[AMD][Backend] Apply bitwise disjoint affine offset padding separatel…
FrederickVu Apr 29, 2026
fe44788
[BACKEND] Simplify mbarrier.expect lowering (#10168)
lezcano Apr 29, 2026
3cb4aec
Python typing followup to #10048 (#10120)
leijurv Apr 29, 2026
799916e
[AMD][gfx1250] Enable resolve GEMM LDS partition conflicts e2e (#10145)
plognjen Apr 29, 2026
e9618a6
Enable PartitionedSharedEncodingAttr in the memdesc_subslice lowering…
plognjen Apr 29, 2026
83210d5
setup.py: Make `is_git_repo` work for submodules & worktrees (#9641)
charlie-wt Apr 29, 2026
ca21b1b
[TritonGPU] Split RemoveLayoutConversions cleanup; tolerate SCF non-c…
warrendeng Apr 29, 2026
3267f9d
[NVIDIA] Decouple target feature queries from `TargetInfo` (#10175)
masahi Apr 29, 2026
e8e6247
[AMD][gfx1250] Support cache modifier in TDM operations (#10169)
alefimov-amd Apr 29, 2026
18415f7
[AMD][gfx9] Restore token-aware wait count derivation on asyncmark ta…
lijinpei-amd Apr 30, 2026
bd476df
[Gluon for AMD] unwrap constexpr None in buffer_load/buffer_store/_ve…
Dewei-Wang-sh Apr 30, 2026
bf24f10
[build] Add nvidia include overrides to GSan build (#10072)
meinie0826 Apr 30, 2026
0f2a3b1
[Reland][NVIDIA] Support swizzle 0 TMA + MMA for Hopper and Blackwell…
masahi Apr 30, 2026
0a00f22
Treat pure register reorder as free in backward remat (#10128)
neildhar Apr 30, 2026
c394e52
[GLUON] `shared_atomic_add` implementation (#10100)
Jokeren Apr 30, 2026
bc9d4b5
[AMD][gfx1250] Pipeline descriptor gather through the TDM async chain…
jerryyin Apr 30, 2026
cc9dbec
[AMD][gfx1250] Fix f32 to fp8*fnuz conversion and test skips (#10178)
antiagainst Apr 30, 2026
042b2cd
Add AutotuneListener hook to triton knobs (#10125)
FindHao Apr 30, 2026
5c0e7a0
[RUNTIME] Initialize profile scratch on stream 0 (#10186)
lezcano May 1, 2026
83940f4
[Backend] Respect IEEE input precision for tcgen05 (#10180)
peterbell10 May 1, 2026
e42e0a0
[AMD][gfx1250] Add v_pk_*_bf16 conversion and fix sqrt denorm (#10193)
FrederickVu May 1, 2026
898d407
[BACKEND] Sink tmem allocs after pipelining to reduce liveranges (#9093)
ThomasRaoux May 1, 2026
ef05530
Fix scaled MMA partition scheduling dependency (#10191)
meinie0826 May 1, 2026
5d69e1c
[AMD] Enable IN_THREAD_TRANSPOSE to GFX120* by default (#10185)
skysnow2001 May 1, 2026
19ccc01
[CONSAN] Fix deadlock detection in WS code (#10192)
lezcano May 2, 2026
b738eb6
[Proton] Normalize roofline metric dtypes (#10199)
qnie-oai May 2, 2026
7cff1f2
[FPSAN] Add fpsan docs to triton/docs/programming-guide/chapter-3 (#1…
pawelszczerbuk May 3, 2026
885569f
[mxfp4] Fix Hopper scale padding mask (#10190)
hthu May 4, 2026
efea759
[AMD][gfx1250] Update stale tests and examples (#10207)
antiagainst May 4, 2026
886365c
[AMD][gfx1250] Explicitly propagate NaN for MXFP attn example (#10208)
antiagainst May 4, 2026
dfd7c52
[BACKEND] Enable i16 descriptor gather/scatter indices on NVIDIA (#10…
ThomasRaoux May 4, 2026
e77dbcd
[MULTICTA] Fix multicast pattern for tcgen05_mma_scaled (#10196)
lezcano May 4, 2026
4790c35
[CONSAN] Add read before any write check (#10167)
lezcano May 4, 2026
8d078ff
[CONSAN] Fix i32 scratch offset overflow (#10171)
pawelszczerbuk May 4, 2026
12fe342
[AMD][gfx1250] Drop ttg.async_commit_group from TDM async chain (#10215)
jerryyin May 4, 2026
195e208
[ConSan] Fix missing captureBytes argument in passToWarpSpecialize ca…
antiagainst May 4, 2026
2bcd105
[Gluon] Unwrap constexpr bool in dot_scaled and associative_scan (#10…
knwng May 5, 2026
9282d71
[AMD][gfx1250] Add transposed support for 32x16 scaled WMMA (#10222)
sriakrish May 5, 2026
f5d1442
[PROTON] Skip flaky periodic flushing test (#10223)
Jokeren May 5, 2026
321275f
[Frontend] Unwrap constexpr before semantic (#10227)
peterbell10 May 5, 2026
560230d
[docs] Fix fpsan docs (#10228)
peterbell10 May 5, 2026
e527734
[gsan] Fix uses of node->size that should be node->allocSize (#10197)
peterbell10 May 5, 2026
71b8683
[GLUON] Generalize `local_atomic_scatter_add` to `local_atomic_scatte…
Jokeren May 5, 2026
7efdeb1
Fix ptr->int->ptr canonicalizer if types don't match (#10226)
neildhar May 5, 2026
b45acd4
[AMD] Fixing mi350 BlockPingpong update waits (#10194)
jerryyin May 5, 2026
cd8e4ac
[EXAMPLES] Implement multicta attention (#10211)
lezcano May 5, 2026
efd36c4
[AMD][gfx1250] Combine redundant amdgpu.async_tdm_wait ops (#10230)
jerryyin May 6, 2026
028278e
[PROTON] Using sequence id to find data entry instead of relying on t…
Jokeren May 6, 2026
379c934
Custom Intermediate split-k dtype (#10236)
ferrari-openai May 6, 2026
e9ac8e4
[fpsan] Make embed/unembed IR nodes to enable canonicalization (#10232)
peterbell10 May 6, 2026
60c81a3
[CD] Pin DOCKER_API_VERSION job-wide for Wheels workflow (#10245)
atalman May 6, 2026
7a02321
20260505 meetup notes (#10252)
byoshimi-gmail May 6, 2026
38ec066
[triton_kernels] nvfp4 x nvfp4 tuning (#10249)
jeffniu-openai May 6, 2026
36dbe0c
[PROTON][TEST] Fix test flakiness (#10237)
Jokeren May 7, 2026
4cd6bcb
[AMD][GFX1250] Reorder bf16 attn prologue TDM to match loop body (#10…
cagrikymk May 7, 2026
0dc7fde
[AMD][gluon] Accept boolean predicates in tdm load/gather (#10253)
peterbell10 May 7, 2026
40e899b
[BACKEND] Reinterpreted memory should represent the same amount of me…
lezcano May 7, 2026
af16bc1
Fix convert_layout lowering for CGA + slice layouts. (#10242)
meinie0826 May 7, 2026
23e48be
[CD] Reduce wheel size by excluding libtriton.so from auditwheel repa…
atalman May 7, 2026
3499593
[AMD][gfx1250] Enable f32 WMMA for AccelerateMatmul (#9886)
ravil-mobile May 7, 2026
da6372f
Allow MXFP8 LHS with Hopper-swizzled MXFP4 RHS (#10214)
roman-openai May 7, 2026
dd2c08b
[AMD][NFC] Refactor lowerInst to a function, cleanup of TileKind (#10…
nzaghen May 7, 2026
bc28f1b
[AMD] Ignore tf32 precision for fp64 mfma dots (#10216)
draganmladjenovic May 7, 2026
db1d298
[fpsan] Store to scratch buffers in embedded form (#10233)
peterbell10 May 7, 2026
fe119ae
[NFC] Remove some dead functions (#10247)
peterbell10 May 7, 2026
4a1df47
[NFC] Remove getElementType helper functions (#10235)
peterbell10 May 7, 2026
f4a3db9
[AMD][LAYOUTS] Refine optimal swizzling for wavefront64 (#9662)
amd-jianli12 May 7, 2026
f451bad
[NFC] Delete dead variables and functions (#10261)
neildhar May 8, 2026
587f2c4
[AMD][gfx1250] Add `update_tensor_descriptor` op (#10225)
zhanglx13 May 8, 2026
bfb6887
[NFC][TUTORIALS] Add some more info to multicta tutorial (#10258)
lezcano May 8, 2026
5f87f87
[BACKEND] add TMEM barrier insertion pass (#10263)
ThomasRaoux May 8, 2026
6fe3ed7
[Language] Add field inheritance, defaults, and aggregate_replace to …
blake-snc May 8, 2026
70b3e39
[NFC] Clean up top-level `CMakeLists.txt` (#10268)
abrown May 8, 2026
153b635
Fix negation of +0.0 (#10270)
jelle-openai May 8, 2026
9559997
[AMD] Support predicate descriptor load when pipelining (#10272)
antiagainst May 8, 2026
d80d286
[CONSAN] Multi CTA model v2 (#10212)
lezcano May 8, 2026
2defcb7
[AMD] Create TargetFeatures for architectural checks (#10142)
antiagainst May 9, 2026
521c2e3
[AMD] Extend chain-dot detection across loop iteration boundaries (#1…
raikonenfnu May 9, 2026
dc57615
Account for duplicated elementwise ops in backward remat cost (#10129)
neildhar May 10, 2026
a4b65bf
[BACKEND] Recognise fneg in clamp and use it in the MoE (#10280)
lezcano May 10, 2026
feb6c04
[PROTON] Group metadata kernels under the corresponding triton operat…
Jokeren May 11, 2026
3406408
Support zero-sized Hopper MX scale layouts (#10275)
roman-openai May 11, 2026
fd68aeb
[GSAN] Add CUDA feature flag for PTX 7.0 support in GSan build (#10282)
Jokeren May 11, 2026
0ec31f4
[triton-ext] Allow testing triton plugins in isolation (#10269)
abrown May 11, 2026
25d2d2d
Handle region control flow in remat cost calculation (#9201)
neildhar May 11, 2026
3e0cbc0
[AMD][gfx1250] Support predicate descriptor store for pipelining (#10…
antiagainst May 12, 2026
1d2f438
[AMD][gfx1250] Do not truncate TDM strides to 32bit (#10287)
AlexAUT May 12, 2026
36394d4
[AMD] Disable LLVM vector combine pass (#9260)
antiagainst May 12, 2026
215c162
Support returning tensors in TritonToTritonGPU (#10189)
leijurv May 12, 2026
ad911ca
[AMD] Warp-pipeline: back-to-back loop optimization & flat (unrolled)…
jungpark-mlir May 12, 2026
76763e9
[AMD][gfx1250] Add example MoE Gluon Kernel (#10204)
knwng May 12, 2026
877dbaf
Use ptxas 13.1.80 for all NVIDIA targets (#10294)
ThomasRaoux May 13, 2026
d315663
Revert "[CONSAN] Add read before any write check (#10167)" (#10297)
pawelszczerbuk May 13, 2026
4ee0dde
[PROTON] Centralize and categorize exception handling (#10173)
Jokeren May 13, 2026
8bfe473
[DOC] Add `do_bench_proton` and `do_bench_cudagraph_proton` to the do…
Jokeren May 13, 2026
6898a32
[GLUON][TEST] Test convert layout codegen with multiple CTAs (#10299)
Jokeren May 13, 2026
11459af
[AMD][gfx1250] Support warp usage hints in TDM copy (#10056)
jungpark-mlir May 13, 2026
f6eb36f
[AMD] Enable true16 for gfx11 (#10301)
skysnow2001 May 13, 2026
3de9d04
[ BACKEND ] Enable `tl.dot` with TF32 precision on tiles with N=8 an…
mwichro May 13, 2026
05dde66
Revert "Use ptxas 13.1.80 for all NVIDIA targets" (#10303)
ThomasRaoux May 13, 2026
06788ae
[AMD] Skip triton-to-gluon translator tests on RDNA3/RDNA4 (#10304)
saeid-rostami May 13, 2026
eae900d
[fpsan] Support arith.negf (#10306)
peterbell10 May 13, 2026
0dcc1c5
[BACKEND] Allow reinterpret to modify the rank (#10286)
lezcano May 13, 2026
44c91e5
[INTERPRETER] Implement tl.dot_scaled for the interpreter (#10311)
arp2600 May 14, 2026
12bbea0
[GLUON] Support broadcasting of shared scatter operations and unify o…
Jokeren May 14, 2026
5073b1f
[PROTON] Migrate Proton ROCm backend from roctracer to rocprofiler-sd…
ZelboK May 14, 2026
80f8d09
Speed up reduce_forward for the common case "broadcast_n" masking. (#…
yongjik May 15, 2026
70ff039
[AMD][gfx1250][TDM] Handle Negative Offsets as Fully Out-of-Bounds Ti…
AlexAUT May 15, 2026
7de28f0
[AMD] Guard MFMA store layout for small N dimensions (#10305)
justinrosner May 15, 2026
3c71c5f
Support MN-packing in decomposed fp4 dot_scaled (#10318)
masahi May 15, 2026
5e657f9
[BACKEND] Avoid creating dangling registers when `red` instruction is…
Jokeren May 15, 2026
67bcf4b
[runtime] Skip None args in autotune restore_value/reset_to_zero hook…
kasper0406 May 15, 2026
d930eb6
[CONSAN] Add smem and tmem initialization to NaN (#10308)
pawelszczerbuk May 15, 2026
e7c1b66
[Frontend] Support bare annotation expressions (#10322)
peterbell10 May 15, 2026
33a08e3
[BACKEND] Allow broadcasted CTA memdesc subslices (#10319)
ThomasRaoux May 15, 2026
7d4c7ce
Fixing splitClusterBefore implicit-insert bug (#10127)
mydatascience May 15, 2026
7c29da1
[BACKEND] Allow two_ctas=False barriers in TMA ops in a 2CTA kernel (…
lezcano May 15, 2026
6fb03a7
[PROTON] Keep CUDA graph metric replay state aligned (#10326)
qnie-oai May 16, 2026
4768da5
[Interpreter] Fix argmax/argmin NaN handling to match JIT semantics (…
songdejun May 16, 2026
839fde3
[Interpreter] Fix maximum/minimum/clamp NaN handling to match JIT sem…
Chennesxu May 18, 2026
0bdab22
Allow Gluon local_store with mismatched CGA layout (#10296)
ThomasRaoux May 19, 2026
fdfc3f9
[PROTON] Allow out-of-tree backends to register Proton profilers, dev…
GeorgeWigley May 19, 2026
ed8317b
[AMD][GFX9] Optimize warp-uniform direct to LDS predicates (#10332)
AlexAUT May 19, 2026
8b2e898
[CONSAN] Fix cluster_barrier handling (#10323)
lezcano May 19, 2026
a0370fe
[Tests] Drop pytest-forked in favour of run_in_process (#10335)
peterbell10 May 19, 2026
7ee763f
[PROTON] Update .gitignore to include additional stub files (#10339)
Jokeren May 19, 2026
5fa19a1
[AMD] Handle padded layout in MemDescReinterpretOp verifier (#10184)
yangshuxin May 20, 2026
57a214f
[AMD] Fix invalid LLVM intrinsic name for float buffer atomic max/min…
justinrosner May 20, 2026
c7391a2
Enable microscaled lhs with dense FP16/BF16 matmul weights (#10316)
roman-openai May 21, 2026
0914a04
Fix alignment error in unpacked tmem store with `16x32bx2` message (#…
masahi May 21, 2026
7a5d6a3
Fix small-K swizzled MXFP4 matmul (#10343)
roman-openai May 21, 2026
2c72e74
[AMD][GLUON] Expose padd layout deduction logic to Python (#10302)
xiaohuguo2023 May 21, 2026
0158f1d
[BACKEND] Fix modeling of ld acquire op (#10346)
ThomasRaoux May 21, 2026
00e397f
[FRONTEND] [FPSAN] Introduce `tl.expect_zero` primitive (#10330)
apgoucher May 21, 2026
87d45ca
Include examples/ in source distribution (#10349)
atalman May 21, 2026
419fd86
[FollowUp] Remove unused variable in `lowerTMemLdSt` (after #10321) (…
masahi May 21, 2026
96a22b5
Disallow sub-byte local_alloc (#10351)
ThomasRaoux May 21, 2026
e283e01
[CONSAN] Add barrier before NaN init (#10352)
pawelszczerbuk May 22, 2026
609ced5
[KERNELS] Perf tuning knobs for _reduce_forward kernel. (#10361)
yongjik May 24, 2026
c517f38
[Docs] Clarify multiple_of / max_contiguous / max_constancy semantics…
adityasingh2400 May 26, 2026
efbae9a
[KERNELS] make setting idle sms process-global (#10380)
aeng-openai May 26, 2026
5868810
Fix K-ragged Blackwell activation scales (#10382)
roman-openai May 27, 2026
2c873e5
Use `acc` dtype if `out_dtype` is not specified for `tl.dot` (#10353)
mrTsjolder May 27, 2026
3e233a6
[kernels] change heuristic of smem calculation to be more accurate (#…
wendazhou May 27, 2026
fb2ee67
[Analysis] Clamp SelectOp divisibility when condConstancy reduces out…
Manas103 May 27, 2026
0418ee6
[AMD] Enable InThreadTranspose pass for RDNA3 / RDNA3.5 (gfx110x/115x…
mgehre-amd May 27, 2026
0c3303d
[KERNELS] fix hopper mxfp4 swizzle bug (#10385)
aeng-openai May 27, 2026
76d5b92
[TEST] Add 2CTA tcgen05 Gluon coverage for transposed LHS shared layo…
ThomasRaoux May 27, 2026
71ba10e
[AMD] Add AMD LLVM kernel attributes option (#10367)
raikonenfnu May 27, 2026
0475130
[Test] Add more tests for cross CTA local_load/local_store (#10344)
ThomasRaoux May 27, 2026
a7512c9
[AMD] Fix layoutToGluon() handling of PaddedSharedEncodingAttr's CGA-…
yangshuxin May 28, 2026
6650ee3
Fix gluon attention example for fp8 (#10399)
masahi May 28, 2026
5 changes: 0 additions & 5 deletions .flake8

This file was deleted.

17 changes: 17 additions & 0 deletions .github/CODEOWNERS
@@ -46,3 +46,20 @@ lib/Dialect/TritonGPU/Transforms/TritonGPUConversion.cpp @ptillet
# third_party
# -----------
third_party/amd/ @antiagainst @zhanglx13
third_party/proton/ @Jokeren @crobeck @fywkevin

# -----------
# gluon
# -----------
python/triton/experimental/gluon/ @peterbell10
python/src/gluon_ir.cc @peterbell10
python/test/gluon @peterbell10
test/Gluon @peterbell10
include/triton/Dialect/Gluon @peterbell10
lib/Dialect/Gluon @peterbell10

# -----------
# Linear Layouts
# -----------
lib/Tools/ @lezcano
lib/Dialect/TritonGPU/IR/LinearLayoutConversions.cpp @lezcano
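
The new CODEOWNERS entries can be read as an ordered rule list. GitHub gives precedence to the last pattern that matches a path, so a later, more specific rule overrides an earlier one. The sketch below illustrates that lookup for the rules in this hunk. It is a simplification and an assumption on my part: it handles only the literal path patterns that appear here (no wildcards), and `matches` and `owners_for` are hypothetical helper names, not part of any GitHub tooling.

```python
from typing import List, Optional, Tuple

# Rules copied from the CODEOWNERS hunk above, in file order.
RULES: List[Tuple[str, List[str]]] = [
    ("third_party/amd/", ["@antiagainst", "@zhanglx13"]),
    ("third_party/proton/", ["@Jokeren", "@crobeck", "@fywkevin"]),
    ("python/triton/experimental/gluon/", ["@peterbell10"]),
    ("python/src/gluon_ir.cc", ["@peterbell10"]),
    ("python/test/gluon", ["@peterbell10"]),
    ("test/Gluon", ["@peterbell10"]),
    ("include/triton/Dialect/Gluon", ["@peterbell10"]),
    ("lib/Dialect/Gluon", ["@peterbell10"]),
    ("lib/Tools/", ["@lezcano"]),
    ("lib/Dialect/TritonGPU/IR/LinearLayoutConversions.cpp", ["@lezcano"]),
]


def matches(pattern: str, path: str) -> bool:
    """Simplified match: literal root-anchored paths only, no wildcards."""
    if pattern.endswith("/"):
        # Directory pattern: matches everything underneath it.
        return path.startswith(pattern)
    # File-or-directory pattern: exact path, or anything below it.
    return path == pattern or path.startswith(pattern + "/")


def owners_for(path: str) -> Optional[List[str]]:
    """Return the owners of the last matching rule, or None if no rule matches."""
    owners = None
    for pattern, rule_owners in RULES:
        if matches(pattern, path):
            owners = rule_owners  # later rules override earlier ones
    return owners


if __name__ == "__main__":
    for p in ("lib/Tools/LinearLayout.cpp", "test/Gluon/example.mlir", "README.md"):
        print(p, "->", owners_for(p))
```

The file paths in the usage loop are illustrative inputs, not files taken from this diff. A real resolver would also need gitignore-style globbing, which this sketch deliberately leaves out.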
48 changes: 48 additions & 0 deletions .github/ISSUE_TEMPLATE/bug.yml
@@ -0,0 +1,48 @@
name: Report a bug
description: Report triton failing to compile a kernel, or giving incorrect results
labels: ["bug"]

body:
- type: markdown
  attributes:
    value: |
      #### Disclaimer
      The core triton team is small and has very limited capacity. We may not have time to look into your report.
      For the best results, please:
      - Avoid submitting duplicates. Search through [the existing and past issues](https://github.com/triton-lang/triton/issues?q=is%3Aissue+sort%3Acreated-desc+) first to see if it's been reported previously.
      - Check if the issue persists with a build from the latest source.
      - Provide all relevant information in the initial report, to prevent unnecessary back and forth discussion.
      - If you can, try to diagnose and/or fix the issue yourself. We welcome high quality contributions.
- type: textarea
  attributes:
    label: Describe the bug
    description: |
      Please provide a clear and concise description of what the bug is.

      If relevant, add a [minimal complete example](https://stackoverflow.com/help/minimal-reproducible-example) that reproduces the bug. It is very important for the snippet to be as simple as possible, so please take time to trim down any irrelevant code to help us debug efficiently. We are going to copy-paste your code and we expect to get the same result as you did, so include both the kernel and launching code as well as any relevant imports.

      If the code is too long (hopefully, it isn't), feel free to put it in a public gist and link it in the issue: https://gist.github.com.

      Please also paste or describe the results you observe instead of the expected results. If you observe an error, please paste the error message including the **full** traceback of the exception. It may be relevant to wrap error messages in ```` ```triple quotes blocks``` ````.
    placeholder: |
      A clear and concise description of what the bug is.

      ```python
      # Sample code to reproduce the problem
      ```

      ```
      The error message you got, with the full traceback.
      ```
  validations:
    required: true
- type: textarea
  attributes:
    label: Environment details
    description: |
      Please include any relevant context about how you're running the reproducer e.g. which version of triton, and what GPU you are using.
    placeholder: |
      Triton: ...
      GPU: ...
  validations:
    required: true
5 changes: 5 additions & 0 deletions .github/ISSUE_TEMPLATE/config.yml
@@ -0,0 +1,5 @@
blank_issues_enabled: true
contact_links:
  - name: Community help
    url: https://discord.gg/gpumode
    about: GPU-mode discord community has a triton channel which is a great resource for help writing/learning triton
44 changes: 44 additions & 0 deletions .github/ISSUE_TEMPLATE/performance.yml
@@ -0,0 +1,44 @@
name: Report a performance issue
description: Report cases where triton is generating sub-optimal (but functionally correct) PTX/LLVM IR
labels: ["performance"]

body:
- type: markdown
  attributes:
    value: |
      #### Disclaimer
      The core triton team is small and has very limited capacity. We may not have time to look into your report.
      For the best results, please:
      - Avoid submitting duplicates. Search through [the existing and past issues](https://github.com/triton-lang/triton/issues?q=is%3Aissue+sort%3Acreated-desc+) first to see if it's been reported previously.
      - Check if the issue persists with a build from the latest source.
      - Provide all relevant information in the initial report, to prevent unnecessary back and forth discussion.
      - If you can, try to diagnose and/or fix the issue yourself. We welcome high quality contributions.
- type: textarea
  attributes:
    label: Describe the issue
    description: |
      Please provide a clear and concise description of the issue.

      Include a [minimal complete example](https://stackoverflow.com/help/minimal-reproducible-example) that reproduces the issue. It is very important for the snippet to be as simple as possible, so please take time to trim down any irrelevant code to help us debug efficiently. We are going to copy-paste your code and we expect to get the same result as you did.

      A reproducer could be a python program that runs a triton kernel and prints out the relevant suboptimal IR, or an IR file with an accompanying triton-opt command.

      If the code is too long (hopefully, it isn't), feel free to put it in a public gist and link it in the issue: https://gist.github.com.
    placeholder: |
      A clear and concise description of the issue.

      ```python
      # Sample code to reproduce the problem
      ```
  validations:
    required: true
- type: textarea
  attributes:
    label: Environment details
    description: |
      Please include any relevant context about how you're running the reproducer e.g. which version of triton, and what GPU you are using.
    placeholder: |
      Triton: ...
      GPU: ...
  validations:
    required: true
3 changes: 3 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -1,11 +1,14 @@
<!---
The core Triton is a small number of people, and we receive many PRs (thank
you!). To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the following
tasks and include the filled-out checklist in your PR description.**
Complete the following tasks before sending your PR, and replace `[ ]` with
`[x]` to indicate you have done them.
-->

# New contributor declaration
- [ ] I am not making a trivial change, such as fixing a typo in a comment.

- [ ] I have written a PR description following these
135 changes: 135 additions & 0 deletions .github/workflows/build-macos.yml
@@ -0,0 +1,135 @@
name: Build MacOS

on:
workflow_call:
inputs:
matrix:
required: true
type: string

jobs:
build-macos:
runs-on: ${{ matrix.runner }}
strategy:
matrix:
runner: ${{ fromJson(inputs.matrix) }}
timeout-minutes: 60
env:
RUNNER_TYPE: ${{ matrix.runner[0] }}
TRITON_BUILD_WITH_CLANG_LLD: "TRUE"
name: Build MacOS
steps:
- name: Checkout
uses: actions/checkout@v6
with:
submodules: "true"
- name: Install brew dependencies
run: |
brew update
brew install ccache llvm@19 lld coreutils
- name: Compute cache keys
id: cache-key
run: |
llvm_file="cmake/llvm-hash.txt"
nvidia_file="cmake/nvidia-toolchain-version.json"
json_file="cmake/json-version.txt"

# Check if files exist before proceeding
if [[ ! -f "$llvm_file" || ! -f "$nvidia_file" || ! -f "$json_file" ]]; then
echo "Error: Required dependency files are missing."
exit 1
fi

# Process the files if they exist
echo "llvm=$(cat $llvm_file | cut -c 1-8)" >> $GITHUB_OUTPUT
echo "nvidia=$(sha256sum $nvidia_file | cut -d ' ' -f 1)" >> $GITHUB_OUTPUT
echo "json=$(cat $json_file)" >> $GITHUB_OUTPUT
echo "datetime=$(date -u -Iseconds)" >> $GITHUB_OUTPUT
shell: bash
- name: Cache build dependencies
uses: actions/cache@v4
with:
# Note that we cannot use environment variables here given there is
# no shell to interpret them in the paths.
path: |
~/.triton/llvm
~/.triton/nvidia
~/.triton/json
key: ${{ runner.os }}-${{ runner.arch }}-llvm-${{ steps.cache-key.outputs.llvm }}-nvidia-${{ steps.cache-key.outputs.nvidia }}-json-${{ steps.cache-key.outputs.json }}
- # Cache ~/.cache/ccache to speed up compilation.
#
# On branch `main` we always start from an empty cache, i.e. we skip the
# "restore" step. This is to prevent the caches from accumulating stale
# files over time.
name: Restore cache of ccache and Triton compilation artifacts
id: restore-build-cache
if: github.ref != 'refs/heads/main'
uses: actions/cache/restore@v4
with:
path: |
~/.ccache
# Restore the most recent cache entry.
restore-keys: |
triton-artifacts-${{ runner.os }}-${{ runner.arch }}-${{ env.RUNNER_TYPE }}-llvm-${{ steps.cache-key.outputs.llvm }}-
triton-artifacts-${{ runner.os }}-${{ runner.arch }}-${{ env.RUNNER_TYPE }}-
# We expect this cache key never to hit and for us to fall back
# unconditionally to the restore-key, so it doesn't actually matter
# what we put here (so long as it doesn't hit an existing key).
key: triton-artifacts-${{ runner.os }}-${{ runner.arch }}-${{ env.RUNNER_TYPE }}-llvm-${{ steps.cache-key.outputs.llvm }}-${{ steps.cache-key.outputs.datetime }}
- name: Inspect cache directories
run: |
mkdir -p ~/.triton
du -h -d 1 ~/.triton

mkdir -p ~/.ccache
du -h -d 1 ~/.ccache
- name: Update PATH
run: |
echo "$HOME/.local/bin" >> $GITHUB_PATH
echo "/opt/homebrew/opt/llvm/bin" >> $GITHUB_PATH
- name: Create venv
run: |
python3 -m venv ~/.venv
source ~/.venv/bin/activate
python3 -m pip install --upgrade pip
- name: Install Triton
env:
TRITON_BUILD_WITH_O1: "true"
# macos-latest has 3 vCPUs and 7 GB of RAM; to save memory, we limit the number of jobs to 3.
# https://docs.github.com/en/actions/reference/github-hosted-runners-reference#standard-github-hosted-runners-for-public-repositories
MAX_JOBS: 3
# Add elapsed time in seconds to ninja status to monitor where build stalls
NINJA_STATUS: "[%f/%t, %es elapsed] "
run: |
source ~/.venv/bin/activate
echo "PATH is '$PATH'"
ccache --zero-stats
export PATH="/opt/homebrew/opt/llvm@19/bin:$PATH"
export CC="/opt/homebrew/opt/llvm@19/bin/clang"
export CXX="/opt/homebrew/opt/llvm@19/bin/clang++"
export CXXFLAGS="-stdlib=libc++"
export LDFLAGS="-L/opt/homebrew/opt/llvm@19/lib"
which clang++
clang++ --version
make dev-install
- name: CCache Stats
run: ccache --print-stats
- name: Inspect cache directories
run: |
mkdir -p ~/.triton
du -h -d 1 ~/.triton

mkdir -p ~/.ccache
du -h -d 1 ~/.ccache
- # If we're on branch `main`, save the ccache Triton compilation artifacts
# to the cache so they can be used by other (non-main) CI runs.
#
# (It wouldn't be a problem to save the cache on every run, because GitHub
# evicts cache entries in LRU order, but maybe this saves a bit of time in CI.)
name: Save ccache and Triton compilation artifacts to cache
if: github.ref == 'refs/heads/main'
uses: actions/cache/save@v4
with:
path: |
~/.ccache
key: triton-artifacts-${{ runner.os }}-${{ runner.arch }}-${{ env.RUNNER_TYPE }}-llvm-${{ steps.cache-key.outputs.llvm }}-${{ steps.cache-key.outputs.datetime }}
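The restore step in build-macos.yml relies on its `key` never matching (it ends in a fresh timestamp), so the lookup always falls through to the `restore-keys` prefixes. The sketch below illustrates that fallback under simplifying assumptions: `pick_cache_entry` is a hypothetical helper, not part of `actions/cache`; the list of existing keys is assumed to be sorted newest first; and branch scoping of caches is ignored. The sample key values are made up for illustration.

```shell
# Hypothetical helper: choose a cache entry by prefix, in priority order.
#   $1     newline-separated existing cache keys, newest first (assumption)
#   $2...  restore-key prefixes, highest priority first
pick_cache_entry() {
  local existing="$1"; shift
  local prefix key
  for prefix in "$@"; do
    while IFS= read -r key; do
      # The first (newest) key that starts with this prefix wins.
      if [[ -n "$key" && "$key" == "$prefix"* ]]; then
        echo "$key"
        return 0
      fi
    done <<< "$existing"
  done
  return 1  # no prefix matched: the job starts from an empty cache
}

# Illustrative keys only; the LLVM hashes and timestamps are invented.
existing=$'triton-artifacts-macOS-ARM64-macos-latest-llvm-aaaaaaaa-2026-04-16T00:00:00+00:00\ntriton-artifacts-macOS-ARM64-macos-latest-llvm-bbbbbbbb-2026-04-15T00:00:00+00:00'

pick_cache_entry "$existing" \
  "triton-artifacts-macOS-ARM64-macos-latest-llvm-bbbbbbbb-" \
  "triton-artifacts-macOS-ARM64-macos-latest-"
```

In this sketch the first prefix pins the LLVM hash, so an older entry built against the same LLVM is preferred over a newer entry built against a different one. The second, shorter prefix only applies when no entry for the current LLVM hash exists.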
43 changes: 43 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,43 @@
name: Integration Tests
on:
workflow_dispatch:
pull_request:
branches-ignore: ['llvm-**']
merge_group:
branches: [main, 'dev-**']
types: [checks_requested]
push:
branches: [main]
concurrency:
group: ${{ github.ref }}
cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
permissions: read-all

jobs:

runner-preparation:
uses: ./.github/workflows/runner-preparation.yml

pre-commit:
uses: ./.github/workflows/pre-commit.yml

integration-tests-nvidia:
needs: runner-preparation
if: needs.runner-preparation.outputs.matrix-NVIDIA != ''
uses: ./.github/workflows/integration-tests-nvidia.yml
with:
matrix: ${{ needs.runner-preparation.outputs.matrix-NVIDIA }}

integration-tests-amd:
needs: runner-preparation
if: needs.runner-preparation.outputs.matrix-AMD != ''
uses: ./.github/workflows/integration-tests-amd.yml
with:
matrix: ${{ needs.runner-preparation.outputs.matrix-AMD }}

build-macos:
needs: runner-preparation
if: needs.runner-preparation.outputs.matrix-MACOS != ''
uses: ./.github/workflows/build-macos.yml
with:
matrix: ${{ needs.runner-preparation.outputs.matrix-MACOS }}
77 changes: 77 additions & 0 deletions .github/workflows/create_release.yml
@@ -0,0 +1,77 @@
name: Create Release

on:
push:
branches:
- main
- release/*
tags:
# Final Release tags look like: v1.11.0
- v[0-9]+.[0-9]+.[0-9]+
# Release candidate tags look like: v1.11.0-rc1
- v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+
release:
types: [published]
pull_request:
paths: [.github/workflows/create_release.yml]

jobs:

release:
if: ${{ github.repository == 'triton-lang/triton' }}
name: Create Release
runs-on: ubuntu-latest
permissions:
contents: write
outputs:
release_name: "${{ steps.release_name.outputs.name }}"
steps:
- uses: actions/checkout@v6
with:
show-progress: false
submodules: 'recursive'
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
- name: Fake name for PRs
if: ${{ github.event_name == 'pull_request' }}
run: echo "PT_GITHUB_REF=refs/tags/pr-tag" >> "$GITHUB_ENV"
- name: Real name for non-PRs
if: ${{ github.event_name != 'pull_request' }}
run: echo "PT_GITHUB_REF=$GITHUB_REF" >> "$GITHUB_ENV"
- name: Set filenames
run: |
tag_or_branch="${PT_GITHUB_REF#refs/tags/}"
tag_or_branch="${tag_or_branch#refs/heads/}"
# replace directory separators with _ in branch name
tag_or_branch="${tag_or_branch//\//_}"
if [[ ${tag_or_branch} == v* ]]; then
# strip leading v from tag name
tag_or_branch="${tag_or_branch#v}"
# important: version must be fixed in setup.py
sed -i -e "s:^TRITON_VERSION = .*:TRITON_VERSION = '${tag_or_branch}':" setup.py || exit 1
fi
echo "RELEASE_NAME=triton-$tag_or_branch" >> "$GITHUB_ENV"
- name: Create source distribution
run: |
pip install build || exit 1
python -m build -s || exit 1
cd dist || exit 1
release_file=( *.tar.gz )
echo "RELEASE_FILE=${release_file}" >> "$GITHUB_ENV"
- name: Upload source distribution for release
if: ${{ github.event_name == 'release' }}
uses: softprops/action-gh-release@v3
with:
files: dist/${{env.RELEASE_FILE}}
- name: Upload source distribution to GHA artifacts for release candidate tags
if: ${{ github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v') && contains(github.ref, 'rc') }}
uses: actions/upload-artifact@v7
with:
name: ${{ env.RELEASE_FILE }}
path: dist/${{ env.RELEASE_FILE }}
- name: Set output
id: release_name
run: echo "name=${{ env.RELEASE_NAME }}.tar.gz" >> "${GITHUB_OUTPUT}"

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name }}
cancel-in-progress: true
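The "Set filenames" step in create_release.yml derives the release name from the Git ref with a chain of bash parameter expansions. The sketch below repeats that chain as a standalone function so it can be tried locally. `release_name_from_ref` is a hypothetical name introduced here, the example refs are invented, and the `sed` rewrite of `TRITON_VERSION` in `setup.py` is deliberately left out.

```shell
# Hypothetical helper mirroring the expansions in the "Set filenames" step.
release_name_from_ref() {
  local ref="$1"
  # Drop the refs/tags/ or refs/heads/ prefix, whichever is present.
  local tag_or_branch="${ref#refs/tags/}"
  tag_or_branch="${tag_or_branch#refs/heads/}"
  # Replace every directory separator with an underscore.
  tag_or_branch="${tag_or_branch//\//_}"
  # Strip the leading "v" from names such as v1.11.0.
  if [[ ${tag_or_branch} == v* ]]; then
    tag_or_branch="${tag_or_branch#v}"
  fi
  echo "triton-$tag_or_branch"
}

release_name_from_ref "refs/tags/v1.11.0-rc1"     # prints triton-1.11.0-rc1
release_name_from_ref "refs/heads/release/1.11"   # prints triton-release_1.11
release_name_from_ref "refs/tags/pr-tag"          # prints triton-pr-tag
```

Note that the `v*` test applies to any name beginning with `v`, so a branch whose name starts with `v` would also lose its first character.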