Merged
46 commits
1385fe3
vulkan: allow using fp16 in coopmat1 flash attention shader
0cc4m Feb 3, 2026
3419b41
split rows inside of subgroups for faster synchronization
0cc4m Feb 3, 2026
e924abe
use row_split when Br >= 4, change reductions to use shared memory if…
0cc4m Feb 5, 2026
29751e2
use f32 scalar FA if f16 is not supported by device
0cc4m Feb 5, 2026
65124f1
fix amd workgroup size issue
0cc4m Feb 5, 2026
ee17050
optimize masksh use
0cc4m Feb 6, 2026
2186838
add medium rows FA shader Br size
0cc4m Feb 6, 2026
b4f1f64
fixes
0cc4m Feb 7, 2026
b9f155d
add padding to mask shmem buffer
0cc4m Feb 7, 2026
c932e0d
cache q values into registers for KQ
0cc4m Feb 7, 2026
d2f428c
fuse lf accumulation, pf and v accumulation into a loop
0cc4m Feb 8, 2026
2b3ed40
stage K loads through shmem
0cc4m Feb 8, 2026
a4200ef
stage V loads through shmem
0cc4m Feb 8, 2026
7a6e762
only stage through shmem on Nvidia
0cc4m Feb 8, 2026
e1ac2d9
default to Bc 32
0cc4m Feb 8, 2026
26ad714
also stage V through shmem when this is done for K
0cc4m Feb 8, 2026
53acdd0
dynamic subgroups for intel
0cc4m Feb 8, 2026
8a28404
use vectorized stores
0cc4m Feb 9, 2026
1fd08bb
use float_type for dequantize4 functions
0cc4m Feb 9, 2026
45c0775
use smaller scalar rows size for smaller rows count
0cc4m Feb 10, 2026
6c7d10e
relax flash attention split_k condition to allow non-gqa use
0cc4m Feb 10, 2026
9d79f3f
use minimal subgroup size on Intel
0cc4m Feb 10, 2026
0d7ed79
fix shmem support function
0cc4m Feb 12, 2026
3820695
fix rebase issues
0cc4m Feb 12, 2026
638028f
fixes
0cc4m Feb 12, 2026
4f93601
Bc 4 for scalar FA is not a valid configuration
0cc4m Feb 12, 2026
52c8b67
Use wave32 on AMD RDNA for scalar FA
0cc4m Feb 12, 2026
fb41148
add Intel shader core count lookup-table
0cc4m Feb 13, 2026
a9d3f12
fix regressions
0cc4m Feb 14, 2026
2db2f21
device tuning
0cc4m Feb 14, 2026
93ae001
tmpsh size fix
0cc4m Feb 14, 2026
2975244
fix editorconfig
0cc4m Feb 14, 2026
f376795
refactor fa tuning logic into a single place
0cc4m Feb 17, 2026
05d2283
fix gqa opt logic
0cc4m Feb 17, 2026
851e832
fix block_rows with small n_rows
0cc4m Feb 17, 2026
9746ae1
amd tuning
0cc4m Feb 17, 2026
d7c934c
fix hsk=72/80 issue
0cc4m Feb 17, 2026
497c3e7
tuning
0cc4m Feb 18, 2026
1cce6cd
allow condition skipping for column check
0cc4m Feb 19, 2026
ad37f12
use float16 for Of if available
0cc4m Feb 19, 2026
6f2dacd
address feedback
0cc4m Feb 20, 2026
87e6f1b
fix bad RDNA performance on head size <= 128 by limiting occupancy
0cc4m Feb 21, 2026
a740402
allow printing pipeline stats
0cc4m Feb 22, 2026
b28bfea
cleanup and fixes
0cc4m Feb 22, 2026
c73e128
limit occupancy for GCN for small batch FA with large HSK
0cc4m Feb 22, 2026
ae849d3
disable f16 FA for GCN AMD GPUs on the proprietary driver
0cc4m Feb 23, 2026