UPSTREAM PR #19625: Vulkan Scalar Flash Attention Refactor by loci-dev · Pull Request #1178 · auroralabs-loci/llama.cpp

loci-dev · 2026-02-15T03:08:54Z

Note

Source pull request: ggml-org/llama.cpp#19625

This started out as an attempt to go through the scalar FA version and add proper float16 support to improve AMD and Intel performance and went quite a bit further. @jeffbolznv Sorry about the amount of changes, let me know if there's something I can do to make the review easier. Please also let me know if you have architectural concerns. Flash Attention has so many dimensions and making it work well on so much hardware and models is pretty hard. I had to spend quite a lot of time figuring out and fixing regressions on specific configurations.

AI-generated summary of changes

Scalar Flash Attention Core Optimizations

Implemented row splitting within workgroups (row_split = 1 or 4) for better subgroup utilization
Added shared memory staging for K and V loads on Nvidia GPUs when head sizes < 256
Cached Q values in registers for KQ computation when HSK_per_thread > 16
Fused loop for Lf accumulation and Of scaling by eMf
Changed to vectorized vec4 stores for output
Optimized masksh layout with stride padding (Br + 1) and removed unnecessary barrier

Row Size Tiering

Replaced binary small_rows/large_rows with three-tier system: FA_ROWS_1, FA_ROWS_SMALL, FA_ROWS_LARGE
Dynamic Br selection based on head sizes, device vendor, and architecture
FA_ROWS_1 uses Br=1 for N=1, FA_ROWS_SMALL uses Br=8, FA_ROWS_LARGE uses Br=16
Device-specific adjustments: AMD GCN uses smaller Br, Intel uses Br=8 maximum

Vendor-Specific Optimizations

AMD RDNA: Use wave32 subgroup size for scalar FA when N=1
Intel: Added shader core count lookup table for Alchemist and Battlemage GPUs
Intel: Disable subgroup operations in favor of shared memory reductions
Intel Alchemist: Apply 2x shader core count multiplier for split_k calculation
Adjusted workgroup sizes per vendor and head size combinations

split_k Enhancements

Relaxed split_k conditions to support non-GQA workloads
Fixed dispatch logic to handle both GQA and non-GQA cases correctly
Improved split_k calculation based on total workgroup count and shader cores

Device Compatibility

Added FP32 shader variants (_fp32 suffix) for devices without FP16 support
Made FLOAT_TYPE conditional on device capabilities
Updated dequantize4 functions to use FLOAT_TYPE instead of hardcoded float

Shared Memory Management

Dynamic tmpsh sizing based on row_split and subgroup configuration
Added kvsh buffer for K/V staging (size conditional on SHMEM_STAGING flag)
Improved Qf buffer stride calculation
Fixed tmpsh size calculation for split_k temporaries

Code Path Selection

Switch from coopmat1 to scalar when N=1 or rows=FA_ROWS_1
Improved shared memory size checks for scalar path fallback
Better alignment checking and stride validation

Shader Compilation

Made coopmat1/coopmat2 pipeline creation conditional on device FP16 support
Added subgroup size configuration per code path and row configuration
Removed hardcoded subgroup size assumptions

Benchmarks

AMD Radeon Pro VII

model	size	params	ngl	fa	test	t/s (ROCm)	t/s (before)	t/s (after)	diff
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512	1003.15 ± 0.89	800.28 ± 1.41	827.57 ± 0.74	+3.4%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128	85.12 ± 1.39	98.55 ± 0.55	97.83 ± 0.47	-0.7%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d8192	689.31 ± 0.64	174.36 ± 0.42	388.72 ± 3.37	+122.9%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d8192	69.91 ± 0.20	55.97 ± 0.20	72.24 ± 0.34	+29.1%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d16384	525.25 ± 1.68	84.33 ± 0.11	247.07 ± 1.51	+193.0%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d16384	60.48 ± 0.17	41.46 ± 0.12	57.70 ± 0.57	+39.2%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512	1061.99 ± 7.85	1319.64 ± 7.82	1321.90 ± 6.90	+0.2%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128	110.86 ± 0.97	136.10 ± 0.27	127.75 ± 0.88	-6.1%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d8192	745.39 ± 1.25	757.62 ± 3.94	740.88 ± 4.66	-2.2%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d8192	101.64 ± 0.41	116.38 ± 0.17	113.37 ± 0.93	-2.6%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d16384	577.95 ± 3.32	509.10 ± 3.64	484.85 ± 2.85	-4.8%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d16384	99.23 ± 0.21	107.31 ± 0.68	102.88 ± 1.13	-4.1%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512	351.98 ± 3.24	749.40 ± 5.15	759.11 ± 4.74	+1.3%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128	68.83 ± 0.11	95.12 ± 0.22	93.94 ± 0.45	-1.2%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d8192	295.91 ± 3.09	207.63 ± 0.63	312.17 ± 5.34	+50.3%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d8192	60.01 ± 0.77	55.87 ± 0.35	73.73 ± 0.68	+32.0%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d16384	247.76 ± 0.77	114.90 ± 0.42	191.18 ± 1.32	+66.4%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d16384	55.69 ± 0.30	44.11 ± 0.11	61.76 ± 0.63	+40.0%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512	641.90 ± 2.66	657.73 ± 3.46	740.63 ± 1.78	+12.6%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128	47.72 ± 0.13	64.38 ± 0.19	65.54 ± 0.32	+1.8%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512 @ d8192	293.28 ± 0.54	83.15 ± 0.33	129.38 ± 0.69	+55.6%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128 @ d8192	38.76 ± 0.07	35.93 ± 0.20	37.94 ± 0.33	+5.6%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512 @ d16384	189.33 ± 0.18	41.62 ± 0.24	70.77 ± 0.49	+70.0%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128 @ d16384	31.80 ± 0.08	24.39 ± 0.36	26.41 ± 0.22	+8.3%

AMD 8060S

model	size	params	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512	994.34 ± 34.50	947.41 ± 7.78	-4.7%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128	45.14 ± 0.44	44.86 ± 0.42	-0.6%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d8192	418.71 ± 11.10	397.77 ± 8.90	-5.0%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d8192	35.83 ± 0.09	35.68 ± 0.08	-0.4%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d16384	234.05 ± 5.66	246.05 ± 11.58	+5.1%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d16384	30.53 ± 0.08	30.13 ± 0.11	-1.3%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512	1263.73 ± 34.96	1208.77 ± 37.78	-4.3%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128	73.19 ± 0.13	72.68 ± 0.10	-0.7%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d8192	920.01 ± 4.93	919.00 ± 4.71	-0.1%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d8192	66.74 ± 0.45	66.42 ± 0.13	-0.5%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d16384	670.22 ± 4.61	670.46 ± 5.07	+0.0%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d16384	61.53 ± 0.78	61.78 ± 1.08	+0.4%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512	945.03 ± 32.97	992.30 ± 11.33	+5.0%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128	91.76 ± 0.06	91.60 ± 0.53	-0.2%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d8192	487.96 ± 2.76	479.56 ± 4.25	-1.7%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d8192	66.47 ± 0.33	66.13 ± 0.27	-0.5%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d16384	302.07 ± 1.01	286.72 ± 1.03	-5.1%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d16384	50.54 ± 0.19	49.64 ± 0.88	-1.8%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	pp512	924.97 ± 10.45	923.58 ± 4.06	-0.2%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	tg128	61.52 ± 0.34	61.43 ± 0.41	-0.1%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	pp512 @ d8192	306.02 ± 0.84	297.15 ± 0.91	-2.9%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	tg128 @ d8192	38.31 ± 0.20	39.20 ± 0.17	+2.3%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	pp512 @ d16384	192.72 ± 0.35	182.25 ± 0.82	-5.4%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	tg128 @ d16384	27.83 ± 0.16	28.83 ± 0.01	+3.6%

AMD 8060S (Without Coopmat)

model	size	params	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512	815.03 ± 7.22	822.68 ± 4.39	+0.9%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128	44.96 ± 0.22	45.36 ± 0.30	+0.9%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d8192	67.06 ± 4.00	190.34 ± 2.98	+183.8%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d8192	31.53 ± 0.13	35.31 ± 0.28	+12.0%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d16384	28.05 ± 0.85	78.89 ± 4.18	+181.2%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d16384	25.53 ± 0.17	29.71 ± 0.08	+16.4%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512	1249.96 ± 37.10	1187.02 ± 15.67	-5.0%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128	73.17 ± 0.06	72.39 ± 0.23	-1.1%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d8192	681.99 ± 1.44	681.63 ± 2.60	-0.1%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d8192	66.34 ± 0.35	66.37 ± 0.21	+0.0%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d16384	438.09 ± 2.70	408.44 ± 7.02	-6.8%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d16384	61.46 ± 0.62	61.54 ± 0.76	+0.1%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512	617.33 ± 13.14	614.00 ± 6.22	-0.5%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128	94.84 ± 0.20	92.14 ± 0.22	-2.8%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d8192	179.49 ± 0.92	227.94 ± 1.12	+27.0%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d8192	57.91 ± 0.39	67.14 ± 0.11	+15.9%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d16384	86.39 ± 0.78	128.04 ± 0.64	+48.2%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d16384	43.22 ± 0.18	51.58 ± 0.14	+19.3%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	pp512	727.26 ± 4.81	810.87 ± 5.13	+11.5%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	tg128	61.59 ± 0.70	61.90 ± 0.12	+0.5%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	pp512 @ d8192	105.57 ± 0.50	178.01 ± 0.22	+68.6%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	tg128 @ d8192	38.58 ± 0.19	39.50 ± 0.33	+2.4%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	pp512 @ d16384	52.56 ± 0.29	94.60 ± 0.41	+80.0%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	tg128 @ d16384	28.02 ± 0.18	28.98 ± 0.06	+3.4%

Intel A770

model	size	params	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512	818.22 ± 0.63	812.84 ± 1.85	-0.7%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128	32.64 ± 0.07	32.45 ± 0.05	-0.6%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d2048	97.15 ± 0.05	550.81 ± 1.20	+467.0%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d2048	21.67 ± 0.02	27.75 ± 0.02	+28.1%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d4096	43.79 ± 2.97	405.21 ± 0.78	+825.3%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d4096	17.28 ± 0.00	25.06 ± 0.01	+45.0%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512	930.73 ± 3.24	898.65 ± 3.47	-3.4%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128	41.29 ± 0.07	37.53 ± 0.11	-9.1%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d2048	701.16 ± 3.52	670.17 ± 4.91	-4.4%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d2048	31.19 ± 0.06	31.73 ± 0.03	+1.7%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d4096	545.63 ± 1.16	495.18 ± 0.71	-9.2%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d4096	28.83 ± 0.09	29.27 ± 0.04	+1.5%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512	640.10 ± 3.55	657.27 ± 3.54	+2.7%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128	33.43 ± 0.08	30.04 ± 0.03	-10.1%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d2048	60.27 ± 4.78	281.25 ± 1.21	+366.7%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d2048	20.16 ± 0.02	22.98 ± 0.03	+14.0%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d4096	26.38 ± 0.63	310.19 ± 1.68	+1075.9%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d4096	18.27 ± 0.03	23.61 ± 0.08	+29.2%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512	167.35 ± 0.17	66.63 ± 0.23	-60.2%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128	19.23 ± 0.01	20.38 ± 0.03	+6.0%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512 @ d2048	26.23 ± 1.02	25.38 ± 0.01	-3.2%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128 @ d2048	5.95 ± 0.00	13.59 ± 0.01	+128.4%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512 @ d4096	25.54 ± 0.02	25.29 ± 0.04	-1.0%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128 @ d4096	3.64 ± 0.00	10.37 ± 0.00	+184.9%

Nvidia RTX 3090 (Coopmat2)

model	size	params	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512	4666.60 ± 19.46	4721.23 ± 12.32	+1.2%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128	144.71 ± 1.53	147.49 ± 0.52	+1.9%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d8192	3426.64 ± 19.29	3428.98 ± 22.04	+0.1%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d8192	114.85 ± 0.97	115.92 ± 0.34	+0.9%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d16384	2695.37 ± 16.65	2692.89 ± 16.34	-0.1%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d16384	99.65 ± 0.73	99.82 ± 0.29	+0.2%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512	4520.31 ± 33.68	4513.71 ± 30.22	-0.1%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128	177.65 ± 0.75	177.15 ± 0.77	-0.3%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d8192	4040.47 ± 78.90	4049.94 ± 174.56	+0.2%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d8192	156.59 ± 1.58	155.91 ± 0.78	-0.4%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d16384	3546.97 ± 21.35	3529.89 ± 36.63	-0.5%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d16384	147.96 ± 0.76	145.37 ± 0.48	-1.8%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512	3469.59 ± 17.36	3465.49 ± 34.45	-0.1%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128	178.72 ± 0.64	177.48 ± 2.05	-0.7%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d8192	2508.75 ± 42.02	2500.37 ± 34.47	-0.3%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d8192	141.66 ± 0.54	141.16 ± 0.65	-0.4%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d16384	1942.67 ± 15.90	1936.24 ± 20.12	-0.3%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d16384	123.39 ± 0.72	123.21 ± 0.29	-0.1%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512	2287.89 ± 11.77	2289.12 ± 9.34	+0.1%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128	116.47 ± 0.80	114.38 ± 3.56	-1.8%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512 @ d8192	1047.29 ± 9.19	1047.12 ± 9.51	-0.0%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128 @ d8192	90.74 ± 0.34	90.44 ± 0.37	-0.3%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512 @ d16384	647.46 ± 3.70	644.65 ± 3.78	-0.4%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128 @ d16384	81.92 ± 0.81	82.07 ± 0.20	+0.2%

Nvidia RTX 3090 (Coopmat1)

model	size	params	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512	4117.11 ± 10.81	4052.19 ± 17.94	-1.6%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128	145.98 ± 1.84	144.04 ± 0.74	-1.3%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d8192	2182.12 ± 11.97	2359.95 ± 10.14	+8.1%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d8192	115.72 ± 0.56	116.46 ± 0.62	+0.6%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d16384	1486.54 ± 4.89	1671.90 ± 9.35	+12.5%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d16384	99.15 ± 0.74	101.36 ± 0.32	+2.2%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512	3062.95 ± 94.07	3090.31 ± 33.32	+0.9%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128	175.29 ± 0.83	175.87 ± 0.88	+0.3%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d8192	2439.28 ± 32.02	2494.98 ± 47.57	+2.3%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d8192	148.99 ± 14.70	154.40 ± 2.18	+3.6%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d16384	1964.74 ± 21.60	2098.26 ± 19.00	+6.8%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d16384	147.55 ± 0.70	147.66 ± 0.69	+0.1%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512	2839.27 ± 26.12	2837.32 ± 30.26	-0.1%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128	174.78 ± 1.25	176.05 ± 1.26	+0.7%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d8192	1505.57 ± 14.41	1639.74 ± 14.94	+8.9%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d8192	137.34 ± 0.86	139.22 ± 2.10	+1.4%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d16384	1010.90 ± 10.49	1146.23 ± 14.19	+13.4%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d16384	119.58 ± 0.71	121.95 ± 0.88	+2.0%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512	1968.30 ± 10.15	1954.94 ± 33.29	-0.7%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128	114.35 ± 0.87	115.05 ± 0.80	+0.6%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512 @ d8192	554.73 ± 1.56	555.49 ± 1.82	+0.1%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128 @ d8192	62.50 ± 0.51	63.21 ± 0.34	+1.1%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512 @ d16384	314.59 ± 0.93	315.91 ± 1.26	+0.4%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128 @ d16384	43.01 ± 0.10	43.98 ± 0.15	+2.3%

Nvidia RTX 3090 (Without Coopmat)

model	size	params	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512	2129.81 ± 5.52	2081.00 ± 42.53	-2.3%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128	145.98 ± 0.24	144.26 ± 0.53	-1.2%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d8192	997.77 ± 3.31	1048.43 ± 25.28	+5.1%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d8192	110.19 ± 0.54	112.16 ± 0.12	+1.8%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d16384	637.54 ± 1.09	701.26 ± 11.14	+10.0%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d16384	94.33 ± 0.22	95.27 ± 0.31	+1.0%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512	2410.79 ± 15.88	2331.15 ± 89.00	-3.3%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128	176.60 ± 0.74	173.28 ± 0.72	-1.9%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d8192	1582.99 ± 17.17	1429.18 ± 11.60	-9.7%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d8192	153.60 ± 1.60	150.58 ± 0.91	-2.0%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d16384	1114.36 ± 154.82	1009.61 ± 23.16	-9.4%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d16384	146.14 ± 0.64	143.19 ± 1.18	-2.0%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512	1159.21 ± 12.74	1137.29 ± 13.35	-1.9%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128	177.45 ± 1.07	175.96 ± 1.95	-0.8%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d8192	592.47 ± 4.68	620.55 ± 6.11	+4.7%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d8192	130.00 ± 0.58	135.84 ± 1.70	+4.5%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d16384	387.10 ± 1.89	425.32 ± 0.85	+9.9%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d16384	113.49 ± 0.51	117.90 ± 0.71	+3.9%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512	1050.83 ± 17.39	1092.14 ± 16.92	+3.9%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128	114.66 ± 2.79	115.36 ± 3.33	+0.6%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512 @ d8192	281.20 ± 1.84	342.26 ± 2.76	+21.7%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128 @ d8192	63.73 ± 0.06	63.90 ± 0.37	+0.3%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512 @ d16384	159.38 ± 1.00	202.89 ± 2.03	+27.3%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128 @ d16384	43.40 ± 0.05	44.22 ± 0.09	+1.9%

loci-review · 2026-02-15T03:55:46Z

The analysis encountered an error. Please review the Processing Details for more information.

loci-review · 2026-02-15T03:56:49Z

The analysis encountered an error. Please review the Processing Details for more information.

loci-review · 2026-02-22T04:03:05Z

No meaningful performance changes were detected across 111507 analyzed functions in the following binaries: build.bin.libllama.so, build.bin.llama-tts, build.bin.llama-cvector-generator, build.bin.libmtmd.so, build.bin.llama-tokenize, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

… row_split == 1

loci-dev temporarily deployed to PROD__AL_DEMO February 15, 2026 03:08 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 10 times, most recently from 9ea4a65 to c001e9f Compare February 22, 2026 02:17

loci-dev force-pushed the loci/pr-19625-0cc4m-vulkan-fa-scalar-opt branch from 32d504c to 378d110 Compare February 22, 2026 03:07

loci-dev temporarily deployed to PROD__AL_DEMO February 22, 2026 03:07 — with GitHub Actions Inactive

0cc4m added 14 commits February 22, 2026 08:22

vulkan: allow using fp16 in coopmat1 flash attention shader

db57aa2

split rows inside of subgroups for faster synchronization

3bf3a91

use row_split when Br >= 4, change reductions to use shared memory if…

6ed5e08

… row_split == 1

use f32 scalar FA if f16 is not supported by device

05907cf

fix amd workgroup size issue

856bd66

optimize masksh use

a7bb3d2

add medium rows FA shader Br size

d587792

fixes

a824ff8

add padding to mask shmem buffer

c6ff101

cache q values into registers for KQ

8d33118

fuse lf accumulation, pf and v accumulation into a loop

691e275

stage K loads through shmem

971e39a

stage V loads through shmem

6c50ce0

only stage through shmem on Nvidia

0aedc9b

loci-dev force-pushed the main branch 8 times, most recently from 13648e6 to 1d064d0 Compare March 3, 2026 02:17

loci-dev force-pushed the main branch 8 times, most recently from 551dfb5 to 55a969e Compare March 11, 2026 02:16

loci-dev force-pushed the main branch 10 times, most recently from 5ac00d6 to 998dd7a Compare March 18, 2026 02:17

loci-dev force-pushed the main branch 4 times, most recently from 945fa3a to 0e8e1d6 Compare March 20, 2026 02:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UPSTREAM PR #19625: Vulkan Scalar Flash Attention Refactor#1178

UPSTREAM PR #19625: Vulkan Scalar Flash Attention Refactor#1178
loci-dev wants to merge 45 commits intomainfrom
loci/pr-19625-0cc4m-vulkan-fa-scalar-opt

loci-dev commented Feb 15, 2026

Uh oh!

loci-review bot commented Feb 15, 2026

Uh oh!

loci-review bot commented Feb 15, 2026

Uh oh!

loci-review bot commented Feb 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

loci-dev commented Feb 15, 2026

Scalar Flash Attention Core Optimizations

Row Size Tiering

Vendor-Specific Optimizations

split_k Enhancements

Device Compatibility

Shared Memory Management

Code Path Selection

Shader Compilation

Benchmarks

Uh oh!

loci-review bot commented Feb 15, 2026

Uh oh!

loci-review bot commented Feb 15, 2026

Uh oh!

loci-review bot commented Feb 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants