
UPSTREAM PR #19625: Vulkan Scalar Flash Attention Refactor #1178

Open

loci-dev wants to merge 45 commits into main from loci/pr-19625-0cc4m-vulkan-fa-scalar-opt

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#19625

This started out as an attempt to go through the scalar FA version and add proper float16 support to improve AMD and Intel performance, and it went quite a bit further. @jeffbolznv Sorry about the amount of changes; let me know if there's anything I can do to make the review easier. Please also let me know if you have architectural concerns. Flash Attention has so many dimensions, and making it work well across this much hardware and this many models is pretty hard. I had to spend quite a lot of time figuring out and fixing regressions on specific configurations.

AI-generated summary of changes

Scalar Flash Attention Core Optimizations

  • Implemented row splitting within workgroups (row_split = 1 or 4) for better subgroup utilization
  • Added shared memory staging for K and V loads on Nvidia GPUs when head sizes < 256
  • Cached Q values in registers for KQ computation when HSK_per_thread > 16
  • Fused loop for Lf accumulation and Of scaling by eMf
  • Changed to vectorized vec4 stores for output
  • Optimized masksh layout with stride padding (Br + 1) and removed unnecessary barrier
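The fused Lf/Of update in the list above is an instance of the standard online-softmax recurrence that flash attention is built on. A minimal NumPy sketch of that recurrence for a single query row (the names Mf, Lf, Of, and eMf mirror the shader's accumulators; everything else is illustrative, not the shader's actual code):

```python
import numpy as np

def online_softmax_attention(q, K, V, block=4):
    """Streaming (online) softmax attention for one query row.

    Processes K/V in blocks, rescaling the running accumulators by eMf
    instead of materializing the full score row."""
    scale = 1.0 / np.sqrt(q.shape[0])
    Mf = -np.inf                   # running maximum of the scores
    Lf = 0.0                       # running softmax denominator
    Of = np.zeros(V.shape[1])      # unnormalized output accumulator
    for i in range(0, K.shape[0], block):
        s = (K[i:i + block] @ q) * scale   # scores for this KV block
        Mf_new = max(Mf, s.max())
        eMf = np.exp(Mf - Mf_new)          # rescale factor for the old state
        p = np.exp(s - Mf_new)
        # Fused update: Lf accumulation and Of scaling by eMf in one pass.
        Lf = Lf * eMf + p.sum()
        Of = Of * eMf + p @ V[i:i + block]
        Mf = Mf_new
    return Of / Lf
```

The final division by Lf normalizes the accumulator, so the result matches a full-row softmax exactly while only ever holding one KV block of scores.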

Row Size Tiering

  • Replaced binary small_rows/large_rows with three-tier system: FA_ROWS_1, FA_ROWS_SMALL, FA_ROWS_LARGE
  • Dynamic Br selection based on head sizes, device vendor, and architecture
  • FA_ROWS_1 uses Br=1 for N=1, FA_ROWS_SMALL uses Br=8, FA_ROWS_LARGE uses Br=16
  • Device-specific adjustments: AMD GCN uses smaller Br, Intel uses Br=8 maximum
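As a rough sketch, the tiering above could be expressed as a small selection function. The constants come from the bullet points; the vendor checks are simplified stand-ins for the PR's actual logic, which also weighs head sizes and further architecture details:

```python
# Hypothetical constants mirroring the three tiers in the summary.
FA_ROWS_1, FA_ROWS_SMALL, FA_ROWS_LARGE = range(3)

def select_rows_tier(n_rows, vendor="nvidia", gcn=False):
    """Return (tier, Br). Illustrative only."""
    if n_rows == 1:
        return FA_ROWS_1, 1        # Br = 1 for single-token decode
    if vendor == "intel":
        return FA_ROWS_SMALL, 8    # Intel capped at Br = 8
    if vendor == "amd" and gcn:
        return FA_ROWS_SMALL, 8    # GCN prefers a smaller Br
    return FA_ROWS_LARGE, 16      # default large tier: Br = 16
```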

Vendor-Specific Optimizations

  • AMD RDNA: Use wave32 subgroup size for scalar FA when N=1
  • Intel: Added shader core count lookup table for Alchemist and Battlemage GPUs
  • Intel: Disable subgroup operations in favor of shared memory reductions
  • Intel Alchemist: Apply 2x shader core count multiplier for split_k calculation
  • Adjusted workgroup sizes per vendor and head size combinations

split_k Enhancements

  • Relaxed split_k conditions to support non-GQA workloads
  • Fixed dispatch logic to handle both GQA and non-GQA cases correctly
  • Improved split_k calculation based on total workgroup count and shader cores
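A plausible shape for such a split_k heuristic, purely illustrative (the PR's actual formula is not reproduced here):

```python
def choose_split_k(total_workgroups, shader_cores, max_split=16):
    """Pick a split_k factor so the dispatch can fill the GPU.

    Illustrative heuristic: if the workgroup count already covers the
    shader cores there is nothing to gain; otherwise keep doubling the
    split while it still adds occupancy."""
    if total_workgroups >= shader_cores:
        return 1
    split = 1
    while split < max_split and total_workgroups * split * 2 <= shader_cores:
        split *= 2
    return split
```

Each split_k slice processes a fraction of the KV sequence and writes partial M/L/O accumulators, which a reduction pass then combines.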

Device Compatibility

  • Added FP32 shader variants (_fp32 suffix) for devices without FP16 support
  • Made FLOAT_TYPE conditional on device capabilities
  • Updated dequantize4 functions to use FLOAT_TYPE instead of hardcoded float
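The _fp32 fallback can be illustrated with a tiny helper (the helper and base name are hypothetical; only the suffix convention comes from the PR):

```python
def fa_shader_name(base, device_supports_fp16):
    # Devices without FP16 support get the "_fp32" shader variant,
    # where FLOAT_TYPE falls back to 32-bit float.
    return base if device_supports_fp16 else base + "_fp32"
```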

Shared Memory Management

  • Dynamic tmpsh sizing based on row_split and subgroup configuration
  • Added kvsh buffer for K/V staging (size conditional on SHMEM_STAGING flag)
  • Improved Qf buffer stride calculation
  • Fixed tmpsh size calculation for split_k temporaries
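A rough byte accounting of the buffers named above, with assumed strides and element sizes (illustrative only, not the PR's exact formulas):

```python
def fa_shared_memory_bytes(hsk, hsv, Br, Bc, row_split, subgroups,
                           staging=False, elem=2):
    """Sketch of per-workgroup shared memory usage. All strides and
    element sizes here are assumptions for illustration."""
    Qf     = Br * (hsk + 2) * elem           # Qf with a padded stride
    masksh = Bc * (Br + 1) * elem            # mask tile, stride Br + 1
    tmpsh  = row_split * subgroups * 2 * 4   # per-subgroup M/L reductions
    kvsh   = Bc * max(hsk, hsv) * elem if staging else 0
    return Qf + masksh + tmpsh + kvsh
```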

Code Path Selection

  • Switch from coopmat1 to scalar when N=1 or rows=FA_ROWS_1
  • Improved shared memory size checks for scalar path fallback
  • Better alignment checking and stride validation
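A simplified sketch of the selection order (path and tier names come from this summary; the real logic also validates shared-memory limits, alignment, and strides before settling on a path):

```python
def pick_fa_path(n, rows_tier, have_coopmat1, have_coopmat2):
    """Illustrative code-path selection for flash attention."""
    if have_coopmat2:
        return "coopmat2"
    if have_coopmat1 and n > 1 and rows_tier != "FA_ROWS_1":
        return "coopmat1"
    return "scalar"   # N == 1 / FA_ROWS_1 now prefers scalar over coopmat1
```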

Shader Compilation

  • Made coopmat1/coopmat2 pipeline creation conditional on device FP16 support
  • Added subgroup size configuration per code path and row configuration
  • Removed hardcoded subgroup size assumptions

Benchmarks

AMD Radeon Pro VII

| model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 1003.15 ± 0.89 | 800.28 ± 1.41 | 827.57 ± 0.74 | +3.4% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 | 85.12 ± 1.39 | 98.55 ± 0.55 | 97.83 ± 0.47 | -0.7% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 689.31 ± 0.64 | 174.36 ± 0.42 | 388.72 ± 3.37 | +122.9% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 69.91 ± 0.20 | 55.97 ± 0.20 | 72.24 ± 0.34 | +29.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d16384 | 525.25 ± 1.68 | 84.33 ± 0.11 | 247.07 ± 1.51 | +193.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d16384 | 60.48 ± 0.17 | 41.46 ± 0.12 | 57.70 ± 0.57 | +39.2% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 1061.99 ± 7.85 | 1319.64 ± 7.82 | 1321.90 ± 6.90 | +0.2% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 110.86 ± 0.97 | 136.10 ± 0.27 | 127.75 ± 0.88 | -6.1% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d8192 | 745.39 ± 1.25 | 757.62 ± 3.94 | 740.88 ± 4.66 | -2.2% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d8192 | 101.64 ± 0.41 | 116.38 ± 0.17 | 113.37 ± 0.93 | -2.6% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d16384 | 577.95 ± 3.32 | 509.10 ± 3.64 | 484.85 ± 2.85 | -4.8% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d16384 | 99.23 ± 0.21 | 107.31 ± 0.68 | 102.88 ± 1.13 | -4.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 351.98 ± 3.24 | 749.40 ± 5.15 | 759.11 ± 4.74 | +1.3% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 68.83 ± 0.11 | 95.12 ± 0.22 | 93.94 ± 0.45 | -1.2% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d8192 | 295.91 ± 3.09 | 207.63 ± 0.63 | 312.17 ± 5.34 | +50.3% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d8192 | 60.01 ± 0.77 | 55.87 ± 0.35 | 73.73 ± 0.68 | +32.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d16384 | 247.76 ± 0.77 | 114.90 ± 0.42 | 191.18 ± 1.32 | +66.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d16384 | 55.69 ± 0.30 | 44.11 ± 0.11 | 61.76 ± 0.63 | +40.0% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 | 641.90 ± 2.66 | 657.73 ± 3.46 | 740.63 ± 1.78 | +12.6% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 | 47.72 ± 0.13 | 64.38 ± 0.19 | 65.54 ± 0.32 | +1.8% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 @ d8192 | 293.28 ± 0.54 | 83.15 ± 0.33 | 129.38 ± 0.69 | +55.6% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 @ d8192 | 38.76 ± 0.07 | 35.93 ± 0.20 | 37.94 ± 0.33 | +5.6% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 @ d16384 | 189.33 ± 0.18 | 41.62 ± 0.24 | 70.77 ± 0.49 | +70.0% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 @ d16384 | 31.80 ± 0.08 | 24.39 ± 0.36 | 26.41 ± 0.22 | +8.3% |

AMD 8060S

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 994.34 ± 34.50 | 947.41 ± 7.78 | -4.7% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 | 45.14 ± 0.44 | 44.86 ± 0.42 | -0.6% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 418.71 ± 11.10 | 397.77 ± 8.90 | -5.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 35.83 ± 0.09 | 35.68 ± 0.08 | -0.4% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d16384 | 234.05 ± 5.66 | 246.05 ± 11.58 | +5.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d16384 | 30.53 ± 0.08 | 30.13 ± 0.11 | -1.3% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 1263.73 ± 34.96 | 1208.77 ± 37.78 | -4.3% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 73.19 ± 0.13 | 72.68 ± 0.10 | -0.7% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d8192 | 920.01 ± 4.93 | 919.00 ± 4.71 | -0.1% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d8192 | 66.74 ± 0.45 | 66.42 ± 0.13 | -0.5% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d16384 | 670.22 ± 4.61 | 670.46 ± 5.07 | +0.0% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d16384 | 61.53 ± 0.78 | 61.78 ± 1.08 | +0.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 945.03 ± 32.97 | 992.30 ± 11.33 | +5.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 91.76 ± 0.06 | 91.60 ± 0.53 | -0.2% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d8192 | 487.96 ± 2.76 | 479.56 ± 4.25 | -1.7% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d8192 | 66.47 ± 0.33 | 66.13 ± 0.27 | -0.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d16384 | 302.07 ± 1.01 | 286.72 ± 1.03 | -5.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d16384 | 50.54 ± 0.19 | 49.64 ± 0.88 | -1.8% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | pp512 | 924.97 ± 10.45 | 923.58 ± 4.06 | -0.2% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | tg128 | 61.52 ± 0.34 | 61.43 ± 0.41 | -0.1% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | pp512 @ d8192 | 306.02 ± 0.84 | 297.15 ± 0.91 | -2.9% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | tg128 @ d8192 | 38.31 ± 0.20 | 39.20 ± 0.17 | +2.3% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | pp512 @ d16384 | 192.72 ± 0.35 | 182.25 ± 0.82 | -5.4% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | tg128 @ d16384 | 27.83 ± 0.16 | 28.83 ± 0.01 | +3.6% |

AMD 8060S (Without Coopmat)

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 815.03 ± 7.22 | 822.68 ± 4.39 | +0.9% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 | 44.96 ± 0.22 | 45.36 ± 0.30 | +0.9% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 67.06 ± 4.00 | 190.34 ± 2.98 | +183.8% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 31.53 ± 0.13 | 35.31 ± 0.28 | +12.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d16384 | 28.05 ± 0.85 | 78.89 ± 4.18 | +181.2% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d16384 | 25.53 ± 0.17 | 29.71 ± 0.08 | +16.4% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 1249.96 ± 37.10 | 1187.02 ± 15.67 | -5.0% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 73.17 ± 0.06 | 72.39 ± 0.23 | -1.1% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d8192 | 681.99 ± 1.44 | 681.63 ± 2.60 | -0.1% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d8192 | 66.34 ± 0.35 | 66.37 ± 0.21 | +0.0% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d16384 | 438.09 ± 2.70 | 408.44 ± 7.02 | -6.8% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d16384 | 61.46 ± 0.62 | 61.54 ± 0.76 | +0.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 617.33 ± 13.14 | 614.00 ± 6.22 | -0.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 94.84 ± 0.20 | 92.14 ± 0.22 | -2.8% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d8192 | 179.49 ± 0.92 | 227.94 ± 1.12 | +27.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d8192 | 57.91 ± 0.39 | 67.14 ± 0.11 | +15.9% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d16384 | 86.39 ± 0.78 | 128.04 ± 0.64 | +48.2% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d16384 | 43.22 ± 0.18 | 51.58 ± 0.14 | +19.3% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | pp512 | 727.26 ± 4.81 | 810.87 ± 5.13 | +11.5% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | tg128 | 61.59 ± 0.70 | 61.90 ± 0.12 | +0.5% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | pp512 @ d8192 | 105.57 ± 0.50 | 178.01 ± 0.22 | +68.6% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | tg128 @ d8192 | 38.58 ± 0.19 | 39.50 ± 0.33 | +2.4% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | pp512 @ d16384 | 52.56 ± 0.29 | 94.60 ± 0.41 | +80.0% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | tg128 @ d16384 | 28.02 ± 0.18 | 28.98 ± 0.06 | +3.4% |

Intel A770

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 818.22 ± 0.63 | 812.84 ± 1.85 | -0.7% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 | 32.64 ± 0.07 | 32.45 ± 0.05 | -0.6% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d2048 | 97.15 ± 0.05 | 550.81 ± 1.20 | +467.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d2048 | 21.67 ± 0.02 | 27.75 ± 0.02 | +28.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d4096 | 43.79 ± 2.97 | 405.21 ± 0.78 | +825.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d4096 | 17.28 ± 0.00 | 25.06 ± 0.01 | +45.0% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 930.73 ± 3.24 | 898.65 ± 3.47 | -3.4% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 41.29 ± 0.07 | 37.53 ± 0.11 | -9.1% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d2048 | 701.16 ± 3.52 | 670.17 ± 4.91 | -4.4% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d2048 | 31.19 ± 0.06 | 31.73 ± 0.03 | +1.7% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d4096 | 545.63 ± 1.16 | 495.18 ± 0.71 | -9.2% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d4096 | 28.83 ± 0.09 | 29.27 ± 0.04 | +1.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 640.10 ± 3.55 | 657.27 ± 3.54 | +2.7% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 33.43 ± 0.08 | 30.04 ± 0.03 | -10.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d2048 | 60.27 ± 4.78 | 281.25 ± 1.21 | +366.7% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d2048 | 20.16 ± 0.02 | 22.98 ± 0.03 | +14.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d4096 | 26.38 ± 0.63 | 310.19 ± 1.68 | +1075.9% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d4096 | 18.27 ± 0.03 | 23.61 ± 0.08 | +29.2% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 | 167.35 ± 0.17 | 66.63 ± 0.23 | -60.2% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 | 19.23 ± 0.01 | 20.38 ± 0.03 | +6.0% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 @ d2048 | 26.23 ± 1.02 | 25.38 ± 0.01 | -3.2% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 @ d2048 | 5.95 ± 0.00 | 13.59 ± 0.01 | +128.4% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 @ d4096 | 25.54 ± 0.02 | 25.29 ± 0.04 | -1.0% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 @ d4096 | 3.64 ± 0.00 | 10.37 ± 0.00 | +184.9% |

Nvidia RTX 3090 (Coopmat2)

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 4666.60 ± 19.46 | 4721.23 ± 12.32 | +1.2% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 | 144.71 ± 1.53 | 147.49 ± 0.52 | +1.9% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 3426.64 ± 19.29 | 3428.98 ± 22.04 | +0.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 114.85 ± 0.97 | 115.92 ± 0.34 | +0.9% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d16384 | 2695.37 ± 16.65 | 2692.89 ± 16.34 | -0.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d16384 | 99.65 ± 0.73 | 99.82 ± 0.29 | +0.2% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 4520.31 ± 33.68 | 4513.71 ± 30.22 | -0.1% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 177.65 ± 0.75 | 177.15 ± 0.77 | -0.3% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d8192 | 4040.47 ± 78.90 | 4049.94 ± 174.56 | +0.2% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d8192 | 156.59 ± 1.58 | 155.91 ± 0.78 | -0.4% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d16384 | 3546.97 ± 21.35 | 3529.89 ± 36.63 | -0.5% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d16384 | 147.96 ± 0.76 | 145.37 ± 0.48 | -1.8% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 3469.59 ± 17.36 | 3465.49 ± 34.45 | -0.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 178.72 ± 0.64 | 177.48 ± 2.05 | -0.7% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d8192 | 2508.75 ± 42.02 | 2500.37 ± 34.47 | -0.3% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d8192 | 141.66 ± 0.54 | 141.16 ± 0.65 | -0.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d16384 | 1942.67 ± 15.90 | 1936.24 ± 20.12 | -0.3% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d16384 | 123.39 ± 0.72 | 123.21 ± 0.29 | -0.1% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 | 2287.89 ± 11.77 | 2289.12 ± 9.34 | +0.1% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 | 116.47 ± 0.80 | 114.38 ± 3.56 | -1.8% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 @ d8192 | 1047.29 ± 9.19 | 1047.12 ± 9.51 | -0.0% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 @ d8192 | 90.74 ± 0.34 | 90.44 ± 0.37 | -0.3% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 @ d16384 | 647.46 ± 3.70 | 644.65 ± 3.78 | -0.4% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 @ d16384 | 81.92 ± 0.81 | 82.07 ± 0.20 | +0.2% |

Nvidia RTX 3090 (Coopmat1)

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 4117.11 ± 10.81 | 4052.19 ± 17.94 | -1.6% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 | 145.98 ± 1.84 | 144.04 ± 0.74 | -1.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 2182.12 ± 11.97 | 2359.95 ± 10.14 | +8.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 115.72 ± 0.56 | 116.46 ± 0.62 | +0.6% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d16384 | 1486.54 ± 4.89 | 1671.90 ± 9.35 | +12.5% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d16384 | 99.15 ± 0.74 | 101.36 ± 0.32 | +2.2% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 3062.95 ± 94.07 | 3090.31 ± 33.32 | +0.9% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 175.29 ± 0.83 | 175.87 ± 0.88 | +0.3% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d8192 | 2439.28 ± 32.02 | 2494.98 ± 47.57 | +2.3% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d8192 | 148.99 ± 14.70 | 154.40 ± 2.18 | +3.6% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d16384 | 1964.74 ± 21.60 | 2098.26 ± 19.00 | +6.8% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d16384 | 147.55 ± 0.70 | 147.66 ± 0.69 | +0.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 2839.27 ± 26.12 | 2837.32 ± 30.26 | -0.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 174.78 ± 1.25 | 176.05 ± 1.26 | +0.7% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d8192 | 1505.57 ± 14.41 | 1639.74 ± 14.94 | +8.9% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d8192 | 137.34 ± 0.86 | 139.22 ± 2.10 | +1.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d16384 | 1010.90 ± 10.49 | 1146.23 ± 14.19 | +13.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d16384 | 119.58 ± 0.71 | 121.95 ± 0.88 | +2.0% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 | 1968.30 ± 10.15 | 1954.94 ± 33.29 | -0.7% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 | 114.35 ± 0.87 | 115.05 ± 0.80 | +0.6% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 @ d8192 | 554.73 ± 1.56 | 555.49 ± 1.82 | +0.1% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 @ d8192 | 62.50 ± 0.51 | 63.21 ± 0.34 | +1.1% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 @ d16384 | 314.59 ± 0.93 | 315.91 ± 1.26 | +0.4% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 @ d16384 | 43.01 ± 0.10 | 43.98 ± 0.15 | +2.3% |

Nvidia RTX 3090 (Without Coopmat)

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 2129.81 ± 5.52 | 2081.00 ± 42.53 | -2.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 | 145.98 ± 0.24 | 144.26 ± 0.53 | -1.2% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 997.77 ± 3.31 | 1048.43 ± 25.28 | +5.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 110.19 ± 0.54 | 112.16 ± 0.12 | +1.8% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d16384 | 637.54 ± 1.09 | 701.26 ± 11.14 | +10.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d16384 | 94.33 ± 0.22 | 95.27 ± 0.31 | +1.0% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 2410.79 ± 15.88 | 2331.15 ± 89.00 | -3.3% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 176.60 ± 0.74 | 173.28 ± 0.72 | -1.9% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d8192 | 1582.99 ± 17.17 | 1429.18 ± 11.60 | -9.7% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d8192 | 153.60 ± 1.60 | 150.58 ± 0.91 | -2.0% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d16384 | 1114.36 ± 154.82 | 1009.61 ± 23.16 | -9.4% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d16384 | 146.14 ± 0.64 | 143.19 ± 1.18 | -2.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 1159.21 ± 12.74 | 1137.29 ± 13.35 | -1.9% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 177.45 ± 1.07 | 175.96 ± 1.95 | -0.8% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d8192 | 592.47 ± 4.68 | 620.55 ± 6.11 | +4.7% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d8192 | 130.00 ± 0.58 | 135.84 ± 1.70 | +4.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d16384 | 387.10 ± 1.89 | 425.32 ± 0.85 | +9.9% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d16384 | 113.49 ± 0.51 | 117.90 ± 0.71 | +3.9% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 | 1050.83 ± 17.39 | 1092.14 ± 16.92 | +3.9% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 | 114.66 ± 2.79 | 115.36 ± 3.33 | +0.6% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 @ d8192 | 281.20 ± 1.84 | 342.26 ± 2.76 | +21.7% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 @ d8192 | 63.73 ± 0.06 | 63.90 ± 0.37 | +0.3% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 @ d16384 | 159.38 ± 1.00 | 202.89 ± 2.03 | +27.3% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 @ d16384 | 43.40 ± 0.05 | 44.22 ± 0.09 | +1.9% |


@loci-review

loci-review bot commented Feb 15, 2026

The analysis encountered an error. Please review the Processing Details for more information.

1 similar comment

@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 9ea4a65 to c001e9f Compare February 22, 2026 02:17
@loci-dev loci-dev force-pushed the loci/pr-19625-0cc4m-vulkan-fa-scalar-opt branch from 32d504c to 378d110 Compare February 22, 2026 03:07
@loci-review

loci-review bot commented Feb 22, 2026

No meaningful performance changes were detected across 111507 analyzed functions in the following binaries: build.bin.libllama.so, build.bin.llama-tts, build.bin.llama-cvector-generator, build.bin.libmtmd.so, build.bin.llama-tokenize, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 13648e6 to 1d064d0 Compare March 3, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 551dfb5 to 55a969e Compare March 11, 2026 02:16
@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 5ac00d6 to 998dd7a Compare March 18, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 4 times, most recently from 945fa3a to 0e8e1d6 Compare March 20, 2026 02:17