
UPSTREAM PR #19057: CUDA: re-use MLA K data for V in MMA FA#1014

Open
loci-dev wants to merge 1 commit into `main` from `upstream-PR19057-branch_JohannesGaessler-cuda-fa-v-is-k`

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#19057

Follow-up to ggml-org/llama.cpp#18986.

This PR re-enables a performance optimization in the CUDA MMA FlashAttention kernel that re-uses part of the K data for V.

Performance
| GPU | Model | Microbatch size | Test | t/s b7818 | t/s f5cfe16 | Speedup |
| --- | --- | ---: | --- | ---: | ---: | ---: |
| RTX 3090 | deepseek2 16B Q4_0 | 1 | pp512@d32768 | 140.61 | 156.74 | 1.11 |
| RTX 3090 | deepseek2 16B Q4_0 | 2 | pp512@d32768 | 158.14 | 163.31 | 1.03 |
| RTX 3090 | deepseek2 16B Q4_0 | 4 | pp512@d32768 | 251.75 | 258.88 | 1.03 |
| RTX 3090 | deepseek2 16B Q4_0 | 8 | pp512@d32768 | 383.29 | 399.11 | 1.04 |
| RTX 3090 | deepseek2 16B Q4_0 | 16 | pp512@d32768 | 554.66 | 587.24 | 1.06 |
| RTX 3090 | deepseek2 16B Q4_0 | 32 | pp512@d32768 | 728.44 | 770.71 | 1.06 |
| RTX 3090 | deepseek2 16B Q4_0 | 64 | pp512@d32768 | 877.92 | 946.42 | 1.08 |
| RTX 3090 | deepseek2 16B Q4_0 | 128 | pp512@d32768 | 971.13 | 1033.12 | 1.06 |
| RTX 3090 | deepseek2 16B Q4_0 | 256 | pp512@d32768 | 1016.54 | 1060.38 | 1.04 |
| RTX 3090 | deepseek2 16B Q4_0 | 512 | pp512@d32768 | 1110.34 | 1206.51 | 1.09 |
| RTX 3090 | deepseek2 ?B Q2_K_M | 1 | pp512@d32768 | 43.77 | 52.35 | 1.20 |
| RTX 3090 | deepseek2 ?B Q2_K_M | 2 | pp512@d32768 | 63.96 | 72.39 | 1.13 |
| RTX 3090 | deepseek2 ?B Q2_K_M | 4 | pp512@d32768 | 106.03 | 122.28 | 1.15 |
| RTX 3090 | deepseek2 ?B Q2_K_M | 8 | pp512@d32768 | 189.30 | 197.41 | 1.04 |
| RTX 3090 | deepseek2 ?B Q2_K_M | 16 | pp512@d32768 | 297.97 | 308.39 | 1.03 |
| RTX 3090 | deepseek2 ?B Q2_K_M | 32 | pp512@d32768 | 368.21 | 373.97 | 1.02 |
| RTX 3090 | deepseek2 ?B Q2_K_M | 64 | pp512@d32768 | 407.80 | 429.07 | 1.05 |
| RTX 3090 | deepseek2 ?B Q2_K_M | 128 | pp512@d32768 | 431.24 | 458.08 | 1.06 |
| RTX 3090 | deepseek2 ?B Q2_K_M | 256 | pp512@d32768 | 524.06 | 539.69 | 1.03 |
| RTX 3090 | deepseek2 ?B Q2_K_M | 512 | pp512@d32768 | 598.11 | 644.65 | 1.08 |
| RTX 4090 | deepseek2 16B Q4_0 | 1 | pp512@d32768 | 182.18 | 183.62 | 1.01 |
| RTX 4090 | deepseek2 16B Q4_0 | 2 | pp512@d32768 | 221.62 | 223.21 | 1.01 |
| RTX 4090 | deepseek2 16B Q4_0 | 4 | pp512@d32768 | 369.57 | 372.53 | 1.01 |
| RTX 4090 | deepseek2 16B Q4_0 | 8 | pp512@d32768 | 617.32 | 626.49 | 1.01 |
| RTX 4090 | deepseek2 16B Q4_0 | 16 | pp512@d32768 | 965.72 | 988.45 | 1.02 |
| RTX 4090 | deepseek2 16B Q4_0 | 32 | pp512@d32768 | 1397.13 | 1447.24 | 1.04 |
| RTX 4090 | deepseek2 16B Q4_0 | 64 | pp512@d32768 | 1873.01 | 1958.05 | 1.05 |
| RTX 4090 | deepseek2 16B Q4_0 | 128 | pp512@d32768 | 2158.83 | 2270.64 | 1.05 |
| RTX 4090 | deepseek2 16B Q4_0 | 256 | pp512@d32768 | 2605.70 | 2790.26 | 1.07 |
| RTX 4090 | deepseek2 16B Q4_0 | 512 | pp512@d32768 | 2850.17 | 3088.95 | 1.08 |
| RTX 4090 | deepseek2 ?B Q2_K_M | 1 | pp512@d32768 | 81.98 | 99.01 | 1.21 |
| RTX 4090 | deepseek2 ?B Q2_K_M | 2 | pp512@d32768 | 113.95 | 129.08 | 1.13 |
| RTX 4090 | deepseek2 ?B Q2_K_M | 4 | pp512@d32768 | 196.74 | 212.86 | 1.08 |
| RTX 4090 | deepseek2 ?B Q2_K_M | 8 | pp512@d32768 | 333.54 | 344.98 | 1.03 |
| RTX 4090 | deepseek2 ?B Q2_K_M | 16 | pp512@d32768 | 551.75 | 566.78 | 1.03 |
| RTX 4090 | deepseek2 ?B Q2_K_M | 32 | pp512@d32768 | 760.50 | 792.06 | 1.04 |
| RTX 4090 | deepseek2 ?B Q2_K_M | 64 | pp512@d32768 | 912.59 | 962.60 | 1.05 |
| RTX 4090 | deepseek2 ?B Q2_K_M | 128 | pp512@d32768 | 1010.51 | 1074.58 | 1.06 |
| RTX 4090 | deepseek2 ?B Q2_K_M | 256 | pp512@d32768 | 1235.87 | 1330.60 | 1.08 |
| RTX 4090 | deepseek2 ?B Q2_K_M | 512 | pp512@d32768 | 1358.25 | 1465.94 | 1.08 |
| RTX 5090 | deepseek2 16B Q4_0 | 1 | pp512@d32768 | 180.70 | 183.59 | 1.02 |
| RTX 5090 | deepseek2 16B Q4_0 | 2 | pp512@d32768 | 210.08 | 211.61 | 1.01 |
| RTX 5090 | deepseek2 16B Q4_0 | 4 | pp512@d32768 | 376.96 | 379.17 | 1.01 |
| RTX 5090 | deepseek2 16B Q4_0 | 8 | pp512@d32768 | 692.17 | 700.43 | 1.01 |
| RTX 5090 | deepseek2 16B Q4_0 | 16 | pp512@d32768 | 1041.68 | 1063.63 | 1.02 |
| RTX 5090 | deepseek2 16B Q4_0 | 32 | pp512@d32768 | 1601.00 | 1649.08 | 1.03 |
| RTX 5090 | deepseek2 16B Q4_0 | 64 | pp512@d32768 | 2152.07 | 2230.35 | 1.04 |
| RTX 5090 | deepseek2 16B Q4_0 | 128 | pp512@d32768 | 2576.08 | 2698.83 | 1.05 |
| RTX 5090 | deepseek2 16B Q4_0 | 256 | pp512@d32768 | 3196.82 | 3386.39 | 1.06 |
| RTX 5090 | deepseek2 16B Q4_0 | 512 | pp512@d32768 | 3603.42 | 3851.28 | 1.07 |
| RTX 5090 | deepseek2 ?B Q2_K_M | 1 | pp512@d32768 | 103.95 | 126.15 | 1.21 |
| RTX 5090 | deepseek2 ?B Q2_K_M | 2 | pp512@d32768 | 116.74 | 129.73 | 1.11 |
| RTX 5090 | deepseek2 ?B Q2_K_M | 4 | pp512@d32768 | 202.69 | 219.00 | 1.08 |
| RTX 5090 | deepseek2 ?B Q2_K_M | 8 | pp512@d32768 | 354.23 | 363.73 | 1.03 |
| RTX 5090 | deepseek2 ?B Q2_K_M | 16 | pp512@d32768 | 566.62 | 578.24 | 1.02 |
| RTX 5090 | deepseek2 ?B Q2_K_M | 32 | pp512@d32768 | 799.68 | 825.30 | 1.03 |
| RTX 5090 | deepseek2 ?B Q2_K_M | 64 | pp512@d32768 | 1007.89 | 1048.39 | 1.04 |
| RTX 5090 | deepseek2 ?B Q2_K_M | 128 | pp512@d32768 | 1144.56 | 1195.92 | 1.04 |
| RTX 5090 | deepseek2 ?B Q2_K_M | 256 | pp512@d32768 | 1473.16 | 1558.72 | 1.06 |
| RTX 5090 | deepseek2 ?B Q2_K_M | 512 | pp512@d32768 | 1695.63 | 1810.36 | 1.07 |
| RX 9060 XT | deepseek2 16B Q4_0 | 1 | pp512@d32768 | 50.39 | 53.05 | 1.05 |
| RX 9060 XT | deepseek2 16B Q4_0 | 2 | pp512@d32768 | 48.93 | 58.73 | 1.20 |
| RX 9060 XT | deepseek2 16B Q4_0 | 4 | pp512@d32768 | 70.12 | 75.67 | 1.08 |
| RX 9060 XT | deepseek2 16B Q4_0 | 8 | pp512@d32768 | 85.39 | 93.94 | 1.10 |
| RX 9060 XT | deepseek2 16B Q4_0 | 16 | pp512@d32768 | 99.75 | 110.83 | 1.11 |
| RX 9060 XT | deepseek2 16B Q4_0 | 32 | pp512@d32768 | 107.35 | 120.25 | 1.12 |
| RX 9060 XT | deepseek2 16B Q4_0 | 64 | pp512@d32768 | 161.86 | 167.05 | 1.03 |
| RX 9060 XT | deepseek2 16B Q4_0 | 128 | pp512@d32768 | 169.54 | 177.07 | 1.04 |
| RX 9060 XT | deepseek2 16B Q4_0 | 256 | pp512@d32768 | 180.34 | 188.13 | 1.04 |
| RX 9060 XT | deepseek2 16B Q4_0 | 512 | pp512@d32768 | 187.45 | 193.32 | 1.03 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 1 | pp512@d32768 | 90.18 | 89.34 | 0.99 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 2 | pp512@d32768 | 84.90 | 81.85 | 0.96 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 4 | pp512@d32768 | 132.86 | 130.78 | 0.98 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 8 | pp512@d32768 | 198.58 | 195.39 | 0.98 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 16 | pp512@d32768 | 260.22 | 260.29 | 1.00 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 32 | pp512@d32768 | 316.55 | 321.37 | 1.02 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 64 | pp512@d32768 | 304.07 | 309.34 | 1.02 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 128 | pp512@d32768 | 356.90 | 370.65 | 1.04 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 256 | pp512@d32768 | 436.12 | 435.77 | 1.00 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 512 | pp512@d32768 | 476.38 | 458.96 | 0.96 |
| V100-PCIE-32GB | deepseek2 ?B Q2_K_M | 1 | pp512@d32768 | 42.75 | 42.54 | 1.00 |
| V100-PCIE-32GB | deepseek2 ?B Q2_K_M | 2 | pp512@d32768 | 44.32 | 44.19 | 1.00 |
| V100-PCIE-32GB | deepseek2 ?B Q2_K_M | 4 | pp512@d32768 | 59.97 | 59.74 | 1.00 |
| V100-PCIE-32GB | deepseek2 ?B Q2_K_M | 8 | pp512@d32768 | 87.44 | 85.55 | 0.98 |
| V100-PCIE-32GB | deepseek2 ?B Q2_K_M | 16 | pp512@d32768 | 129.76 | 130.59 | 1.01 |
| V100-PCIE-32GB | deepseek2 ?B Q2_K_M | 32 | pp512@d32768 | 150.99 | 152.57 | 1.01 |
| V100-PCIE-32GB | deepseek2 ?B Q2_K_M | 64 | pp512@d32768 | 156.93 | 159.18 | 1.01 |
| V100-PCIE-32GB | deepseek2 ?B Q2_K_M | 128 | pp512@d32768 | 191.49 | 198.49 | 1.04 |
| V100-PCIE-32GB | deepseek2 ?B Q2_K_M | 256 | pp512@d32768 | 222.48 | 237.37 | 1.07 |
| V100-PCIE-32GB | deepseek2 ?B Q2_K_M | 512 | pp512@d32768 | 240.26 | 258.10 | 1.07 |

@loci-review

loci-review bot commented Jan 23, 2026

The analysis encountered an error. Please review the Processing Details for more information.

5 similar comments

@loci-dev force-pushed the main branch 22 times, most recently from edd4e32 to d549af4 on January 27, 2026 at 06:14
@loci-dev force-pushed the main branch 30 times, most recently from 5fea2ef to 8a7ef20 on January 31, 2026 at 08:12
