CUDA: remove -sm row, refactor cuBLAS #24216

Open
JohannesGaessler wants to merge 4 commits into ggml-org:master from JohannesGaessler:cuda-rm-sm-row-2

@JohannesGaessler commented Jun 5, 2026

This PR removes CUDA backend support for split buffers (--split-mode row); by now, -sm tensor has all of the features needed to make it obsolete. Split buffers therefore do not need to be considered for #23935. It is also possible to remove a lot of legacy code that predates the ggml backend API (ggml_cuda_op_mul_mat). I refactored and deduplicated the cuBLAS code to use a single function for both batched and non-batched GEMM. The compute type is chosen based on speed and can be overridden with GGML_CUDA_CUBLAS_COMPUTE_TYPE. I did some A/B testing of the cuBLAS configuration, which unlocked some FP16/BF16 performance on some GPUs.
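To illustrate the idea of one function serving both batched and non-batched GEMM, here is a minimal CPU reference sketch. It is not the code from this PR: the names `gemm_strided_batched` and `gemm`, the row-major layout, and the F32-only arithmetic are assumptions made for the example. The point is only that a non-batched GEMM is the `batch_count == 1` case of the strided batched one.

```cpp
#include <cstddef>

// Hypothetical reference implementation (not the PR's cuBLAS code).
// Computes C[b] = A[b] * B[b] for b in [0, batch_count), all matrices row-major.
// A[b] is m x k, B[b] is k x n, C[b] is m x n.
// stride_* is the element offset between consecutive matrices of a batch.
void gemm_strided_batched(const float * A, const float * B, float * C,
                          int m, int n, int k,
                          std::size_t stride_a, std::size_t stride_b, std::size_t stride_c,
                          int batch_count) {
    for (int b = 0; b < batch_count; ++b) {
        const float * Ab = A + b*stride_a;
        const float * Bb = B + b*stride_b;
        float       * Cb = C + b*stride_c;
        for (int i = 0; i < m; ++i) {
            for (int j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (int l = 0; l < k; ++l) {
                    acc += Ab[i*k + l] * Bb[l*n + j];
                }
                Cb[i*n + j] = acc;
            }
        }
    }
}

// Non-batched GEMM as the batch_count == 1 special case.
// The strides are never used for a single matrix, so they are passed as 0.
void gemm(const float * A, const float * B, float * C, int m, int n, int k) {
    gemm_strided_batched(A, B, C, m, n, k, 0, 0, 0, 1);
}
```

With a single entry point like this, configuration choices such as the compute type only have to be made in one place instead of once per code path.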

Performance
| GPU | Model | Microbatch size | Test | t/s 7c158fb | t/s d1408e5 | Speedup |
|:---|:---|---:|:---|---:|---:|---:|
| MI60 / MI50 | llama 1B BF16 | 1 | pp512 | 191.42 | 192.01 | 1.00 |
| MI60 / MI50 | llama 1B BF16 | 2 | pp512 | 327.76 | 331.62 | 1.01 |
| MI60 / MI50 | llama 1B BF16 | 4 | pp512 | 355.72 | 361.33 | 1.02 |
| MI60 / MI50 | llama 1B BF16 | 8 | pp512 | 390.48 | 393.71 | 1.01 |
| MI60 / MI50 | llama 1B BF16 | 16 | pp512 | 142.25 | 482.60 | 3.39 |
| MI60 / MI50 | llama 1B BF16 | 32 | pp512 | 284.89 | 903.58 | 3.17 |
| MI60 / MI50 | llama 1B BF16 | 64 | pp512 | 546.58 | 1266.01 | 2.32 |
| MI60 / MI50 | llama 1B BF16 | 128 | pp512 | 1523.80 | 1532.72 | 1.01 |
| MI60 / MI50 | llama 1B BF16 | 256 | pp512 | 1765.90 | 1779.64 | 1.01 |
| MI60 / MI50 | llama 1B BF16 | 512 | pp512 | 1954.47 | 1945.21 | 1.00 |
| MI60 / MI50 | llama 1B F16 | 1 | pp512 | 192.64 | 193.92 | 1.01 |
| MI60 / MI50 | llama 1B F16 | 2 | pp512 | 329.97 | 331.85 | 1.01 |
| MI60 / MI50 | llama 1B F16 | 4 | pp512 | 388.87 | 397.98 | 1.02 |
| MI60 / MI50 | llama 1B F16 | 8 | pp512 | 391.07 | 399.73 | 1.02 |
| MI60 / MI50 | llama 1B F16 | 16 | pp512 | 622.32 | 606.85 | 0.98 |
| MI60 / MI50 | llama 1B F16 | 32 | pp512 | 1201.12 | 1200.30 | 1.00 |
| MI60 / MI50 | llama 1B F16 | 64 | pp512 | 1927.95 | 1888.80 | 0.98 |
| MI60 / MI50 | llama 1B F16 | 128 | pp512 | 2988.30 | 2985.18 | 1.00 |
| MI60 / MI50 | llama 1B F16 | 256 | pp512 | 3617.21 | 3672.75 | 1.02 |
| MI60 / MI50 | llama 1B F16 | 512 | pp512 | 5554.29 | 5555.11 | 1.00 |
| MI60 / MI50 | llama 1B all F32 | 1 | pp512 | 124.06 | 123.91 | 1.00 |
| MI60 / MI50 | llama 1B all F32 | 2 | pp512 | 230.72 | 227.55 | 0.99 |
| MI60 / MI50 | llama 1B all F32 | 4 | pp512 | 296.98 | 293.74 | 0.99 |
| MI60 / MI50 | llama 1B all F32 | 8 | pp512 | 260.68 | 255.93 | 0.98 |
| MI60 / MI50 | llama 1B all F32 | 16 | pp512 | 475.20 | 476.01 | 1.00 |
| MI60 / MI50 | llama 1B all F32 | 32 | pp512 | 947.08 | 947.79 | 1.00 |
| MI60 / MI50 | llama 1B all F32 | 64 | pp512 | 1818.96 | 1825.66 | 1.00 |
| MI60 / MI50 | llama 1B all F32 | 128 | pp512 | 2697.41 | 2660.78 | 0.99 |
| MI60 / MI50 | llama 1B all F32 | 256 | pp512 | 3101.13 | 3032.03 | 0.98 |
| MI60 / MI50 | llama 1B all F32 | 512 | pp512 | 3265.96 | 3194.20 | 0.98 |
| MI100 | llama 1B BF16 | 1 | pp512 | 294.34 | 287.90 | 0.98 |
| MI100 | llama 1B BF16 | 2 | pp512 | 472.87 | 465.29 | 0.98 |
| MI100 | llama 1B BF16 | 4 | pp512 | 561.78 | 354.46 | 0.63 |
| MI100 | llama 1B BF16 | 8 | pp512 | 1098.79 | 697.06 | 0.63 |
| MI100 | llama 1B BF16 | 16 | pp512 | 2025.16 | 1342.20 | 0.66 |
| MI100 | llama 1B BF16 | 32 | pp512 | 3515.57 | 2416.93 | 0.69 |
| MI100 | llama 1B BF16 | 64 | pp512 | 6495.60 | 4554.00 | 0.70 |
| MI100 | llama 1B BF16 | 128 | pp512 | 9127.54 | 6219.68 | 0.68 |
| MI100 | llama 1B BF16 | 256 | pp512 | 13155.13 | 10362.47 | 0.79 |
| MI100 | llama 1B BF16 | 512 | pp512 | 15960.71 | 15063.32 | 0.94 |
| MI100 | llama 1B F16 | 1 | pp512 | 289.68 | 296.44 | 1.02 |
| MI100 | llama 1B F16 | 2 | pp512 | 475.45 | 482.22 | 1.01 |
| MI100 | llama 1B F16 | 4 | pp512 | 518.73 | 517.79 | 1.00 |
| MI100 | llama 1B F16 | 8 | pp512 | 1021.62 | 1018.23 | 1.00 |
| MI100 | llama 1B F16 | 16 | pp512 | 1897.49 | 1887.55 | 0.99 |
| MI100 | llama 1B F16 | 32 | pp512 | 3473.95 | 3349.49 | 0.96 |
| MI100 | llama 1B F16 | 64 | pp512 | 6629.11 | 6625.04 | 1.00 |
| MI100 | llama 1B F16 | 128 | pp512 | 9034.29 | 8751.36 | 0.97 |
| MI100 | llama 1B F16 | 256 | pp512 | 15932.49 | 15954.15 | 1.00 |
| MI100 | llama 1B F16 | 512 | pp512 | 18669.41 | 18768.24 | 1.01 |
| MI100 | llama 1B all F32 | 1 | pp512 | 184.23 | 185.66 | 1.01 |
| MI100 | llama 1B all F32 | 2 | pp512 | 336.06 | 338.72 | 1.01 |
| MI100 | llama 1B all F32 | 4 | pp512 | 177.15 | 177.45 | 1.00 |
| MI100 | llama 1B all F32 | 8 | pp512 | 338.85 | 338.68 | 1.00 |
| MI100 | llama 1B all F32 | 16 | pp512 | 622.06 | 621.99 | 1.00 |
| MI100 | llama 1B all F32 | 32 | pp512 | 3016.91 | 3013.60 | 1.00 |
| MI100 | llama 1B all F32 | 64 | pp512 | 5596.68 | 5641.02 | 1.01 |
| MI100 | llama 1B all F32 | 128 | pp512 | 6904.30 | 6949.57 | 1.01 |
| MI100 | llama 1B all F32 | 256 | pp512 | 8824.88 | 8888.15 | 1.01 |
| MI100 | llama 1B all F32 | 512 | pp512 | 10574.72 | 10763.95 | 1.02 |
| P40 | llama 1B BF16 | 1 | pp512 | 142.92 | 142.97 | 1.00 |
| P40 | llama 1B BF16 | 2 | pp512 | 247.76 | 247.75 | 1.00 |
| P40 | llama 1B BF16 | 4 | pp512 | 285.22 | 285.36 | 1.00 |
| P40 | llama 1B BF16 | 8 | pp512 | 263.26 | 263.56 | 1.00 |
| P40 | llama 1B BF16 | 16 | pp512 | 306.25 | 311.39 | 1.02 |
| P40 | llama 1B BF16 | 32 | pp512 | 565.70 | 576.75 | 1.02 |
| P40 | llama 1B BF16 | 64 | pp512 | 1164.15 | 1190.80 | 1.02 |
| P40 | llama 1B BF16 | 128 | pp512 | 1794.15 | 1865.37 | 1.04 |
| P40 | llama 1B BF16 | 256 | pp512 | 2036.45 | 2095.52 | 1.03 |
| P40 | llama 1B BF16 | 512 | pp512 | 2363.14 | 2433.19 | 1.03 |
| P40 | llama 1B F16 | 1 | pp512 | 141.61 | 141.78 | 1.00 |
| P40 | llama 1B F16 | 2 | pp512 | 248.22 | 248.37 | 1.00 |
| P40 | llama 1B F16 | 4 | pp512 | 282.94 | 283.16 | 1.00 |
| P40 | llama 1B F16 | 8 | pp512 | 231.85 | 231.74 | 1.00 |
| P40 | llama 1B F16 | 16 | pp512 | 398.33 | 398.74 | 1.00 |
| P40 | llama 1B F16 | 32 | pp512 | 784.97 | 786.98 | 1.00 |
| P40 | llama 1B F16 | 64 | pp512 | 986.82 | 992.26 | 1.01 |
| P40 | llama 1B F16 | 128 | pp512 | 1578.70 | 1554.49 | 0.98 |
| P40 | llama 1B F16 | 256 | pp512 | 2545.30 | 2542.62 | 1.00 |
| P40 | llama 1B F16 | 512 | pp512 | 3163.62 | 3152.64 | 1.00 |
| P40 | llama 1B all F32 | 1 | pp512 | 78.81 | 78.78 | 1.00 |
| P40 | llama 1B all F32 | 2 | pp512 | 153.39 | 153.25 | 1.00 |
| P40 | llama 1B all F32 | 4 | pp512 | 266.18 | 265.97 | 1.00 |
| P40 | llama 1B all F32 | 8 | pp512 | 494.44 | 493.93 | 1.00 |
| P40 | llama 1B all F32 | 16 | pp512 | 964.91 | 964.48 | 1.00 |
| P40 | llama 1B all F32 | 32 | pp512 | 1818.89 | 1819.26 | 1.00 |
| P40 | llama 1B all F32 | 64 | pp512 | 1484.58 | 1485.62 | 1.00 |
| P40 | llama 1B all F32 | 128 | pp512 | 2133.50 | 2135.09 | 1.00 |
| P40 | llama 1B all F32 | 256 | pp512 | 3247.97 | 3252.88 | 1.00 |
| P40 | llama 1B all F32 | 512 | pp512 | 3606.71 | 3603.94 | 1.00 |
| Radeon 8060S Graphics | llama 1B BF16 | 1 | pp512 | 77.69 | 77.53 | 1.00 |
| Radeon 8060S Graphics | llama 1B BF16 | 2 | pp512 | 149.62 | 149.23 | 1.00 |
| Radeon 8060S Graphics | llama 1B BF16 | 4 | pp512 | 200.09 | 200.02 | 1.00 |
| Radeon 8060S Graphics | llama 1B BF16 | 8 | pp512 | 423.37 | 423.42 | 1.00 |
| Radeon 8060S Graphics | llama 1B BF16 | 16 | pp512 | 411.11 | 411.25 | 1.00 |
| Radeon 8060S Graphics | llama 1B BF16 | 32 | pp512 | 758.67 | 760.15 | 1.00 |
| Radeon 8060S Graphics | llama 1B BF16 | 64 | pp512 | 1252.83 | 1246.63 | 1.00 |
| Radeon 8060S Graphics | llama 1B BF16 | 128 | pp512 | 2189.18 | 2179.17 | 1.00 |
| Radeon 8060S Graphics | llama 1B BF16 | 256 | pp512 | 3376.78 | 3339.39 | 0.99 |
| Radeon 8060S Graphics | llama 1B BF16 | 512 | pp512 | 4073.21 | 4075.22 | 1.00 |
| Radeon 8060S Graphics | llama 1B F16 | 1 | pp512 | 77.75 | 76.74 | 0.99 |
| Radeon 8060S Graphics | llama 1B F16 | 2 | pp512 | 150.24 | 150.03 | 1.00 |
| Radeon 8060S Graphics | llama 1B F16 | 4 | pp512 | 198.09 | 195.43 | 0.99 |
| Radeon 8060S Graphics | llama 1B F16 | 8 | pp512 | 390.17 | 389.44 | 1.00 |
| Radeon 8060S Graphics | llama 1B F16 | 16 | pp512 | 410.46 | 409.38 | 1.00 |
| Radeon 8060S Graphics | llama 1B F16 | 32 | pp512 | 1074.64 | 1075.50 | 1.00 |
| Radeon 8060S Graphics | llama 1B F16 | 64 | pp512 | 1720.09 | 1725.11 | 1.00 |
| Radeon 8060S Graphics | llama 1B F16 | 128 | pp512 | 2084.72 | 2072.71 | 0.99 |
| Radeon 8060S Graphics | llama 1B F16 | 256 | pp512 | 3359.51 | 3358.58 | 1.00 |
| Radeon 8060S Graphics | llama 1B F16 | 512 | pp512 | 3771.89 | 3798.75 | 1.01 |
| Radeon 8060S Graphics | llama 1B all F32 | 1 | pp512 | 51.37 | 51.34 | 1.00 |
| Radeon 8060S Graphics | llama 1B all F32 | 2 | pp512 | 100.11 | 100.10 | 1.00 |
| Radeon 8060S Graphics | llama 1B all F32 | 4 | pp512 | 190.98 | 190.71 | 1.00 |
| Radeon 8060S Graphics | llama 1B all F32 | 8 | pp512 | 287.32 | 287.22 | 1.00 |
| Radeon 8060S Graphics | llama 1B all F32 | 16 | pp512 | 331.38 | 327.79 | 0.99 |
| Radeon 8060S Graphics | llama 1B all F32 | 32 | pp512 | 536.08 | 529.55 | 0.99 |
| Radeon 8060S Graphics | llama 1B all F32 | 64 | pp512 | 677.55 | 671.29 | 0.99 |
| Radeon 8060S Graphics | llama 1B all F32 | 128 | pp512 | 855.41 | 850.31 | 0.99 |
| Radeon 8060S Graphics | llama 1B all F32 | 256 | pp512 | 972.33 | 970.48 | 1.00 |
| Radeon 8060S Graphics | llama 1B all F32 | 512 | pp512 | 960.00 | 963.54 | 1.00 |
| RTX 3090 | llama 1B BF16 | 1 | pp512 | 370.93 | 374.91 | 1.01 |
| RTX 3090 | llama 1B BF16 | 2 | pp512 | 666.40 | 675.22 | 1.01 |
| RTX 3090 | llama 1B BF16 | 4 | pp512 | 1320.79 | 1337.74 | 1.01 |
| RTX 3090 | llama 1B BF16 | 8 | pp512 | 2537.07 | 2572.88 | 1.01 |
| RTX 3090 | llama 1B BF16 | 16 | pp512 | 4453.99 | 4536.24 | 1.02 |
| RTX 3090 | llama 1B BF16 | 32 | pp512 | 6974.66 | 7161.65 | 1.03 |
| RTX 3090 | llama 1B BF16 | 64 | pp512 | 11590.54 | 12516.72 | 1.08 |
| RTX 3090 | llama 1B BF16 | 128 | pp512 | 14216.30 | 15814.31 | 1.11 |
| RTX 3090 | llama 1B BF16 | 256 | pp512 | 16932.85 | 18310.51 | 1.08 |
| RTX 3090 | llama 1B BF16 | 512 | pp512 | 17751.37 | 18799.87 | 1.06 |
| RTX 3090 | llama 1B F16 | 1 | pp512 | 371.74 | 379.20 | 1.02 |
| RTX 3090 | llama 1B F16 | 2 | pp512 | 666.06 | 682.74 | 1.03 |
| RTX 3090 | llama 1B F16 | 4 | pp512 | 1319.42 | 1349.54 | 1.02 |
| RTX 3090 | llama 1B F16 | 8 | pp512 | 2535.93 | 2602.11 | 1.03 |
| RTX 3090 | llama 1B F16 | 16 | pp512 | 4428.90 | 4588.29 | 1.04 |
| RTX 3090 | llama 1B F16 | 32 | pp512 | 7428.00 | 7671.32 | 1.03 |
| RTX 3090 | llama 1B F16 | 64 | pp512 | 11981.65 | 12340.13 | 1.03 |
| RTX 3090 | llama 1B F16 | 128 | pp512 | 16822.95 | 17602.12 | 1.05 |
| RTX 3090 | llama 1B F16 | 256 | pp512 | 22263.64 | 23758.31 | 1.07 |
| RTX 3090 | llama 1B all F32 | 1 | pp512 | 212.36 | 213.02 | 1.00 |
| RTX 3090 | llama 1B all F32 | 2 | pp512 | 406.18 | 407.89 | 1.00 |
| RTX 3090 | llama 1B all F32 | 4 | pp512 | 752.88 | 756.81 | 1.01 |
| RTX 3090 | llama 1B all F32 | 8 | pp512 | 1479.26 | 1486.62 | 1.00 |
| RTX 3090 | llama 1B all F32 | 16 | pp512 | 2731.06 | 2748.13 | 1.01 |
| RTX 3090 | llama 1B all F32 | 32 | pp512 | 4277.27 | 4295.13 | 1.00 |
| RTX 3090 | llama 1B all F32 | 64 | pp512 | 6877.09 | 6925.45 | 1.01 |
| RTX 3090 | llama 1B all F32 | 128 | pp512 | 8896.66 | 9077.68 | 1.02 |
| RTX 3090 | llama 1B all F32 | 256 | pp512 | 10003.21 | 10126.98 | 1.01 |
| RTX 3090 | llama 1B all F32 | 512 | pp512 | 11242.72 | 11559.53 | 1.03 |
| RTX 4090 | llama 1B BF16 | 1 | pp512 | 425.77 | 425.99 | 1.00 |
| RTX 4090 | llama 1B BF16 | 2 | pp512 | 789.21 | 789.90 | 1.00 |
| RTX 4090 | llama 1B BF16 | 4 | pp512 | 1550.21 | 1551.14 | 1.00 |
| RTX 4090 | llama 1B BF16 | 8 | pp512 | 3031.05 | 3036.30 | 1.00 |
| RTX 4090 | llama 1B BF16 | 16 | pp512 | 5596.80 | 5592.70 | 1.00 |
| RTX 4090 | llama 1B BF16 | 32 | pp512 | 8854.73 | 9249.30 | 1.04 |
| RTX 4090 | llama 1B BF16 | 64 | pp512 | 15537.79 | 16387.25 | 1.05 |
| RTX 4090 | llama 1B BF16 | 128 | pp512 | 25215.05 | 26624.25 | 1.06 |
| RTX 4090 | llama 1B BF16 | 256 | pp512 | 40464.50 | 42534.93 | 1.05 |
| RTX 4090 | llama 1B BF16 | 512 | pp512 | 45127.23 | 47941.71 | 1.06 |
| RTX 4090 | llama 1B F16 | 1 | pp512 | 425.96 | 425.99 | 1.00 |
| RTX 4090 | llama 1B F16 | 2 | pp512 | 789.44 | 789.75 | 1.00 |
| RTX 4090 | llama 1B F16 | 4 | pp512 | 1549.41 | 1550.96 | 1.00 |
| RTX 4090 | llama 1B F16 | 8 | pp512 | 3032.83 | 3035.94 | 1.00 |
| RTX 4090 | llama 1B F16 | 16 | pp512 | 5598.15 | 5592.64 | 1.00 |
| RTX 4090 | llama 1B F16 | 32 | pp512 | 9496.76 | 9542.09 | 1.00 |
| RTX 4090 | llama 1B F16 | 64 | pp512 | 16176.69 | 16254.76 | 1.00 |
| RTX 4090 | llama 1B F16 | 128 | pp512 | 25624.31 | 25830.67 | 1.01 |
| RTX 4090 | llama 1B F16 | 256 | pp512 | 41440.12 | 41504.56 | 1.00 |
| RTX 4090 | llama 1B F16 | 512 | pp512 | 50917.05 | 50968.09 | 1.00 |
| RTX 4090 | llama 1B all F32 | 1 | pp512 | 233.91 | 233.98 | 1.00 |
| RTX 4090 | llama 1B all F32 | 2 | pp512 | 455.91 | 456.04 | 1.00 |
| RTX 4090 | llama 1B all F32 | 4 | pp512 | 875.93 | 876.16 | 1.00 |
| RTX 4090 | llama 1B all F32 | 8 | pp512 | 1723.86 | 1725.67 | 1.00 |
| RTX 4090 | llama 1B all F32 | 16 | pp512 | 3262.07 | 3261.50 | 1.00 |
| RTX 4090 | llama 1B all F32 | 32 | pp512 | 5826.81 | 5842.87 | 1.00 |
| RTX 4090 | llama 1B all F32 | 64 | pp512 | 9889.02 | 9930.50 | 1.00 |
| RTX 4090 | llama 1B all F32 | 128 | pp512 | 17775.00 | 17840.35 | 1.00 |
| RTX 4090 | llama 1B all F32 | 256 | pp512 | 28359.85 | 28386.99 | 1.00 |
| RTX 4090 | llama 1B all F32 | 512 | pp512 | 31088.91 | 31255.26 | 1.01 |
| RTX 5090 | llama 1B BF16 | 1 | pp512 | 652.28 | 652.55 | 1.00 |
| RTX 5090 | llama 1B BF16 | 2 | pp512 | 1135.76 | 1136.47 | 1.00 |
| RTX 5090 | llama 1B BF16 | 4 | pp512 | 2207.19 | 2210.61 | 1.00 |
| RTX 5090 | llama 1B BF16 | 8 | pp512 | 4279.87 | 4280.67 | 1.00 |
| RTX 5090 | llama 1B BF16 | 16 | pp512 | 7769.02 | 7775.26 | 1.00 |
| RTX 5090 | llama 1B BF16 | 32 | pp512 | 11708.64 | 13615.80 | 1.16 |
| RTX 5090 | llama 1B BF16 | 64 | pp512 | 22135.09 | 25043.46 | 1.13 |
| RTX 5090 | llama 1B BF16 | 128 | pp512 | 36616.26 | 38537.14 | 1.05 |
| RTX 5090 | llama 1B BF16 | 256 | pp512 | 45934.60 | 49275.97 | 1.07 |
| RTX 5090 | llama 1B BF16 | 512 | pp512 | 50115.04 | 53303.91 | 1.06 |
| RTX 5090 | llama 1B F16 | 1 | pp512 | 646.98 | 646.74 | 1.00 |
| RTX 5090 | llama 1B F16 | 2 | pp512 | 1135.15 | 1135.60 | 1.00 |
| RTX 5090 | llama 1B F16 | 4 | pp512 | 2207.81 | 2207.91 | 1.00 |
| RTX 5090 | llama 1B F16 | 8 | pp512 | 4280.34 | 4282.19 | 1.00 |
| RTX 5090 | llama 1B F16 | 16 | pp512 | 7775.83 | 7777.37 | 1.00 |
| RTX 5090 | llama 1B F16 | 32 | pp512 | 11766.77 | 11935.76 | 1.01 |
| RTX 5090 | llama 1B F16 | 64 | pp512 | 23257.38 | 23385.02 | 1.01 |
| RTX 5090 | llama 1B F16 | 128 | pp512 | 40166.14 | 40217.29 | 1.00 |
| RTX 5090 | llama 1B F16 | 256 | pp512 | 55123.24 | 55467.24 | 1.01 |
| RTX 5090 | llama 1B F16 | 512 | pp512 | 62279.82 | 62242.57 | 1.00 |
| RTX 5090 | llama 1B all F32 | 1 | pp512 | 383.44 | 383.50 | 1.00 |
| RTX 5090 | llama 1B all F32 | 2 | pp512 | 728.13 | 728.41 | 1.00 |
| RTX 5090 | llama 1B all F32 | 4 | pp512 | 1321.85 | 1322.33 | 1.00 |
| RTX 5090 | llama 1B all F32 | 8 | pp512 | 2582.50 | 2583.16 | 1.00 |
| RTX 5090 | llama 1B all F32 | 16 | pp512 | 4762.63 | 4759.02 | 1.00 |
| RTX 5090 | llama 1B all F32 | 32 | pp512 | 9246.97 | 9284.36 | 1.00 |
| RTX 5090 | llama 1B all F32 | 64 | pp512 | 17543.67 | 17568.27 | 1.00 |
| RTX 5090 | llama 1B all F32 | 128 | pp512 | 24183.39 | 24199.60 | 1.00 |
| RTX 5090 | llama 1B all F32 | 256 | pp512 | 34298.63 | 34255.12 | 1.00 |
| RTX 5090 | llama 1B all F32 | 512 | pp512 | 35015.29 | 34907.33 | 1.00 |
| RX 6800 | llama 1B BF16 | 1 | pp512 | 140.89 | 140.61 | 1.00 |
| RX 6800 | llama 1B BF16 | 2 | pp512 | 223.36 | 223.35 | 1.00 |
| RX 6800 | llama 1B BF16 | 4 | pp512 | 326.79 | 326.75 | 1.00 |
| RX 6800 | llama 1B BF16 | 8 | pp512 | 415.66 | 415.52 | 1.00 |
| RX 6800 | llama 1B BF16 | 16 | pp512 | 89.88 | 302.03 | 3.36 |
| RX 6800 | llama 1B BF16 | 32 | pp512 | 178.53 | 591.41 | 3.31 |
| RX 6800 | llama 1B BF16 | 64 | pp512 | 352.24 | 748.58 | 2.13 |
| RX 6800 | llama 1B BF16 | 128 | pp512 | 1356.07 | 1379.93 | 1.02 |
| RX 6800 | llama 1B BF16 | 256 | pp512 | 1462.92 | 1484.09 | 1.01 |
| RX 6800 | llama 1B BF16 | 512 | pp512 | 1517.85 | 1532.81 | 1.01 |
| RX 6800 | llama 1B F16 | 1 | pp512 | 140.21 | 140.53 | 1.00 |
| RX 6800 | llama 1B F16 | 2 | pp512 | 223.39 | 223.92 | 1.00 |
| RX 6800 | llama 1B F16 | 4 | pp512 | 322.83 | 322.89 | 1.00 |
| RX 6800 | llama 1B F16 | 8 | pp512 | 415.47 | 415.42 | 1.00 |
| RX 6800 | llama 1B F16 | 16 | pp512 | 414.82 | 414.78 | 1.00 |
| RX 6800 | llama 1B F16 | 32 | pp512 | 744.28 | 744.84 | 1.00 |
| RX 6800 | llama 1B F16 | 64 | pp512 | 1244.90 | 1246.88 | 1.00 |
| RX 6800 | llama 1B F16 | 128 | pp512 | 2252.92 | 2253.02 | 1.00 |
| RX 6800 | llama 1B F16 | 256 | pp512 | 3206.75 | 3208.41 | 1.00 |
| RX 6800 | llama 1B F16 | 512 | pp512 | 4022.40 | 4019.53 | 1.00 |
| RX 6800 | llama 1B all F32 | 1 | pp512 | 107.34 | 107.49 | 1.00 |
| RX 6800 | llama 1B all F32 | 2 | pp512 | 195.05 | 194.88 | 1.00 |
| RX 6800 | llama 1B all F32 | 4 | pp512 | 306.51 | 306.32 | 1.00 |
| RX 6800 | llama 1B all F32 | 8 | pp512 | 388.63 | 388.68 | 1.00 |
| RX 6800 | llama 1B all F32 | 16 | pp512 | 364.50 | 364.83 | 1.00 |
| RX 6800 | llama 1B all F32 | 32 | pp512 | 683.69 | 685.36 | 1.00 |
| RX 6800 | llama 1B all F32 | 64 | pp512 | 1236.01 | 1240.02 | 1.00 |
| RX 6800 | llama 1B all F32 | 128 | pp512 | 1798.66 | 1803.83 | 1.00 |
| RX 6800 | llama 1B all F32 | 256 | pp512 | 2239.35 | 2242.15 | 1.00 |
| RX 6800 | llama 1B all F32 | 512 | pp512 | 2578.39 | 2580.51 | 1.00 |
| RX 9060 XT | llama 1B BF16 | 1 | pp512 | 120.71 | 120.38 | 1.00 |
| RX 9060 XT | llama 1B BF16 | 2 | pp512 | 215.73 | 215.13 | 1.00 |
| RX 9060 XT | llama 1B BF16 | 4 | pp512 | 315.51 | 315.12 | 1.00 |
| RX 9060 XT | llama 1B BF16 | 8 | pp512 | 620.76 | 620.78 | 1.00 |
| RX 9060 XT | llama 1B BF16 | 16 | pp512 | 1039.74 | 1037.91 | 1.00 |
| RX 9060 XT | llama 1B BF16 | 32 | pp512 | 337.31 | 933.11 | 2.77 |
| RX 9060 XT | llama 1B BF16 | 64 | pp512 | 681.73 | 1272.24 | 1.87 |
| RX 9060 XT | llama 1B BF16 | 128 | pp512 | 2242.52 | 2291.23 | 1.02 |
| RX 9060 XT | llama 1B BF16 | 256 | pp512 | 2462.30 | 2500.53 | 1.02 |
| RX 9060 XT | llama 1B BF16 | 512 | pp512 | 2534.76 | 2566.46 | 1.01 |
| RX 9060 XT | llama 1B F16 | 1 | pp512 | 121.98 | 122.22 | 1.00 |
| RX 9060 XT | llama 1B F16 | 2 | pp512 | 213.85 | 213.93 | 1.00 |
| RX 9060 XT | llama 1B F16 | 4 | pp512 | 334.01 | 334.19 | 1.00 |
| RX 9060 XT | llama 1B F16 | 8 | pp512 | 623.89 | 623.55 | 1.00 |
| RX 9060 XT | llama 1B F16 | 16 | pp512 | 1050.26 | 1049.92 | 1.00 |
| RX 9060 XT | llama 1B F16 | 32 | pp512 | 959.87 | 968.22 | 1.01 |
| RX 9060 XT | llama 1B F16 | 64 | pp512 | 1325.28 | 1322.83 | 1.00 |
| RX 9060 XT | llama 1B F16 | 128 | pp512 | 2440.64 | 2441.79 | 1.00 |
| RX 9060 XT | llama 1B F16 | 256 | pp512 | 2676.12 | 2678.64 | 1.00 |
| RX 9060 XT | llama 1B F16 | 512 | pp512 | 2744.70 | 2745.93 | 1.00 |
| RX 9060 XT | llama 1B all F32 | 1 | pp512 | 69.99 | 69.95 | 1.00 |
| RX 9060 XT | llama 1B all F32 | 2 | pp512 | 137.41 | 137.46 | 1.00 |
| RX 9060 XT | llama 1B all F32 | 4 | pp512 | 246.63 | 246.63 | 1.00 |
| RX 9060 XT | llama 1B all F32 | 8 | pp512 | 418.84 | 418.86 | 1.00 |
| RX 9060 XT | llama 1B all F32 | 16 | pp512 | 568.00 | 569.78 | 1.00 |
| RX 9060 XT | llama 1B all F32 | 32 | pp512 | 985.73 | 987.52 | 1.00 |
| RX 9060 XT | llama 1B all F32 | 64 | pp512 | 2006.87 | 2014.80 | 1.00 |
| RX 9060 XT | llama 1B all F32 | 128 | pp512 | 2853.30 | 2851.53 | 1.00 |
| RX 9060 XT | llama 1B all F32 | 256 | pp512 | 3235.11 | 3257.06 | 1.01 |
| RX 9060 XT | llama 1B all F32 | 512 | pp512 | 3335.51 | 3357.20 | 1.01 |
| V100-PCIE-32GB | llama 1B BF16 | 1 | pp512 | 325.62 | 325.01 | 1.00 |
| V100-PCIE-32GB | llama 1B BF16 | 2 | pp512 | 537.75 | 537.17 | 1.00 |
| V100-PCIE-32GB | llama 1B BF16 | 4 | pp512 | 653.89 | 652.37 | 1.00 |
| V100-PCIE-32GB | llama 1B BF16 | 8 | pp512 | 869.30 | 868.18 | 1.00 |
| V100-PCIE-32GB | llama 1B BF16 | 16 | pp512 | 284.82 | 287.55 | 1.01 |
| V100-PCIE-32GB | llama 1B BF16 | 32 | pp512 | 566.06 | 570.50 | 1.01 |
| V100-PCIE-32GB | llama 1B BF16 | 64 | pp512 | 1115.64 | 1123.02 | 1.01 |
| V100-PCIE-32GB | llama 1B BF16 | 128 | pp512 | 2066.07 | 2091.14 | 1.01 |
| V100-PCIE-32GB | llama 1B BF16 | 256 | pp512 | 3210.61 | 3281.34 | 1.02 |
| V100-PCIE-32GB | llama 1B BF16 | 512 | pp512 | 3995.61 | 4134.26 | 1.03 |
| V100-PCIE-32GB | llama 1B F16 | 1 | pp512 | 328.50 | 330.47 | 1.01 |
| V100-PCIE-32GB | llama 1B F16 | 2 | pp512 | 569.55 | 571.23 | 1.00 |
| V100-PCIE-32GB | llama 1B F16 | 4 | pp512 | 1106.23 | 1109.90 | 1.00 |
| V100-PCIE-32GB | llama 1B F16 | 8 | pp512 | 2060.61 | 2066.17 | 1.00 |
| V100-PCIE-32GB | llama 1B F16 | 16 | pp512 | 3614.54 | 3624.31 | 1.00 |
| V100-PCIE-32GB | llama 1B F16 | 32 | pp512 | 5629.71 | 5640.56 | 1.00 |
| V100-PCIE-32GB | llama 1B F16 | 64 | pp512 | 8048.95 | 8064.94 | 1.00 |
| V100-PCIE-32GB | llama 1B F16 | 128 | pp512 | 12657.67 | 12631.35 | 1.00 |
| V100-PCIE-32GB | llama 1B F16 | 256 | pp512 | 18538.26 | 18519.29 | 1.00 |
| V100-PCIE-32GB | llama 1B F16 | 512 | pp512 | 21341.59 | 21272.21 | 1.00 |
| V100-PCIE-32GB | llama 1B all F32 | 1 | pp512 | 194.77 | 194.28 | 1.00 |
| V100-PCIE-32GB | llama 1B all F32 | 2 | pp512 | 373.55 | 372.24 | 1.00 |
| V100-PCIE-32GB | llama 1B all F32 | 4 | pp512 | 581.72 | 579.36 | 1.00 |
| V100-PCIE-32GB | llama 1B all F32 | 8 | pp512 | 980.29 | 979.10 | 1.00 |
| V100-PCIE-32GB | llama 1B all F32 | 16 | pp512 | 1698.98 | 1699.42 | 1.00 |
| V100-PCIE-32GB | llama 1B all F32 | 32 | pp512 | 3404.69 | 3394.02 | 1.00 |
| V100-PCIE-32GB | llama 1B all F32 | 64 | pp512 | 4043.20 | 4033.91 | 1.00 |
| V100-PCIE-32GB | llama 1B all F32 | 128 | pp512 | 5283.98 | 5281.24 | 1.00 |
| V100-PCIE-32GB | llama 1B all F32 | 256 | pp512 | 5358.98 | 5366.09 | 1.00 |
| V100-PCIE-32GB | llama 1B all F32 | 512 | pp512 | 5776.31 | 5717.11 | 0.99 |
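
For reading the table: the Speedup column appears to be the throughput of the new commit divided by that of the old commit, rounded to two decimals, so values above 1.00 mean the new commit is faster. A small sketch of that arithmetic follows; the helper names `speedup` and `round2` are made up for this example.

```cpp
#include <cmath>

// Ratio of new to old throughput in tokens per second; > 1 means the new commit is faster.
double speedup(double ts_old, double ts_new) {
    return ts_new / ts_old;
}

// Round to two decimal places, matching the precision shown in the table.
double round2(double x) {
    return std::round(x * 100.0) / 100.0;
}
```

For example, the MI60 / MI50 BF16 row at microbatch size 16 gives `round2(speedup(142.25, 482.60))`, which is 3.39, matching the table.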

Requirements

@JohannesGaessler requested review from a team and CISC as code owners June 5, 2026 22:01
@github-actions (bot) added the documentation, Nvidia GPU, and ggml labels Jun 5, 2026

JohannesGaessler commented Jun 5, 2026

Sorry, I accidentally pushed the wrong logic for CDNA + BF16. This is the performance with the correct logic:

Performance
| GPU | Model | Microbatch size | Test | t/s 7c158fb | t/s 5926df7 | Speedup |
|:---|:---|---:|:---|---:|---:|---:|
| MI100 | llama 1B BF16 | 1 | pp512 | 294.34 | 296.41 | 1.01 |
| MI100 | llama 1B BF16 | 2 | pp512 | 472.87 | 480.12 | 1.02 |
| MI100 | llama 1B BF16 | 4 | pp512 | 561.78 | 570.92 | 1.02 |
| MI100 | llama 1B BF16 | 8 | pp512 | 1098.79 | 1105.67 | 1.01 |
| MI100 | llama 1B BF16 | 16 | pp512 | 2025.16 | 2030.43 | 1.00 |
| MI100 | llama 1B BF16 | 32 | pp512 | 3515.57 | 3579.49 | 1.02 |
| MI100 | llama 1B BF16 | 64 | pp512 | 6495.60 | 6476.93 | 1.00 |
| MI100 | llama 1B BF16 | 128 | pp512 | 9127.54 | 9335.99 | 1.02 |
| MI100 | llama 1B BF16 | 256 | pp512 | 13155.13 | 13325.05 | 1.01 |
| MI100 | llama 1B BF16 | 512 | pp512 | 15960.71 | 16296.73 | 1.02 |
| MI100 | llama 1B F16 | 1 | pp512 | 289.68 | 296.08 | 1.02 |
| MI100 | llama 1B F16 | 2 | pp512 | 475.45 | 476.96 | 1.00 |
| MI100 | llama 1B F16 | 4 | pp512 | 518.73 | 519.07 | 1.00 |
| MI100 | llama 1B F16 | 8 | pp512 | 1021.62 | 1017.50 | 1.00 |
| MI100 | llama 1B F16 | 16 | pp512 | 1897.49 | 1898.38 | 1.00 |
| MI100 | llama 1B F16 | 32 | pp512 | 3473.95 | 3584.72 | 1.03 |
| MI100 | llama 1B F16 | 64 | pp512 | 6629.11 | 6654.02 | 1.00 |
| MI100 | llama 1B F16 | 128 | pp512 | 9034.29 | 8697.71 | 0.96 |
| MI100 | llama 1B F16 | 256 | pp512 | 15932.49 | 15949.21 | 1.00 |
| MI100 | llama 1B F16 | 512 | pp512 | 18669.41 | 19090.68 | 1.02 |
| MI100 | llama 1B all F32 | 1 | pp512 | 184.23 | 187.24 | 1.02 |
| MI100 | llama 1B all F32 | 2 | pp512 | 336.06 | 342.03 | 1.02 |
| MI100 | llama 1B all F32 | 4 | pp512 | 177.15 | 177.13 | 1.00 |
| MI100 | llama 1B all F32 | 8 | pp512 | 338.85 | 339.65 | 1.00 |
| MI100 | llama 1B all F32 | 16 | pp512 | 622.06 | 623.62 | 1.00 |
| MI100 | llama 1B all F32 | 32 | pp512 | 3016.91 | 3046.75 | 1.01 |
| MI100 | llama 1B all F32 | 64 | pp512 | 5596.68 | 5668.64 | 1.01 |
| MI100 | llama 1B all F32 | 128 | pp512 | 6904.30 | 7021.56 | 1.02 |
| MI100 | llama 1B all F32 | 256 | pp512 | 8824.88 | 8984.72 | 1.02 |
| MI100 | llama 1B all F32 | 512 | pp512 | 10574.72 | 10815.19 | 1.02 |

@JohannesGaessler

The CI was failing because I had accidentally copied a return statement when restructuring the code. The code paths on which I tested the performance were unaffected.

