
Graph parallel for Qwen-3.5-MoE #1347

Merged
ikawrakow merged 3 commits into main from ik/sm_graph_qwen35moe
Mar 2, 2026

Conversation

@ikawrakow
Owner

As with graph parallel for Qwen3-Next and the dense Qwen-3.5 models, recurrent attention layers are not parallelized over GPUs.

My guess is that graph parallel will do nothing for hybrid inference.

But for Qwen-3.5-197B-A17B-IQ2_KL fully offloaded on an 8x3090 system, I do observe a small benefit from graph parallel (a.k.a. split mode graph). This model has only 2 KV attention heads, so using more than 2 GPUs at a time only slows things down. Here are some sweep-bench results:

Split mode graph (-sm graph --max-gpu 2)

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 2048 | 128 | 0 | 2.052 | 998.16 | 3.156 | 40.55 |
| 2048 | 128 | 2048 | 1.948 | 1051.54 | 3.147 | 40.68 |
| 2048 | 128 | 4096 | 1.946 | 1052.37 | 3.117 | 41.06 |
| 2048 | 128 | 6144 | 1.969 | 1039.89 | 3.128 | 40.92 |
| 2048 | 128 | 8192 | 1.981 | 1033.79 | 3.141 | 40.75 |
| 2048 | 128 | 10240 | 1.996 | 1026.22 | 3.156 | 40.55 |
| 2048 | 128 | 12288 | 2.023 | 1012.49 | 3.185 | 40.19 |
| 2048 | 128 | 14336 | 2.041 | 1003.66 | 3.195 | 40.06 |
| 2048 | 128 | 16384 | 2.059 | 994.57 | 3.201 | 39.99 |
| 2048 | 128 | 18432 | 2.073 | 987.76 | 3.204 | 39.95 |
| 2048 | 128 | 20480 | 2.094 | 977.84 | 3.211 | 39.86 |
| 2048 | 128 | 22528 | 2.106 | 972.31 | 3.246 | 39.44 |
| 2048 | 128 | 24576 | 2.128 | 962.55 | 3.250 | 39.39 |
| 2048 | 128 | 26624 | 2.142 | 956.21 | 3.259 | 39.28 |
| 2048 | 128 | 28672 | 2.165 | 945.95 | 3.266 | 39.19 |
| 2048 | 128 | 30720 | 2.187 | 936.59 | 3.277 | 39.06 |
| 2048 | 128 | 32768 | 2.205 | 928.64 | 3.311 | 38.66 |
| 2048 | 128 | 34816 | 2.220 | 922.58 | 3.318 | 38.58 |
| 2048 | 128 | 36864 | 2.249 | 910.72 | 3.316 | 38.60 |
| 2048 | 128 | 38912 | 2.264 | 904.48 | 3.327 | 38.47 |
| 2048 | 128 | 40960 | 2.281 | 897.82 | 3.328 | 38.46 |
| 2048 | 128 | 43008 | 2.298 | 891.03 | 3.362 | 38.07 |
| 2048 | 128 | 45056 | 2.322 | 882.03 | 3.371 | 37.97 |
| 2048 | 128 | 47104 | 2.344 | 873.66 | 3.380 | 37.86 |
| 2048 | 128 | 49152 | 2.361 | 867.46 | 3.385 | 37.82 |
| 2048 | 128 | 51200 | 2.384 | 859.04 | 3.396 | 37.69 |
| 2048 | 128 | 53248 | 2.403 | 852.19 | 3.420 | 37.42 |
| 2048 | 128 | 55296 | 2.431 | 842.37 | 3.426 | 37.36 |
| 2048 | 128 | 57344 | 2.445 | 837.62 | 3.437 | 37.24 |
| 2048 | 128 | 59392 | 2.469 | 829.55 | 3.444 | 37.16 |
| 2048 | 128 | 61440 | 2.486 | 823.79 | 3.449 | 37.12 |
| 2048 | 128 | 63488 | 2.509 | 816.25 | 3.486 | 36.71 |

Split mode layer

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 2048 | 128 | 0 | 2.163 | 946.73 | 3.140 | 40.77 |
| 2048 | 128 | 2048 | 2.110 | 970.62 | 3.113 | 41.11 |
| 2048 | 128 | 4096 | 2.127 | 962.76 | 3.117 | 41.07 |
| 2048 | 128 | 6144 | 2.167 | 945.09 | 3.150 | 40.64 |
| 2048 | 128 | 8192 | 2.192 | 934.35 | 3.170 | 40.38 |
| 2048 | 128 | 10240 | 2.220 | 922.71 | 3.202 | 39.98 |
| 2048 | 128 | 12288 | 2.259 | 906.49 | 3.228 | 39.66 |
| 2048 | 128 | 14336 | 2.282 | 897.45 | 3.248 | 39.40 |
| 2048 | 128 | 16384 | 2.322 | 881.92 | 3.283 | 38.99 |
| 2048 | 128 | 18432 | 2.350 | 871.44 | 3.295 | 38.85 |
| 2048 | 128 | 20480 | 2.381 | 860.31 | 3.314 | 38.63 |
| 2048 | 128 | 22528 | 2.411 | 849.57 | 3.351 | 38.19 |
| 2048 | 128 | 24576 | 2.450 | 835.95 | 3.362 | 38.07 |
| 2048 | 128 | 26624 | 2.472 | 828.47 | 3.398 | 37.67 |
| 2048 | 128 | 28672 | 2.516 | 813.91 | 3.412 | 37.52 |
| 2048 | 128 | 30720 | 2.544 | 804.96 | 3.430 | 37.32 |
| 2048 | 128 | 32768 | 2.586 | 791.92 | 3.468 | 36.91 |
| 2048 | 128 | 34816 | 2.612 | 784.10 | 3.480 | 36.78 |
| 2048 | 128 | 36864 | 2.652 | 772.20 | 3.517 | 36.40 |
| 2048 | 128 | 38912 | 2.683 | 763.23 | 3.531 | 36.25 |
| 2048 | 128 | 40960 | 2.744 | 746.29 | 3.551 | 36.05 |
| 2048 | 128 | 43008 | 2.748 | 745.18 | 3.585 | 35.71 |
| 2048 | 128 | 45056 | 2.777 | 737.51 | 3.599 | 35.56 |
| 2048 | 128 | 47104 | 2.811 | 728.62 | 3.628 | 35.28 |
| 2048 | 128 | 49152 | 2.853 | 717.95 | 3.650 | 35.07 |
| 2048 | 128 | 51200 | 2.882 | 710.57 | 3.668 | 34.89 |
| 2048 | 128 | 53248 | 2.923 | 700.72 | 3.703 | 34.57 |
| 2048 | 128 | 55296 | 2.958 | 692.34 | 3.721 | 34.40 |
| 2048 | 128 | 57344 | 2.983 | 686.67 | 3.740 | 34.22 |
| 2048 | 128 | 59392 | 3.019 | 678.41 | 3.773 | 33.93 |
| 2048 | 128 | 61440 | 3.048 | 672.00 | 3.792 | 33.76 |
| 2048 | 128 | 63488 | 3.073 | 666.36 | 3.830 | 33.42 |
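Comparing the endpoints of the two tables quantifies the benefit: at N_KV = 63488, split mode graph sustains 816.25 t/s PP and 36.71 t/s TG versus 666.36 and 33.42 for split mode layer. A quick sketch of the relative gain (plain Python, just arithmetic on the numbers above):

```python
# Relative speedup of split mode graph over split mode layer at N_KV = 63488,
# using the PP and TG throughputs from the two tables above.
def pct_gain(new, old):
    """Percentage improvement of `new` over `old`."""
    return 100.0 * (new - old) / old

pp_gain = pct_gain(816.25, 666.36)  # prompt processing, t/s
tg_gain = pct_gain(36.71, 33.42)    # token generation, t/s
print(f"PP: +{pp_gain:.1f}%  TG: +{tg_gain:.1f}%")  # prints: PP: +22.5%  TG: +9.8%
```

So at ~64k depth, `-sm graph --max-gpu 2` is roughly 22% faster for prompt processing and about 10% faster for token generation; the gap grows with context depth (at zero depth the two modes are nearly tied).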

ikawrakow merged commit d239dab into main on Mar 2, 2026
@ubergarm
Contributor

ubergarm commented Mar 2, 2026

Here are some sweep-bench -sm graph results for full 2xGPU offload of ubergarm/Qwen3.5-122B-A10B IQ4_KSS, 61.219 GiB (4.306 BPW):

[Plot: sweep-bench-Qwen3.5-122B-A10B-PR1347]
👈 Details

-sm layer

./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -ger \
  --merge-qkv \
  -sm layer \
  -ngl 999 \
  -ub 4096 -b 4096 \
  --threads 1 \
  --no-mmap \
  -n 128 \
  --warmup-batch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 128 | 0 | 2.162 | 1894.49 | 2.190 | 58.45 |
| 4096 | 128 | 4096 | 2.255 | 1816.16 | 2.160 | 59.26 |
| 4096 | 128 | 8192 | 2.346 | 1745.71 | 2.189 | 58.47 |
| 4096 | 128 | 12288 | 2.452 | 1670.64 | 2.212 | 57.87 |
| 4096 | 128 | 16384 | 2.560 | 1600.31 | 2.241 | 57.10 |
| 4096 | 128 | 20480 | 2.656 | 1541.91 | 2.248 | 56.94 |
| 4096 | 128 | 24576 | 2.750 | 1489.30 | 2.273 | 56.32 |
| 4096 | 128 | 28672 | 2.846 | 1438.99 | 2.298 | 55.69 |
| 4096 | 128 | 32768 | 2.945 | 1390.76 | 2.329 | 54.97 |
| 4096 | 128 | 36864 | 3.035 | 1349.41 | 2.332 | 54.89 |
| 4096 | 128 | 40960 | 3.133 | 1307.39 | 2.359 | 54.27 |
| 4096 | 128 | 45056 | 3.225 | 1270.12 | 2.386 | 53.65 |
| 4096 | 128 | 49152 | 3.324 | 1232.24 | 2.412 | 53.07 |
| 4096 | 128 | 53248 | 3.420 | 1197.62 | 2.418 | 52.94 |
| 4096 | 128 | 57344 | 3.504 | 1168.83 | 2.442 | 52.41 |
| 4096 | 128 | 61440 | 3.617 | 1132.43 | 2.469 | 51.85 |
| 4096 | 128 | 65536 | 3.688 | 1110.64 | 2.497 | 51.26 |
| 4096 | 128 | 69632 | 3.800 | 1077.80 | 2.505 | 51.10 |
| 4096 | 128 | 73728 | 3.900 | 1050.21 | 2.528 | 50.63 |
| 4096 | 128 | 77824 | 3.985 | 1027.92 | 2.552 | 50.15 |
| 4096 | 128 | 81920 | 4.070 | 1006.40 | 2.582 | 49.57 |
| 4096 | 128 | 86016 | 4.162 | 984.21 | 2.590 | 49.43 |
| 4096 | 128 | 90112 | 4.272 | 958.74 | 2.616 | 48.93 |
| 4096 | 128 | 94208 | 4.354 | 940.75 | 2.640 | 48.49 |
| 4096 | 128 | 98304 | 4.462 | 917.88 | 2.662 | 48.08 |
| 4096 | 128 | 102400 | 4.539 | 902.50 | 2.686 | 47.65 |
| 4096 | 128 | 106496 | 4.636 | 883.50 | 2.695 | 47.49 |
| 4096 | 128 | 110592 | 4.725 | 866.83 | 2.722 | 47.03 |
| 4096 | 128 | 114688 | 4.826 | 848.68 | 2.748 | 46.59 |
| 4096 | 128 | 118784 | 4.910 | 834.17 | 2.776 | 46.11 |
| 4096 | 128 | 122880 | 5.008 | 817.84 | 2.780 | 46.04 |
| 4096 | 128 | 126976 | 5.101 | 802.99 | 2.804 | 45.65 |
| 4096 | 128 | 131072 | 5.191 | 789.02 | 2.832 | 45.20 |

-sm graph

./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -ger \
  -sm graph \
  -ngl 999 \
  -ub 4096 -b 4096 \
  --threads 1 \
  --no-mmap \
  -n 128 \
  --warmup-batch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 128 | 0 | 2.084 | 1965.24 | 2.443 | 52.39 |
| 4096 | 128 | 4096 | 2.120 | 1932.28 | 2.401 | 53.30 |
| 4096 | 128 | 8192 | 2.185 | 1874.88 | 2.415 | 53.01 |
| 4096 | 128 | 12288 | 2.233 | 1834.48 | 2.449 | 52.27 |
| 4096 | 128 | 16384 | 2.292 | 1786.87 | 2.454 | 52.16 |
| 4096 | 128 | 20480 | 2.341 | 1749.32 | 2.460 | 52.03 |
| 4096 | 128 | 24576 | 2.407 | 1701.54 | 2.485 | 51.50 |
| 4096 | 128 | 28672 | 2.469 | 1659.25 | 2.492 | 51.37 |
| 4096 | 128 | 32768 | 2.526 | 1621.22 | 2.517 | 50.85 |
| 4096 | 128 | 36864 | 2.582 | 1586.34 | 2.520 | 50.80 |
| 4096 | 128 | 40960 | 2.637 | 1553.19 | 2.524 | 50.71 |
| 4096 | 128 | 45056 | 2.693 | 1520.74 | 2.547 | 50.26 |
| 4096 | 128 | 49152 | 2.747 | 1490.95 | 2.550 | 50.19 |
| 4096 | 128 | 53248 | 2.804 | 1460.72 | 2.560 | 50.00 |
| 4096 | 128 | 57344 | 2.845 | 1439.67 | 2.587 | 49.47 |
| 4096 | 128 | 61440 | 2.899 | 1412.80 | 2.593 | 49.37 |
| 4096 | 128 | 65536 | 2.941 | 1392.85 | 2.615 | 48.95 |
| 4096 | 128 | 69632 | 3.003 | 1363.85 | 2.620 | 48.85 |
| 4096 | 128 | 73728 | 3.053 | 1341.43 | 2.624 | 48.77 |
| 4096 | 128 | 77824 | 3.099 | 1321.57 | 2.647 | 48.35 |
| 4096 | 128 | 81920 | 3.161 | 1295.68 | 2.651 | 48.29 |
| 4096 | 128 | 86016 | 3.207 | 1277.22 | 2.664 | 48.05 |
| 4096 | 128 | 90112 | 3.261 | 1256.03 | 2.679 | 47.78 |
| 4096 | 128 | 94208 | 3.318 | 1234.57 | 2.684 | 47.68 |
| 4096 | 128 | 98304 | 3.359 | 1219.55 | 2.707 | 47.28 |
| 4096 | 128 | 102400 | 3.422 | 1196.92 | 2.710 | 47.23 |
| 4096 | 128 | 106496 | 3.469 | 1180.86 | 2.720 | 47.07 |
| 4096 | 128 | 110592 | 3.525 | 1161.99 | 2.739 | 46.73 |
| 4096 | 128 | 114688 | 3.567 | 1148.32 | 2.741 | 46.70 |
| 4096 | 128 | 118784 | 3.638 | 1125.97 | 2.765 | 46.30 |
| 4096 | 128 | 122880 | 3.676 | 1114.25 | 2.771 | 46.19 |
| 4096 | 128 | 126976 | 3.732 | 1097.54 | 2.775 | 46.13 |
| 4096 | 128 | 131072 | 3.773 | 1085.58 | 2.798 | 45.75 |

As you mentioned, hybrid CPU+GPU inference likely won't see better performance, but I tried it anyway. I ran into some CUDA errors, so no results yet:

👈 Details
./build/bin/llama-sweep-bench \
    --model "$model"\
    --ctx-size 40960 \
    -ger \
    -sm graph \
    -ngl 999 \
    --n-cpu-moe 40 \
    -ts 48,48 \
    -ub 4096 -b 4096 \
    --threads 24 \
    --no-mmap \
    --warmup-batch \
    -n 64

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda.cu:3896
  cudaStreamSynchronize(cuda_ctx->stream())
/home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda.cu:132: CUDA error
./build/bin/llama-sweep-bench \
    --model "$model"\
    --ctx-size 40960 \
    -ger \
    -sm graph \
    -ngl 999 \
    -ts 48,48 \
    --n-cpu-moe 36 \
    --threads 24 \
    --no-mmap \
    --warmup-batch \
    -n 64

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
CUDA error: an illegal memory access was encountered
  current device: 0, in function launch_mul_mat_q_id at /home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq_id_common.cuh:3976
  cudaFuncSetAttribute((mul_mat_q_id<type, mmq_x, false>), cudaFuncAttributeMaxDynamicSharedMemorySize, nbytes_shared)
/home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda.cu:132: CUDA error

@vikcious

vikcious commented Mar 2, 2026

Here are results with 8x 3090 and ubergarm/Qwen3.5-122B-A10B IQ4_KSS, 61.219 GiB (4.306 BPW):

main: n_kv_max = 135168, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 999, n_threads = 1, n_threads_batch = 1

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  4096 |    128 |      0 |    2.808 |  1458.69 |    2.523 |    50.73 |
|  4096 |    128 |   4096 |    2.915 |  1405.08 |    2.602 |    49.19 |
|  4096 |    128 |   8192 |    2.920 |  1402.78 |    2.543 |    50.33 |
|  4096 |    128 |  12288 |    2.987 |  1371.18 |    2.558 |    50.05 |
|  4096 |    128 |  16384 |    3.033 |  1350.38 |    2.555 |    50.10 |
|  4096 |    128 |  20480 |    3.124 |  1311.28 |    2.529 |    50.62 |
|  4096 |    128 |  24576 |    3.253 |  1259.24 |    2.571 |    49.80 |
|  4096 |    128 |  28672 |    3.351 |  1222.19 |    2.596 |    49.31 |
|  4096 |    128 |  32768 |    3.424 |  1196.40 |    2.656 |    48.18 |
|  4096 |    128 |  36864 |    3.562 |  1149.90 |    2.659 |    48.14 |
|  4096 |    128 |  40960 |    3.624 |  1130.30 |    2.641 |    48.46 |
|  4096 |    128 |  45056 |    3.739 |  1095.35 |    2.662 |    48.08 |
|  4096 |    128 |  49152 |    3.797 |  1078.73 |    2.701 |    47.39 |
|  4096 |    128 |  53248 |    3.936 |  1040.57 |    2.709 |    47.25 |
|  4096 |    128 |  57344 |    4.027 |  1017.01 |    2.725 |    46.98 |
|  4096 |    128 |  61440 |    4.097 |   999.70 |    2.709 |    47.25 |
|  4096 |    128 |  65536 |    4.168 |   982.67 |    2.726 |    46.95 |
|  4096 |    128 |  69632 |    4.322 |   947.66 |    2.753 |    46.50 |
|  4096 |    128 |  73728 |    4.384 |   934.41 |    2.830 |    45.23 |
|  4096 |    128 |  77824 |    4.505 |   909.28 |    2.798 |    45.75 |
|  4096 |    128 |  81920 |    4.581 |   894.08 |    2.834 |    45.17 |
|  4096 |    128 |  86016 |    4.730 |   865.93 |    2.885 |    44.37 |
|  4096 |    128 |  90112 |    4.797 |   853.86 |    2.858 |    44.78 |
|  4096 |    128 |  94208 |    4.869 |   841.29 |    2.860 |    44.75 |
|  4096 |    128 |  98304 |    4.975 |   823.34 |    2.879 |    44.45 |
|  4096 |    128 | 102400 |    5.076 |   806.98 |    2.899 |    44.16 |
|  4096 |    128 | 106496 |    5.195 |   788.42 |    3.012 |    42.49 |
|  4096 |    128 | 110592 |    5.347 |   766.03 |    2.983 |    42.91 |
|  4096 |    128 | 114688 |    5.379 |   761.46 |    2.939 |    43.56 |
|  4096 |    128 | 118784 |    5.493 |   745.73 |    3.041 |    42.08 |
|  4096 |    128 | 122880 |    5.654 |   724.44 |    2.996 |    42.72 |
|  4096 |    128 | 126976 |    5.652 |   724.72 |    3.018 |    42.42 |
|  4096 |    128 | 131072 |    5.823 |   703.47 |    3.065 |    41.77 |

@vikcious

vikcious commented Mar 2, 2026

Same setup, but this time adding --max-gpu 2 improves it further:

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  4096 |    128 |      0 |    2.305 |  1777.30 |    2.243 |    57.06 |
|  4096 |    128 |   4096 |    2.407 |  1701.96 |    2.234 |    57.30 |
|  4096 |    128 |   8192 |    2.443 |  1676.65 |    2.265 |    56.52 |
|  4096 |    128 |  12288 |    2.528 |  1620.05 |    2.280 |    56.14 |
|  4096 |    128 |  16384 |    2.690 |  1522.47 |    2.262 |    56.58 |
|  4096 |    128 |  20480 |    2.795 |  1465.50 |    2.311 |    55.39 |
|  4096 |    128 |  24576 |    2.887 |  1418.91 |    2.359 |    54.26 |
|  4096 |    128 |  28672 |    3.034 |  1350.19 |    2.363 |    54.18 |
|  4096 |    128 |  32768 |    3.155 |  1298.31 |    2.420 |    52.89 |
|  4096 |    128 |  36864 |    3.253 |  1259.09 |    2.434 |    52.59 |
|  4096 |    128 |  40960 |    3.326 |  1231.55 |    2.388 |    53.60 |
|  4096 |    128 |  45056 |    3.473 |  1179.29 |    2.424 |    52.81 |
|  4096 |    128 |  49152 |    3.803 |  1077.09 |    2.443 |    52.39 |
|  4096 |    128 |  53248 |    4.054 |  1010.45 |    3.041 |    42.09 |
|  4096 |    128 |  57344 |    3.821 |  1071.99 |    2.701 |    47.38 |
|  4096 |    128 |  61440 |    3.954 |  1036.03 |    2.684 |    47.69 |
|  4096 |    128 |  65536 |    4.082 |  1003.42 |    2.739 |    46.74 |
|  4096 |    128 |  69632 |    4.167 |   983.05 |    2.850 |    44.91 |
|  4096 |    128 |  73728 |    4.216 |   971.48 |    2.732 |    46.85 |
|  4096 |    128 |  77824 |    4.318 |   948.61 |    2.688 |    47.62 |
|  4096 |    128 |  81920 |    4.414 |   927.87 |    2.689 |    47.61 |
|  4096 |    128 |  86016 |    4.542 |   901.82 |    2.705 |    47.32 |
|  4096 |    128 |  90112 |    4.634 |   883.98 |    2.745 |    46.63 |
|  4096 |    128 |  94208 |    4.779 |   857.03 |    2.688 |    47.62 |
|  4096 |    128 |  98304 |    4.868 |   841.49 |    2.742 |    46.67 |
|  4096 |    128 | 102400 |    4.974 |   823.43 |    2.765 |    46.29 |
|  4096 |    128 | 106496 |    5.091 |   804.60 |    2.839 |    45.09 |
|  4096 |    128 | 110592 |    5.218 |   785.05 |    2.787 |    45.93 |
|  4096 |    128 | 114688 |    5.354 |   765.00 |    2.822 |    45.36 |
|  4096 |    128 | 118784 |    5.417 |   756.21 |    2.906 |    44.05 |
|  4096 |    128 | 122880 |    5.563 |   736.28 |    2.917 |    43.88 |
|  4096 |    128 | 126976 |    5.580 |   734.07 |    2.924 |    43.77 |
|  4096 |    128 | 131072 |    5.785 |   708.02 |    2.901 |    44.12 |

@ubergarm
Contributor

ubergarm commented Mar 2, 2026

@vikcious

Thanks for sharing some results and using the --max-gpu 2 and presumably -sm graph features!

I tossed your data into my plot above. I wonder whether using only 3 or 4 of your 8 GPUs would actually be faster, given this model is just too large for exactly 2x3090s.

[Plot: sweep-bench-Qwen3.5-122B-A10B-vik-PR1347]
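The sweep-bench tables posted in this thread are easy to post-process for plots like the one above. A minimal sketch of a parser (the `parse_sweep_rows` helper is hypothetical, not part of ik_llama.cpp; it just splits the markdown pipe tables that llama-sweep-bench prints):

```python
# Parse llama-sweep-bench markdown-table output into (N_KV, S_PP, S_TG)
# tuples, suitable for plotting throughput against context depth.
def parse_sweep_rows(text):
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Data rows have 7 numeric cells; header and separator rows do not.
        if len(cells) == 7 and cells[0].isdigit():
            pp, tg, n_kv, t_pp, s_pp, t_tg, s_tg = cells
            rows.append((int(n_kv), float(s_pp), float(s_tg)))
    return rows

sample = """
|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  4096 |    128 |      0 |    2.305 |  1777.30 |    2.243 |    57.06 |
"""
print(parse_sweep_rows(sample))  # → [(0, 1777.3, 57.06)]
```

From there, feeding the tuples to any plotting library gives the PP/TG-vs-depth curves shown in the comparison images.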

@magikRUKKOLA

magikRUKKOLA commented Mar 4, 2026

@ikawrakow

My guess is that graph parallel will do nothing for hybrid inference.

Moreover, it outputs gibberish (for the CPU/GPU config with --cpu-moe). Should I create an issue?
The LLM in question is Qwen3.5-397B-A17B-IQ4_KSS.
