Have some sweep-bench results:

- `-sm layer` (details collapsed)
- `-sm graph` (details collapsed)
Even though, as you mentioned, it likely won't perform better for hybrid CPU+GPU, I tried it anyway, but got some errors and no results yet:

```
./build/bin/llama-sweep-bench \
    --model "$model" \
    --ctx-size 40960 \
    -ger \
    -sm graph \
    -ngl 999 \
    --n-cpu-moe 40 \
    -ts 48,48 \
    -ub 4096 -b 4096 \
    --threads 24 \
    --no-mmap \
    --warmup-batch \
    -n 64
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|

```
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda.cu:3896
  cudaStreamSynchronize(cuda_ctx->stream())
/home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda.cu:132: CUDA error
```
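Not a fix, but a sketch of how one might narrow down where the illegal access originates (assuming an NVIDIA CUDA toolkit install with `compute-sanitizer` on the PATH; the binary and flags below just mirror the failing run):

```shell
# Make kernel launches synchronous, so the error is reported at the
# failing kernel's call site instead of at the next stream sync.
export CUDA_LAUNCH_BLOCKING=1

# memcheck (the default compute-sanitizer tool) reports the exact
# out-of-bounds access; expect the run to be much slower.
compute-sanitizer --tool memcheck ./build/bin/llama-sweep-bench \
    --model "$model" --ctx-size 40960 -sm graph -ngl 999
```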
Here with 8x 3090 and your ubergarm/Qwen3.5-122B-A10B IQ4_KSS 61.219 GiB (4.306 BPW):

```
main: n_kv_max = 135168, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 999, n_threads = 1, n_threads_batch = 1
```
Same, but this time with `--max-gpu 2`, which improves it further:

Moreover, it outputs gibberish (for the CPU/GPU config with …


As with graph parallel for Qwen3-Next and the dense Qwen-3.5 models, recurrent attention layers are not parallelized over GPUs. My guess is that graph parallel will do nothing for hybrid inference. But for Qwen-3.5-197B-A17B-IQ2_KL fully offloaded on an 8x3090 system, I do observe a small benefit from graph parallel (a.k.a. split mode graph). This model has only 2 KV attention heads, so using more than 2 GPUs at a time only slows things down. Here are some sweep-bench results:

- Split mode graph (`-sm graph --max-gpu 2`) (details collapsed)
- Split mode layer (details collapsed)
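The 2-KV-head limit can be illustrated with a toy head-partitioning model (my own sketch, not ik_llama.cpp code): if attention work is split head-wise, each GPU can only receive whole KV heads, so any GPU beyond the head count gets no attention work and contributes only synchronization overhead.

```shell
# heads_per_gpu N_KV_HEADS N_GPUS -> prints whole KV heads per GPU.
# Toy model only: assumes head-wise splitting with no finer-grained work.
heads_per_gpu() {
  local heads=$1 gpus=$2 g base rem out=""
  base=$((heads / gpus))
  rem=$((heads % gpus))
  for ((g = 0; g < gpus; g++)); do
    out+="$((base + (g < rem ? 1 : 0))) "
  done
  echo "${out% }"
}

heads_per_gpu 2 2   # -> 1 1       (both GPUs busy)
heads_per_gpu 2 4   # -> 1 1 0 0   (two GPUs idle during attention)
```

With 2 KV heads, GPUs 3 and 4 receive zero heads, which matches the observation that `--max-gpu 2` is the sweet spot for this model.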