Just migrated from Ollama (Windows) to llama.cpp (Ubuntu 24.04) and need help optimizing.

Hardware: i7-14700K + 192 GB RAM @ 5600 MT/s + RTX 5090 (via PCIe) + RTX 4090 (via OCuLink)
Model example: GLM-4.5-Air (UD-Q4_K_XL)

Prompt processing is suspiciously slow (~21 tokens/sec) while generation is decent (~27 tokens/sec), but I'm seeing reports of much faster prompt processing on weaker hardware.

llama.cpp build:
llama-swap config for GLM-4.5-Air:
log:
nvidia-smi output during prompt processing
nvidia-smi output during generation
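For reference, a rough sketch of the kind of llama-server invocation that would sit behind such a llama-swap entry; the model path, context size, tensor split, and micro-batch size below are placeholders, not the actual config from this post:

```bash
# Illustrative sketch only -- model path, context size, tensor split and
# micro-batch size are assumptions, not the configuration being discussed.
# The knobs that matter most for prompt-processing speed:
#   --main-gpu      index of the card used for the main/intermediate work (usually the faster one)
#   -ub             micro-batch size; larger values usually raise PP throughput
#   --tensor-split  how the layers are divided between the two cards
./llama-server \
  -m /models/GLM-4.5-Air-UD-Q4_K_XL.gguf \
  -ngl 99 \
  -c 32768 \
  --main-gpu 0 \
  --tensor-split 32,24 \
  -ub 2048 \
  --host 127.0.0.1 --port 8080
```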
For small batches, or slow PCIe connections,
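One way to check how much the batch size matters on a given setup is to sweep the micro-batch size in a prompt-processing-only run with llama-bench; the model path and values below are placeholders:

```bash
# Sketch only -- model path and sizes are assumptions; adjust to your setup.
# -p 4096 measures prompt processing over a 4096-token prompt, -n 0 skips the
# generation test, and -ub sweeps several micro-batch sizes so the resulting
# PP t/s column shows where throughput stops scaling.
./llama-bench \
  -m /models/GLM-4.5-Air-UD-Q4_K_XL.gguf \
  -ngl 99 \
  -p 4096 -n 0 \
  -ub 256,512,1024,2048
```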
What does `--main-gpu 1` do for you? Why does the RTX 4090 display as `Device 0` and not the RTX 5090? Could you place the RTX 5090 as device 0 using the `CUDA_VISIBLE_DEVICES` env variable to change the order, so that the RTX 5090 goes first?

I have less powerful hardware than you, and I can run GLM-4.5 Air IQ4 with a 3090 (PCIe 16x) + 1660 Super (PCIe 1x) with a PP of 278 t/s for 9279 tokens.

My llama-swap config:

If the GTX 1660 Super is Device 0, prompt processing takes forever. But if my RTX 3090 is Device 0, prompt processing is good.
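In case it helps, a sketch of forcing that ordering from the environment, assuming the RTX 5090 currently enumerates as the second device (check nvidia-smi for the real indices on that machine):

```bash
# Sketch only -- the 1,0 mapping assumes the RTX 5090 is currently the second
# CUDA device; verify the actual ordering with nvidia-smi first.
# CUDA_DEVICE_ORDER=PCI_BUS_ID makes CUDA enumerate GPUs in PCI bus order, and
# CUDA_VISIBLE_DEVICES remaps which physical card becomes Device 0 inside llama.cpp.
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1,0 \
  ./llama-server -m /models/GLM-4.5-Air-UD-Q4_K_XL.gguf -ngl 99 --main-gpu 0
```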