What happened?
TG speed drops from 5.5 t/s (after #517) to 4.5 t/s after #518 on DeepSeek R1 0528 UD-IQ3_XXS. This is without -rtr. With -rtr, PP speed drops from ~250 t/s to ~26 t/s, both before and after #518. Full command below (a rough A/B reproduction sketch follows it):
```bash
./build/bin/llama-sweep-bench \
    --model /media/ciprian/m2/ai/models/Deepseek-R1-2805-Q3-XXS-UD/DeepSeek-R1-0528-UD-IQ3_XXS-00001-of-00006.gguf \
    --alias DeepSeek-R1-0528-UD-IQ3_XXS \
    --ctx-size 71680 \
    -ctk q8_0 \
    -mla 3 \
    -fa \
    -amb 256 \
    -fmoe -rtr \
    --temp 0.5 \
    --top_p 0.95 \
    --min_p 0.01 \
    --n-gpu-layers 63 \
    -ot "blk.[0-4].ffn_up_exps=CUDA0,blk.[0-4].ffn_gate_exps=CUDA0,blk.[0-2].ffn_down_exps=CUDA0" \
    -ot "blk.1[0-2].ffn_up_exps=CUDA1,blk.1[0-2].ffn_gate_exps=CUDA1" \
    -ot "blk.1[3-4].ffn_up_exps=CUDA2,blk.1[3-4].ffn_gate_exps=CUDA2" \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 16 \
    --threads-batch 16 \
    --host 0.0.0.0 --port 5002 \
    --ubatch-size 7168 --batch-size 7168 --no-mmap
```
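The A/B comparison I ran looks roughly like the sketch below. BEFORE/AFTER, run_sweep, and MODEL_GGUF are placeholders (not exact commit hashes or paths); the actual invocation is the full command above, and the build directory is assumed to already be configured with the cmake flags listed under "Name and Version".

```bash
#!/usr/bin/env bash
# Rough A/B sketch: rebuild at the commits before/after #518 and rerun the same bench.
# BEFORE/AFTER are placeholders for the merge commits of #517 and #518.
set -e

run_sweep() {
  # Placeholder: the full llama-sweep-bench invocation from above goes here,
  # with "$@" appended so per-run flags (e.g. -rtr) can be added.
  ./build/bin/llama-sweep-bench --model "$MODEL_GGUF" "$@"
}

for commit in "$BEFORE" "$AFTER"; do
  git checkout "$commit"
  # rebuild; assumes the build dir was configured with the flags under "Name and Version"
  cmake --build build --config Release -j"$(nproc)"
  run_sweep          # without -rtr: TG drops 5.5 -> 4.5 t/s across the two commits
  run_sweep -rtr     # with -rtr: PP is ~26 t/s (vs ~250 t/s without) at both commits
done
```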
Name and Version
llama-server on Ubuntu; Threadripper 3955WX, 256 GB DDR4, 3x RTX 3090
```bash
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 \
      -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_CCACHE=OFF
```
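The build itself is finished with the standard CMake build step; the Release config and job count shown here are assumed defaults, not copied from my exact shell history.

```bash
# After configuring with the flags above, compile; -j count is an assumed default.
cmake --build build --config Release -j"$(nproc)"
```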
What operating system are you seeing the problem on?
No response