Bug: tg speed drop after https://github.com/ikawrakow/ik_llama.cpp/pull/518 #523

@ciprianveg

What happened?

TG (token generation) speed dropped from 5.5 t/s after #517 to 4.5 t/s after #518 on DeepSeek R1 IQ3_XXS UD. This is when I do not use -rtr. If I use -rtr, PP (prompt processing) speed drops from 250 t/s to 26 t/s, both before and after #518:
./build/bin/llama-sweep-bench \
--model /media/ciprian/m2/ai/models/Deepseek-R1-2805-Q3-XXS-UD/DeepSeek-R1-0528-UD-IQ3_XXS-00001-of-00006.gguf \
--alias DeepSeek-R1-0528-UD-IQ3_XXS \
--ctx-size 71680 \
-ctk q8_0 \
-mla 3 \
-fa \
-amb 256 \
-fmoe -rtr \
--temp 0.5 \
--top_p 0.95 \
--min_p 0.01 \
--n-gpu-layers 63 \
-ot "blk.[0-4].ffn_up_exps=CUDA0,blk.[0-4].ffn_gate_exps=CUDA0,blk.[0-2].ffn_down_exps=CUDA0" \
-ot "blk.1[0-2].ffn_up_exps=CUDA1,blk.1[0-2].ffn_gate_exps=CUDA1" \
-ot "blk.1[3-4].ffn_up_exps=CUDA2,blk.1[3-4].ffn_gate_exps=CUDA2" \
--override-tensor exps=CPU \
--parallel 1 \
--threads 16 \
--threads-batch 16 \
--host 0.0.0.0 --port 5002 \
--ubatch-size 7168 --batch-size 7168 --no-mmap
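
To put the numbers above in perspective, here is a quick sketch of the relative slowdowns. The helper name `pct_drop` is only for illustration and is not part of ik_llama.cpp; the inputs are the t/s figures quoted in this report.

```python
def pct_drop(before: float, after: float) -> float:
    """Relative slowdown in percent when throughput goes from `before` to `after` t/s."""
    return 100.0 * (before - after) / before


# TG without -rtr: 5.5 t/s after #517, 4.5 t/s after #518
print(f"TG drop: {pct_drop(5.5, 4.5):.1f}%")   # TG drop: 18.2%

# PP with -rtr: 250 t/s down to 26 t/s (seen both before and after #518)
print(f"PP drop: {pct_drop(250, 26):.1f}%")    # PP drop: 89.6%
```

So the TG regression is roughly 18%, while the -rtr PP slowdown is close to 10x and looks like a separate problem, since it shows up on both sides of #518.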

Name and Version

llama-server, Ubuntu, Threadripper PRO 3955WX, 256 GB DDR4, 3x RTX 3090

cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_CCACHE=OFF

What operating system are you seeing the problem on?

No response

Relevant log output
