I am observing a performance regression with the CUDA backend after commit 9c67c27.
I used to generate 48 t/s with TinyLlama 1.1B before this commit:
$ bin/main -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 -ngl 99
...
llama_print_timings: eval time = 1291,42 ms / 63 runs ( 20,50 ms per token, 48,78 tokens per second)
...

After commit 9c67c27 I am getting about 36 t/s without flash attention (which is the default):
$ bin/main -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 -ngl 99
...
llama_print_timings: eval time = 1742,03 ms / 63 runs ( 27,65 ms per token, 36,16 tokens per second)
...

With FA enabled I am getting 58 t/s, which is great, but we shouldn't have this regression with FA disabled.
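For reference, a small Python sketch (not part of the repro, just arithmetic on the timings above) showing how the t/s figures follow from the `llama_print_timings` eval lines and the size of the regression:

```python
def tokens_per_second(eval_time_ms: float, runs: int) -> float:
    """Throughput implied by the llama_print_timings eval line."""
    return runs / (eval_time_ms / 1000.0)

before = tokens_per_second(1291.42, 63)  # ~48.78 t/s before 9c67c27
after = tokens_per_second(1742.03, 63)   # ~36.16 t/s after 9c67c27, FA disabled

print(f"before: {before:.2f} t/s, after: {after:.2f} t/s")
print(f"slowdown: {100.0 * (1.0 - after / before):.1f}%")  # roughly 26%
```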