Performance regression with CUDA after commit 9c67c277 #7254

@rgerganov

Description

I am observing a performance regression with the CUDA backend after commit 9c67c27.
I used to get 48 t/s with TinyLlama 1.1B before this commit:

$ bin/main -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 -ngl 99
...
llama_print_timings:        eval time =    1291,42 ms /    63 runs   (   20,50 ms per token,    48,78 tokens per second)
...

After commit 9c67c27 I am getting about 36 t/s without flash attention (which is the default):

$ bin/main -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 -ngl 99
...
llama_print_timings:        eval time =    1742,03 ms /    63 runs   (   27,65 ms per token,    36,16 tokens per second)
...
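For a cleaner before/after comparison, llama-bench can measure both configurations in one run (a minimal sketch, assuming the comma-separated -fa 0,1 sweep that llama-bench accepts for toggling flash attention):

$ bin/llama-bench -m ../models/tinyllama-1b/ggml-model-f16.gguf -n 64 -ngl 99 -fa 0,1

Running this once on 9c67c27 and once on its parent commit should reproduce the eval-time delta independent of sampling.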

With FA enabled I am getting 58 t/s, which is great, but we shouldn't have this regression with FA disabled.
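For reference, the FA number above comes from the same command with flash attention switched on (assuming the -fa flag of main):

$ bin/main -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 -ngl 99 -fa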
