I am observing a performance regression with the CUDA backend after commit 9c67c27.
I used to generate 48 t/s with TinyLlama 1.1B before this commit:
$ bin/main -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 -ngl 99
...
llama_print_timings: eval time = 1291,42 ms / 63 runs ( 20,50 ms per token, 48,78 tokens per second)
...

After commit 9c67c27 I am getting about 36 t/s without flash attention (which is the default):
$ bin/main -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 -ngl 99
...
llama_print_timings: eval time = 1742,03 ms / 63 runs ( 27,65 ms per token, 36,16 tokens per second)
...

With FA enabled I am getting 58 t/s, which is great, but we shouldn't have this regression with FA disabled.
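For reference, a small Python sketch (not part of the repro, just arithmetic on the timings above) showing how the t/s figures follow from the `llama_print_timings` eval lines and the size of the regression:

```python
def tokens_per_second(eval_time_ms: float, runs: int) -> float:
    """Throughput implied by the llama_print_timings eval line."""
    return runs / (eval_time_ms / 1000.0)

before = tokens_per_second(1291.42, 63)  # ~48.78 t/s before 9c67c27
after = tokens_per_second(1742.03, 63)   # ~36.16 t/s after 9c67c27, FA disabled

print(f"before: {before:.2f} t/s, after: {after:.2f} t/s")
print(f"slowdown: {100.0 * (1.0 - after / before):.1f}%")  # roughly 26%
```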