[Enhancement]: Implement optimizations used in CTranslate2 #811

[CTranslate2](https://github.com/OpenNMT/CTranslate2) is a "competitor" to llama.cpp that advertises itself with:
> ### Fast and efficient execution on CPU and GPU
> The execution [is significantly faster and requires less resources](https://github.com/OpenNMT/CTranslate2#benchmarks) than general-purpose deep learning frameworks on supported models and tasks thanks to many advanced optimizations: layer fusion, padding removal, batch reordering, in-place operations, caching mechanism, etc.

I am no expert in LLMs and I don't know what these optimizations are, but I am asking: would it be feasible and desirable to implement them in llama.cpp or GGML?
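
For readers unfamiliar with the terms, here is a rough sketch of what "batch reordering" and "padding removal" are generally understood to aim at. It is based only on the usual meaning of those terms, not on CTranslate2's actual implementation. The helper names and the example sequence lengths are made up for illustration. The idea: when sequences in a batch are padded to the longest one, grouping sequences of similar length wastes fewer token slots.

```python
def make_batches(seqs, batch_size):
    """Split a list of sequences into consecutive batches."""
    return [seqs[i:i + batch_size] for i in range(0, len(seqs), batch_size)]


def padded_tokens(batches):
    """Token slots processed if each batch is padded to its longest sequence."""
    return sum(len(batch) * max(len(seq) for seq in batch) for batch in batches)


# Hypothetical token-id sequences of varying length (illustrative values only).
seqs = [[1] * n for n in (2, 9, 3, 10, 1, 8)]

# Batches in arrival order vs. batches formed after sorting by length.
naive = make_batches(seqs, batch_size=2)
reordered = make_batches(sorted(seqs, key=len), batch_size=2)

real_tokens = sum(len(s) for s in seqs)
print("real tokens:        ", real_tokens)
print("padded (naive):     ", padded_tokens(naive))
print("padded (reordered): ", padded_tokens(reordered))
```

In this toy case the reordered batches process fewer padded slots than the naive ones, and full padding removal would bring the count down to the number of real tokens. Whether a similar gain would apply to llama.cpp depends on how it batches work, which this sketch does not model.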

