Conversation

ggerganov (Member) commented on Apr 8, 2024

ref https://twitter.com/awnihannun/status/1777072588633882741

This branch starts from the flash-attention branch (#5021, #6508).

To perform a benchmark for the challenge, run:

# generate pure 4-bit model
./quantize --pure models/mistral-7b/ggml-model-f16.gguf models/mistral-7b/ggml-model-q4_0-pure.gguf q4_0

make -j llama-bench
./llama-bench -m ./models/mistral-7b/ggml-model-q4_0-pure.gguf -p 0 -t 4 -n 128 -r 10 -fa 1
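For convenience, here is a small, hypothetical Python helper (not part of this PR) that runs the llama-bench command above and pulls the tg t/s figure out of the default markdown table output; the binary and model paths are assumed to match the commands shown.

```python
# Hypothetical helper: run the llama-bench command from this PR description
# and extract the text-generation t/s value from its markdown table output.
import subprocess

cmd = [
    "./llama-bench",
    "-m", "./models/mistral-7b/ggml-model-q4_0-pure.gguf",
    "-p", "0", "-t", "4", "-n", "128", "-r", "10", "-fa", "1",
]
out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

for line in out.splitlines():
    # Data rows look like:
    # | llama 7B Q4_0 | 3.79 GiB | ... | tg 128 | 102.29 ± 0.07 |
    if line.startswith("|") and "tg" in line:
        cols = [c.strip() for c in line.strip("|").split("|")]
        print("test:", cols[-2], "t/s:", cols[-1])
```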

Current numbers on M2 Ultra:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.79 GiB | 7.24 B | Metal | 99 | 4 | tg 128 | 102.29 ± 0.07 |

build: 22df85f (2707)

Base automatically changed from gg/flash-attn-vec to gg/flash-attn April 18, 2024 11:33
@ggerganov ggerganov force-pushed the gg/flash-attn branch 4 times, most recently from 82b282c to ce281b9, on April 24, 2024
@mofosyne mofosyne added the "Review Complexity : High" (generally requires in-depth knowledge of LLMs or GPUs) and "performance" (speed-related topics) labels May 10, 2024
@ggerganov ggerganov changed the base branch from gg/flash-attn to master May 13, 2024 07:40
ggerganov (Member, Author) commented:
We don't support a group size of 64 atm (which is what I think MLX uses), so we can't make an apples-to-apples comparison with MLX.
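For readers unfamiliar with the term, below is a minimal NumPy sketch (not llama.cpp or MLX code) of symmetric 4-bit group-wise quantization. The scheme and error metric are illustrative only, but it shows why the group size (32 for Q4_0 blocks vs. a larger group such as 64) affects both bits per weight and reconstruction error, which makes cross-framework numbers hard to compare directly.

```python
# Illustrative sketch of 4-bit group-wise quantization with a configurable
# group size. Each group of `group_size` weights shares one scale, so a
# smaller group spends more bits on scales per weight than a larger one.
import numpy as np

def quantize_q4_groupwise(w: np.ndarray, group_size: int = 32):
    groups = w.reshape(-1, group_size)
    # Simplified symmetric scheme: one scale per group, quants clipped to
    # [-8, 7] (real Q4_0 stores them with an offset of 8 as unsigned nibbles).
    amax = np.abs(groups).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / 7.0, 1.0)
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
for gs in (32, 64):
    q, s = quantize_q4_groupwise(w, gs)
    err = np.abs(dequantize(q, s) - w).mean()
    bits = 4 + 16 / gs  # 4 bits per weight + one fp16 scale per group
    print(f"group size {gs}: mean abs err {err:.4f}, {bits:.2f} bits/weight")
```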

@ggerganov ggerganov closed this Nov 17, 2024