Token latency decreases from 6 s/token to 130 ms/token with -ngl set to 1 on Mac M1 Metal GPU #10638
- Solved: the magic is the shared memory managed by Metal.
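  In other words, on Apple Silicon the CPU and GPU share a single pool of physical memory, so the Metal backend can access the model weights without copying them to a separate VRAM. A quick way to see this on your own machine, assuming the `ggml_metal_init` log lines exist in the same form at your commit, is something like:

  ```sh
  # Hypothetical invocation; the binary name and the exact log wording can
  # differ between llama.cpp commits. The Metal backend prints its device
  # properties (including whether memory is unified) during initialization.
  ./llama-cli -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 1 -p "hi" -n 8 2>&1 \
    | grep -i -e "hasUnifiedMemory" -e "recommendedMaxWorkingSetSize"
  ```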
-
Dear all,
I am facing a problem that I don't understand; thank you in advance for your kind help.
My laptop is a Mac M1 with 8 GB of physical memory, 4 E-cores, and 4 P-cores; the llama.cpp commit I used is 6374743. I am running Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf on llama.cpp, compiled with Metal GPU support.
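For reference, the runs look roughly like this (a hypothetical reconstruction; the binary name and the prompt/generation flags may differ depending on the commit):

```sh
# CPU-only baseline, no layers offloaded (~6 s/token for me):
./llama-cli -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 0 -p "Hello" -n 64

# Offload just one layer to the Metal GPU (~130 ms/token):
./llama-cli -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 1 -p "Hello" -n 64
```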
It really seems like magic that offloading only 1 layer from the CPU to the GPU can cause a 98% decrease in token latency on the Metal GPU.
However, things go differently on a CUDA GPU: there, everything seems reasonable. But on the Mac M1 with the Metal GPU, setting -ngl to 1 is like pressing a "SPEED MODE" button, making the CPUs run at extremely high performance.
I use `powermetrics --samplers cpu_power -i 1000` to monitor CPU usage. The P-cores show a very high load, so it seems the P-cores are being used by default.
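Note that powermetrics must run as root, so the full monitoring command is:

```sh
# Sample CPU power/frequency once per second while llama.cpp runs in
# another terminal; requires superuser privileges.
sudo powermetrics --samplers cpu_power -i 1000
```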
I am now very confused about what this "SPEED MODE" button is: why does setting -ngl to 1 speed up inference from 6 s/token to 130 ms/token?
Thank you!