Token latency decreases from 6 s/token to 130 ms/token with -ngl set to 1 on Mac M1 Metal GPU #10638
- Solved: the magic is the shared memory managed by Metal.
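  In other words, on Apple Silicon the CPU and GPU share a single pool of physical memory, so the Metal backend can access the model weights without copying them to a separate VRAM. A quick way to see this on your own machine, assuming the `ggml_metal_init` log lines exist in the same form at your commit, is something like:

  ```sh
  # Hypothetical invocation; the binary name and the exact log wording can
  # differ between llama.cpp commits. The Metal backend prints its device
  # properties (including whether memory is unified) during initialization.
  ./llama-cli -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 1 -p "hi" -n 8 2>&1 \
    | grep -i -e "hasUnifiedMemory" -e "recommendedMaxWorkingSetSize"
  ```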
-
Dear all,
I am facing a problem that I don't understand; thank you in advance for your kind help.
My laptop is a Mac M1 with 8 GB of physical memory, 4 E-cores, and 4 P-cores; the llama.cpp commit I used is 6374743. I am running Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf on llama.cpp, compiled with Metal GPU support.
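For reference, the runs look roughly like this (a hypothetical reconstruction; the binary name and the prompt/generation flags may differ depending on the commit):

```sh
# CPU-only baseline, no layers offloaded (~6 s/token for me):
./llama-cli -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 0 -p "Hello" -n 64

# Offload just one layer to the Metal GPU (~130 ms/token):
./llama-cli -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 1 -p "Hello" -n 64
```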
It really seems like magic that offloading only 1 layer from the CPU to the GPU can cause a 98% decrease in token latency on the Metal GPU.
However, things go differently on a CUDA GPU: there, everything seems reasonable. But on the Mac M1 with the Metal GPU, setting -ngl to 1 is like pressing a "SPEED MODE" button, making the CPUs run at extremely high performance.
I use `powermetrics --samplers cpu_power -i 1000` to monitor CPU usage. The P-cores show a very high load, so it seems the P-cores are being used by default.
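Note that powermetrics must run as root, so the full monitoring command is:

```sh
# Sample CPU power/frequency once per second while llama.cpp runs in
# another terminal; requires superuser privileges.
sudo powermetrics --samplers cpu_power -i 1000
```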
I am now very confused about what this "SPEED MODE" button is: why does setting -ngl to 1 speed up inference from 6 s/token to 130 ms/token?
Thank you!