Add llama.cpp GPU offload option #2060
Conversation
Also, this should probably be tested on GPU-less llama builds (I can do this tonight), and a note about installing the GPU version of llama.cpp should be added: https://pypi.org/project/llama-cpp-python/
Come to think of it, maybe this should be fully automated? Like if an NVIDIA GPU and the build requirements are present, build llama.cpp with cuBLAS; otherwise build it with CLBlast if possible; otherwise default to the regular pip binary.
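A minimal sketch of what that detection could look like, assuming `nvidia-smi` is a good enough proxy for a usable NVIDIA GPU and `cmake` for the build requirements (the `LLAMA_CUBLAS`/`LLAMA_CLBLAST` CMake flags are the ones llama.cpp exposes; the extra pip options just force a rebuild past the cache):

```bash
# Hedged sketch: pick a llama-cpp-python build based on what is available.
if command -v nvidia-smi >/dev/null 2>&1 && command -v cmake >/dev/null 2>&1; then
    # NVIDIA GPU + build tools present: build with cuBLAS
    CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 \
        pip install llama-cpp-python --force-reinstall --no-cache-dir
elif command -v clinfo >/dev/null 2>&1 && command -v cmake >/dev/null 2>&1; then
    # OpenCL available: build with CLBlast
    CMAKE_ARGS="-DLLAMA_CLBLAST=on" FORCE_CMAKE=1 \
        pip install llama-cpp-python --force-reinstall --no-cache-dir
else
    # Fall back to the prebuilt CPU-only wheel
    pip install llama-cpp-python
fi
```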
How do I do that on Windows? I have updated to 0.1.50 and everything is working fine, but it doesn't seem to have cuBLAS enabled. I tried using `CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python`, but I am not 100% sure how to use this command in PowerShell. Do you know? :)
Try forcing a reinstall with pip's cache disabled; llama-cpp-python won't rebuild if it's in pip's cache.
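For the PowerShell question above, a hedged sketch: PowerShell doesn't accept the `VAR=value command` prefix syntax, so the variables have to be set as environment variables first. The CMake flags are the ones quoted above; `--force-reinstall --no-cache-dir` are standard pip options used here to make sure the wheel is actually rebuilt.

```powershell
# Set the build flags for the current PowerShell session
$env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
$env:FORCE_CMAKE = "1"

# Reinstall, bypassing pip's cache so the package is rebuilt with cuBLAS
pip install llama-cpp-python --force-reinstall --no-cache-dir
```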
mayaeary left a comment:
A few suggestions.
Co-authored-by: Maya <[email protected]>
Now it's working! Perfect. Omg. This is brilliant!
This is miracle-tier.
It works for me. Not as fast as the old CUDA branch of GPTQ-for-LLaMa yet, but several times faster than CPU-only. Documentation on the additional installation steps that are required:
For anyone on the One-Click installer on Windows: I had to open a Visual Studio developer command prompt with the build tools installed, type the commands to set those environment variables for the session, and then I was able to run the install successfully.
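For anyone else going that route, a hedged sketch of what those session variables might look like in a Visual Studio developer command prompt (cmd syntax; the flag values are the ones from the earlier comments, and the extra pip options simply force a rebuild past the cache):

```bat
rem Set the build flags for this command prompt session
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1

rem Rebuild llama-cpp-python with cuBLAS support
pip install llama-cpp-python --force-reinstall --no-cache-dir
```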
llama.cpp is my way to run 13B and larger models on my 8GB GPU, but there is a big difference in speed between running within text-generation-webui and running llama.cpp natively. I believe there is something wrong with the speed when using the llama-cpp-python API; I don't know if it's something in the API itself or in this implementation, but in some cases the performance can be less than half of the original. See abetlen/llama-cpp-python#181
I'm sorry, but I still didn't understand. I tried this command both in the command prompt and in Python, and it gives an error with "CMAKE_ARGS" as unrecognized. Where or how is the command `CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python` supposed to be run?
I also tried this solution, but I can't find a micromamba-cmd.bat anywhere in the folder. I tried everything; both llama-cpp-python and LLAMA_CUBLAS were successfully installed, yet I can't offload anything to the GPU: I pass the command-line option but the program seems to ignore it. Any help or hint would be appreciated, thank you.
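For reference, a hedged sketch of where that command is meant to go: it is a shell command, not Python, and it needs to run inside the same Python environment the webui uses (the environment name below is an assumption; one-click installs typically provide their own terminal/cmd script for this):

```bash
# Activate the environment the webui runs in (the name "textgen" is an assumption)
conda activate textgen

# Then run the install command from the earlier comments
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 \
    pip install llama-cpp-python --force-reinstall --no-cache-dir
```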
Does anyone have some performance benchmarks? I'm getting 5 tokens/s with q5_1 Vicuna 13B.
Did you ever find a solution to this? I am in the exact same position. |
You may be on Windows, in which case try the approach from this earlier comment:
requires #2058
Splits models between GPU and CPU, see: ggml-org/llama.cpp#1412
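As a usage illustration (hedged: the exact flag name and values are assumptions for illustration rather than quoted from this thread), offloading part of a GGML model to the GPU when launching the webui would look something like:

```bash
# Offload 32 of the model's layers to the GPU; the remaining layers stay on the CPU.
# Adjust the layer count to fit your VRAM.
python server.py --model <your-ggml-model> --n-gpu-layers 32
```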